Data Specification

Data collection

The Dashboard stores and presents data from the public Facebook pages of 17 news outlets in Hong Kong from January 2019 to June 2021. The data include the text of posts, the text of comments, and URL links, as well as user engagement counts such as the numbers of likes, shares, and post reactions.

The Facebook public pages chosen to be included in the Dashboard had more than 300,000 followers at the time when the project began. As HKG was removed in January 2020 and HKG 2.0 was created soon afterwards, the Dashboard includes data from both HKG and HKG 2.0.

To protect user privacy, none of the data on the Dashboard will include any personal information of the users, and only the content from public pages will be shown. Moreover, the names of around 1,670 journalists, hosts, photographers, speakers, etc., have been removed.

Because of the way the data were collected and the constraints of Facebook's data policy, some data are missing. Known missing data are reported here.

The data structure is shown in the figure below:

Removal of Advertisements

An examination of the data revealed two types of advertisements. The first was promotional content appearing after the main text of a post, usually intended to attract subscriptions and interactions, such as "Remember to follow our Instagram" ("記得Follow埋我哋嘅IG") and "Please support us and subscribe now" ("請支持我們,立即訂閲"). The second was stand-alone advertisements, published mainly to announce product discounts or to promote a business, such as "Discount | kitchenware and tableware are up to 60% off" ("減價|厨具餐具低至4折").

For the first type of advertisements, the promotional content usually comes after apparent markers in a post, such as "===========" or "************", or specific terms such as "Subscribe to our Telegram" ("訂閱Telegram"). Our team defined 12 regular expressions to clean this type of advertisement.

The second type of advertisement appears in a scattered manner. Our team tried filtering these posts by keyword, but this mistakenly removed many posts that were not ads. During this experiment, we defined 251 keywords in total by reading the original posts individually, including phrases such as "Buy a gift in summer, ten times rewards" ("夏日送禮十激賞") and "Exclusive in Hong Kong" ("全港獨家"). However, because this type of advertisement was infrequent, we decided to leave these posts in place to preserve the completeness of the original content.

Finally, to clean the advertisement content more extensively, we used three additional search strings, "===", "---", and "***", as indicators of a footer; wherever one was found, we removed it together with all subsequent characters.
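The footer rule described above can be sketched in Python as follows. This is a minimal illustration rather than the project's actual code: the names FOOTER_MARKER and strip_footer are hypothetical, and the 12 regular expressions defined by the team are not reproduced here.

```python
import re

# Assumption: a footer starts at the first run of three or more
# '=', '-', or '*' characters, mirroring the "===", "---", "***" criteria.
FOOTER_MARKER = re.compile(r"={3,}|-{3,}|\*{3,}")


def strip_footer(text: str) -> str:
    """Return the text before the first footer marker, without trailing whitespace."""
    match = FOOTER_MARKER.search(text)
    if match is None:
        return text  # no marker found, leave the post unchanged
    return text[:match.start()].rstrip()


post = "Main story text.\n===========\n記得Follow埋我哋嘅IG"
print(strip_footer(post))  # prints "Main story text."
```

One caveat of this approach is that a marker appearing inside the main text would also truncate the post, so a rule like this is best checked against a sample of posts before being applied in bulk.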

Media Types

The Dashboard presents public Facebook page data of 17 news media outlets in Hong Kong, and the pages are categorized by their communication channels: "paid newspapers", "free newspapers", "electronic media", and "online news media". For post authors whose presence is redundant, the pages are categorized as "undefined". The four major media types are as follows:

Paid Newspapers: Apple Daily, TOPick News, on.cc/Oriental Daily News, Hong Kong Economic Times hket.com, Ming Pao
Free Newspapers: Sky Post, am730
Electronic Media: Cable News Hong Kong i-Cable News, Radio Television Hong Kong RTHK News, Now News - News
Online News Media: Bastille Post, HK01, Initium Media, HKGPao, Stand News, inmediahk.net
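
As an illustration of this categorization, the listing above can be held in a simple lookup table. The sketch below is hypothetical: the names MEDIA_TYPES, PAGE_TO_TYPE, and media_type are not from the Dashboard, the page names are copied from the list as written, and the fallback to "undefined" for any unlisted author is an assumption about how that category is assigned.

```python
# Media types and pages, copied from the list above.
MEDIA_TYPES = {
    "Paid Newspapers": [
        "Apple Daily",
        "TOPick News",
        "on.cc/Oriental Daily News",
        "Hong Kong Economic Times hket.com",
        "Ming Pao",
    ],
    "Free Newspapers": ["Sky Post", "am730"],
    "Electronic Media": [
        "Cable News Hong Kong i-Cable News",
        "Radio Television Hong Kong RTHK News",
        "Now News - News",
    ],
    "Online News Media": [
        "Bastille Post",
        "HK01",
        "Initium Media",
        "HKGPao",
        "Stand News",
        "inmediahk.net",
    ],
}

# Reverse lookup: page name -> media type.
PAGE_TO_TYPE = {
    page: media_type_name
    for media_type_name, pages in MEDIA_TYPES.items()
    for page in pages
}


def media_type(page_name: str) -> str:
    """Return the media type of a page, or "undefined" if it is not listed (assumed fallback)."""
    return PAGE_TO_TYPE.get(page_name, "undefined")


print(media_type("am730"))
print(media_type("Some Other Page"))
```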

Categories of "Mentions of Fake News"

Because this project is interested in how people use the term "fake news" rather than in determining whether a Facebook post presents true or false information, the Dashboard focuses on whether a user mentioned the term "fake news" when commenting on a news post. A list of keywords was generated to identify these mentions.

First, with the help of the Python library of synonyms, the 10 words with the highest similarity to "fake news" ("假新聞") and to "rumor" ("謠言") were identified. Words that did not match the meaning of those two terms were then removed. By combining computational and manual approaches, a rather comprehensive list of keywords was generated. Keywords were retrieved using regular expressions, and the frequency of each keyword was counted. This method yielded three categories of keywords, as follows:

Fake news: 假新聞, fake news, etc.
Disinformation: 不實, 惡意, 捏造, 騙局, 蓄意, 不負責任, 造謠, etc.
Rumors: 傳言, 流言, 傳聞, 謠傳, 假消息, etc.
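
The retrieval and counting step can be sketched as below. This is a minimal example under stated assumptions: it uses only the keywords listed above (the full list is longer), the names CATEGORY_KEYWORDS, build_pattern, and count_mentions are hypothetical, and case-insensitive matching for the English term is an assumption.

```python
import re
from collections import Counter

# Only the keywords shown above; the project's full list is longer.
CATEGORY_KEYWORDS = {
    "Fake news": ["假新聞", "fake news"],
    "Disinformation": ["不實", "惡意", "捏造", "騙局", "蓄意", "不負責任", "造謠"],
    "Rumors": ["傳言", "流言", "傳聞", "謠傳", "假消息"],
}


def build_pattern(keywords):
    """Compile one alternation pattern, trying longer keywords first."""
    ordered = sorted(keywords, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)


PATTERNS = {cat: build_pattern(kws) for cat, kws in CATEGORY_KEYWORDS.items()}


def count_mentions(comment: str) -> dict:
    """Return {category: {keyword: frequency}} for categories found in a comment."""
    result = {}
    for category, pattern in PATTERNS.items():
        hits = Counter(m.group(0).lower() for m in pattern.finditer(comment))
        if hits:
            result[category] = dict(hits)
    return result


example = "呢篇係假新聞,Fake News!唔好信啲流言。"
print(count_mentions(example))
```

For the example comment, the function reports one hit each for "假新聞" and "fake news" under "Fake news" and one hit for "流言" under "Rumors".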

Post Themes

The Dashboard assigns themes of interest to news posts according to whether a theme keyword appears in a post. Keywords were first generated by reading the news posts individually. They were then retrieved from the posts using regular expressions, and the frequency of each keyword was counted. Finally, news posts were categorized using the keywords below:

COVID-19: COVID-19 (新型冠状), coronavirus (冠狀病毒), pandemic (肺炎疫情), restriction order (限聚令), confirmed cases (確診), testing (檢測), quarantine (檢疫), etc.
National Security Law: National Security Law (國安法), the Hong Kong national security law (港區國安法)
Extradition Law Amendment Bill: Extradition law (引渡條例), Fugitive Offenders Ordinance (逃犯條例), emergency law (緊急法), five demands (五大訴求), Anti-Extradition Law Amendment Bill (反送中), etc.
Carrie Lam Cheng Yuet-ngor: Lam Cheng Yuet-ngor (林鄭月娥), Lam Cheng (林鄭), chief executive (行政長官), 777, etc.
Hong Kong Police Force: Hong Kong Police (香港警察), Hong Kong police officer (香港警員), Hong Kong police force (香港警方), etc.
U.S. Election: U.S. election (美國選舉), U.S. Presidential Election (美國大選), Trump (特朗普), Biden (拜登), etc.
Donald Trump: Donald Trump (特朗普), Trump, etc.
U.S. Police: U.S. police (美國警察), U.S. police officer (美國警員), etc.
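
Theme assignment along these lines can be sketched as follows. The example covers only a subset of the themes and keywords listed above, and the names THEME_KEYWORDS and tag_themes are hypothetical. It also assumes that a post may carry more than one theme, which seems implied by keywords such as "特朗普" appearing under two themes.

```python
import re

# A subset of the themes and keywords listed above, for illustration only.
THEME_KEYWORDS = {
    "COVID-19": ["新型冠状", "冠狀病毒", "肺炎疫情", "限聚令", "確診", "檢測", "檢疫"],
    "National Security Law": ["國安法", "港區國安法"],
    "Hong Kong Police Force": ["香港警察", "香港警員", "香港警方"],
    "Donald Trump": ["特朗普", "Trump"],
}

# One compiled pattern per theme; longer keywords are tried first so that
# "港區國安法" is counted once rather than also matching "國安法".
THEME_PATTERNS = {
    theme: re.compile(
        "|".join(re.escape(k) for k in sorted(kws, key=len, reverse=True)),
        re.IGNORECASE,
    )
    for theme, kws in THEME_KEYWORDS.items()
}


def tag_themes(post: str) -> dict:
    """Return {theme: number of keyword hits} for every theme found in a post."""
    tagged = {}
    for theme, pattern in THEME_PATTERNS.items():
        hit_count = len(pattern.findall(post))
        if hit_count:
            tagged[theme] = hit_count
    return tagged


post = "港區國安法生效,特朗普回應;今日新增確診個案"
print(tag_themes(post))
```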