Notes for the Chinese Corpus

Parts of speech in the Chinese corpus were first assigned by the Stanford Log-linear Part-Of-Speech Tagger and then carefully proofread. The following is a brief explanation of the part-of-speech tagging guidelines according to Fei (2000a, 2000b).


1. VA: Predicative adjective

VA roughly corresponds to adjectives in English. It includes predicates that take no object and can be modified by 很[very], as well as predicates derived from that type via reduplication or the pattern N + A (e.g. 雪白[snow white]).


2. VC: Copula

The words 是[be] and 为[be] are tagged as VC. 非 is also tagged as VC if it means 不[not] 是[be] and there is no other verb in the sentence.


3. VE: you3 as the main verb

Only 有[have], 没[not]{有[have]}, and 无[not have] are tagged as VE, and only when they are the main verb (including the possessive you3, existential you3, etc.).

4. VV: Other verbs

This includes the rest of the verbs.

5. NR: Proper noun

An NR is a name of a particular person, politically or geographically defined location, or organization.

6. NT: Temporal noun

Temporal nouns can be the objects of prepositions such as 在[at], 从[since], 到[until], or 等到[until]. They can be referred to by 这个时候[at this moment] and questioned by 什么时候[when].

7. NN: Other noun

NN includes all other nouns.

8. LC: Localizer

Localizers are attached to the preceding NP/S to denote direction, location and so on. 为止[until], 开始[starting from], 来[ever since], 以来[since], 起[since], and 在内[inside] are also tagged as LCs.

9. PN: Pronoun

Pronouns function as substitutes for noun phrases. They include personal pronouns (e.g. 我[I], 你[you]), demonstratives used alone as NPs (e.g. 这[this]), possessive pronouns, and reflexives.

10. DT: Determiner

This includes demonstratives (e.g., 这[this], 那[that], 该[the]) and words such as 每[every], 各[each], 前[the preceding], 后[the following].

11. CD: Cardinal number

This includes cardinal numbers (optionally followed by 概数词[approximate number indicators]) and words such as 好些[some], 若干[several], 半[half], 许多[many], 很多[many].

12. OD: Ordinal number

Ordinal numbers are tagged as ODs. 第 + CD is treated as one word and tagged as OD.

13. M: Measure word

Measure words include classifiers (e.g., 个), group measure words (e.g., 群 [group]), and words such as 公里 [kilometre] and 升 [litre].

14. AD: Adverb

This includes manner adverbs, frequency adverbs, degree adverbs, and conjunctive adverbs.

15. P: Preposition

A preposition can take a noun phrase or a clause as its argument.

16. CC: Coordinating conjunction

A coordinating conjunction (CC) conjoins two constituents with the same function.

17. CS: Subordinating conjunction

Words that join two clauses, one subordinate to the other, are tagged as subordinating conjunctions (CS).

18. DEC: de5 as a complementizer or a nominalizer

This only includes 的 and 之 when they function as a complementizer or a nominalizer (e.g., 吃[eat] 的/DEC).

19. DEG: de5 as a genitive marker and an associative marker

This only includes 的 and 之 when they function as a genitive marker or an associative marker.

20. DER: Resultative de5

de5 (得) is tagged as DER in the potential form V-得-R and in the V-de construction (他[he] 跑[run] 得/DER 很[very] 快[fast] <He runs very fast>).

21. DEV: Manner de5

This only includes 地 when it occurs in “XP 地 VP”, where XP modifies the VP. “的” is sometimes also used in this pattern.

22. AS: Aspect particle

Verbal particles that indicate aspect are tagged as aspect particles (AS). This only includes 了, 着, 过, and 的.

23. SP: Sentence final particle

SP often appears at the end of a sentence. For example, 他[he], 好[good] 吧[SP]?

24. ETC: 等 and 等等

This tag is used for the words 等 and 等等.

25. MSP: Other particle

This includes particles such as 所, 以, 来, and 而 when they appear before a VP.

26. IJ: Interjection

Interjections appear in sentence-initial position (e.g. 啊[Ah]).

27. ON: Onomatopoeia

ON is used for 象声词 [onomatopoeia], i.e. words that imitate sounds (e.g. 哗哗/ON).

28. LB: bei4 in long bei-construction

This only includes 被, 叫, 给, and wei2 (为) when they occur in the long bei-construction (e.g. 他[he] 被/LB 我[I] 训[scold] 了/AS 一[one] 顿/M).

29. SB: bei4 in short bei-construction

This only includes 被 and 给 when they occur in the short bei-construction (e.g. 他[he] 被/SB 训[scold] 了/AS 一[one] 顿/M).

30. BA: 把 in ba-construction

This only includes 把 and 将 when they occur in the ba-construction (e.g. 他[he] 把/BA 你[you] 骗[cheat] 了/AS).

31. JJ: Other noun-modifier

JJ includes two types of adjectives: (a) “区别词”, which modify nouns in the pattern JJ + 的 + {N} or JJ + N; (b) “hyphenated compounds”, two-syllable shortened forms of relative clauses (e.g. 留美[having studied in the US]/JJ scholar/NN).

32. FW: Foreign word

FW is used to tag foreign words.

33. PU: Punctuation

Punctuation marks are tagged as PU.
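
The examples above write a tagged token as word/TAG (e.g. 被/LB). As a rough illustration of how text in that notation could be split into (word, tag) pairs, here is a minimal Python sketch. The slash separator is taken from the examples in these notes and may differ from the tagger's actual output format; the function name parse_tagged is hypothetical, and the tags on the words that the notes leave untagged (他, 训, 一) are filled in for illustration only.

```python
def parse_tagged(sentence, sep="/"):
    """Split a whitespace-delimited 'word/TAG' string into (word, tag) pairs.

    Assumption: tokens look like the examples in these notes (e.g. 被/LB).
    Tokens without a separator are returned with tag None.
    """
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator keeps everything before the final one.
        word, found, tag = token.rpartition(sep)
        if found:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


# Short bei-construction example from section 29, with illustrative tags
# added to the untagged words.
tagged = "他/PN 被/SB 训/VV 了/AS 一/CD 顿/M"
for word, tag in parse_tagged(tagged):
    print(word, tag)
```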


For detailed information, please refer to the following references on the software and the guidelines.


References

The Stanford Natural Language Processing Group (2015). Stanford Word Segmenter (Version 3.6.0) [Software]. Available from http://nlp.stanford.edu/software/segmenter.html


The Stanford Natural Language Processing Group (2015). Stanford Log-linear Part-Of-Speech Tagger (Version 3.6.0) [Software]. Available from http://nlp.stanford.edu/software/tagger.html


Fei, Xia. (2000a). The segmentation guidelines for the Penn Chinese Treebank (3.0). IRCS Technical Reports Series. Retrieved from: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1038&context=ircs_reports


Fei, Xia. (2000b). The part-of-speech tagging guidelines for the Penn Chinese Treebank (3.0). IRCS Technical Reports Series. Retrieved from: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports


Summary of the Treebank Part-of-Speech Tagset

Table 1: POS tagset in alphabetical order, from Fei (2000b, p. 37)

Tag   Description                      Examples
AD    adverb
AS    aspect marker
BA    把 in ba-construction            把, 将
CC    coordinating conjunction
CD    cardinal number                  一百
CS    subordinating conjunction        虽然
DEC   的 in a relative clause
DEG   associative 的
DER   得 in V-de const. and V-de-R
DEV   地 before VP
DT    determiner
ETC   for words 等, 等等               等, 等等
FW    foreign words                    ISO
IJ    interjection
JJ    other noun-modifier              男, 共同
LB    被 in long bei-const.            被, 给
LC    localizer
M     measure word
MSP   other particle
NN    common noun
NR    proper noun                      美国
NT    temporal noun                    今天
OD    ordinal number                   第一
ON    onomatopoeia                     哈哈, 哗哗
P     preposition excl. 被 and 把
PN    pronoun
PU    punctuation                      、 ? 。
SB    被 in short bei-const.           被, 给
SP    sentence-final particle
VA    predicative adjective
VC    copula
VE    有 as the main verb
VV    other verb
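
When working with downloaded results, it can help to have the tagset available as a lookup table. The following Python sketch encodes Table 1 as a dictionary. It is only an illustration: the names TAGSET, describe, and unknown_tags are hypothetical, the entry for VC is taken from section 2 because the table gives no description for it, and the 地 row is keyed as DEV following section 21.

```python
# Descriptions follow Table 1. Two entries are assumptions made here:
# "VC" uses the wording of section 2, and "DEV" is the key for the 地 row.
TAGSET = {
    "AD": "adverb",
    "AS": "aspect marker",
    "BA": "把 in ba-construction",
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "CS": "subordinating conjunction",
    "DEC": "的 in a relative clause",
    "DEG": "associative 的",
    "DER": "得 in V-de construction and V-de-R",
    "DEV": "地 before VP",
    "DT": "determiner",
    "ETC": "for words 等, 等等",
    "FW": "foreign words",
    "IJ": "interjection",
    "JJ": "other noun-modifier",
    "LB": "被 in long bei-construction",
    "LC": "localizer",
    "M": "measure word",
    "MSP": "other particle",
    "NN": "common noun",
    "NR": "proper noun",
    "NT": "temporal noun",
    "OD": "ordinal number",
    "ON": "onomatopoeia",
    "P": "preposition excl. 被 and 把",
    "PN": "pronoun",
    "PU": "punctuation",
    "SB": "被 in short bei-construction",
    "SP": "sentence-final particle",
    "VA": "predicative adjective",
    "VC": "copula",
    "VE": "有 as the main verb",
    "VV": "other verb",
}


def describe(tag):
    """Return the description of a tag, or a fallback string if it is unknown."""
    return TAGSET.get(tag, "unknown tag")


def unknown_tags(pairs):
    """Return the set of tags in (word, tag) pairs that are not in TAGSET."""
    return {tag for _, tag in pairs if tag not in TAGSET}


print(describe("DER"))
print(unknown_tags([("他", "PN"), ("跑", "XX")]))
```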


User Manual for the English Corpus


This manual explains how to search the corpus. Users are free to use only those functions among the following steps that suit their research needs.


Keyword Search

1. Enter the word or phrase you want to search for in the corpus, e.g. “freedom”.



2. Click the Source you need, either Hong Kong or United States, then check the relevant boxes to specify the sources you want to search.



3. Click the Speaker you need.



4. Choose the time range either by selecting the boxes or by dragging the green button to limit the retrieved data to a particular timeframe.



5. Press the Search button to get the result page.



6. View your search results: No. of Occurrence gives the count in the relevant corpora, and Graphical Representation shows the corresponding percentage of all search results.



7. You can sort the Keyword in Context by Year/Region/Type/Speaker.



8. You can also sort your Keyword in Context by L1-L5 and R1-R5 (e.g. L1 is the first word to the left of the keyword, and R1 is the first word to its right).



9. Click on any entry of interest from step 8 to view the Keyword in Expanded Context.



10. Click the Download “Keyword in Context” button to export your data to an Excel document.
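
For readers who want to see what the Keyword in Context view and the L1-L5 / R1-R5 sorting in steps 7 and 8 amount to, here is a minimal Python sketch. It is not the website's implementation: the function names kwic and sort_key, the whitespace tokenisation, and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, keyword, window=5):
    """Return one (left, keyword, right) row per occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append((left, tok, right))
    return rows


def sort_key(position):
    """Build a sort key for a slot such as 'L1' or 'R3'."""
    side, n = position[0].upper(), int(position[1:])

    def key(row):
        left, _, right = row
        if side == "L":
            # L1 is the word immediately to the left, i.e. the last item of left.
            word = left[-n] if n <= len(left) else ""
        else:
            # R1 is the word immediately to the right, i.e. the first item of right.
            word = right[n - 1] if n <= len(right) else ""
        return word.lower()

    return key


# Hypothetical sample text; the real corpus is queried through the website.
text = "we value freedom and they defend freedom of speech while freedom matters"
rows = kwic(text.split(), "freedom")
for left, kw, right in sorted(rows, key=sort_key("L1")):
    print(" ".join(left), f"[{kw}]", " ".join(right))
```

Sorting by "L1" groups the lines by the word immediately before the keyword, which makes recurring patterns to the left of the keyword easier to spot; "R1" does the same for the right-hand side.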





Collocation Function

Two approaches are available for the collocation function: Top 50 words and single collocate.


1. Top 50 words

a. Check the Collocation box, select “Top 50 words”, and then choose a number to specify the distance and position relative to the keyword, e.g. “1 & Before”.



b. As in the keyword search, specify the source, speaker, and time to retrieve data according to your research needs, e.g. “the SOU corpus”.



c. View your search results with both No. of Occurrence and Graphical Representation, and then choose the collocate you need.



d. Output your results to Excel by clicking Download “Keyword in Context”.




2. Single collocate

a. Check the Collocation box, enter a single collocate, e.g. “for”, and then choose a number to specify its distance and position relative to the keyword, e.g. “1 & After”.



b. As in the previous steps, specify the source, speaker, and time to retrieve data according to your research needs, e.g. “the SOU corpus”.



c. View your search results with both No. of Occurrence and Graphical Representation.



d. Output your results to Excel by clicking Download “Keyword in Context”.
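
The idea behind the collocation function (counting which words occur at a given distance before or after the keyword) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the website's code: the function name collocates is hypothetical, the text is tokenised on whitespace, and the distance is interpreted as an exact offset from the keyword, which may differ from how the website treats it.

```python
from collections import Counter


def collocates(tokens, keyword, distance=1, side="before"):
    """Count the words found exactly `distance` tokens before/after keyword."""
    offset = -distance if side == "before" else distance
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() != keyword.lower():
            continue
        j = i + offset
        # Skip positions that fall outside the text.
        if 0 <= j < len(tokens):
            counts[tokens[j].lower()] += 1
    return counts


# Hypothetical sample text.
tokens = "freedom for all and freedom for each and freedom of speech".split()

# "Top 50 words" style: the most frequent words one position after the keyword.
print(collocates(tokens, "freedom", distance=1, side="after").most_common(50))

# "Single collocate" style: how often "for" occurs one position after the keyword.
print(collocates(tokens, "freedom", distance=1, side="after")["for"])
```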



 


User Manual for the Chinese Corpus


Keyword Search

1. Enter the word or phrase you want to search for in the corpus, e.g. “社会 (society)”.



2. Click the “语料出处 (source)” you need.



3. Click “讲者(speaker)”, and then check the relevant boxes to specify the speakers you want to search.



4. Specify the time range either by selecting the boxes or by dragging the green button to limit the retrieved data to a particular timeframe.



5. Press the “搜索(search)” button to get the result page.



6. View your search results: No. of Occurrence gives the count in the relevant corpora, and Graphical Representation shows the corresponding percentage of all search results.



7. You can sort “语句 (keyword in context)” by “年份 (year)” / “地区 (region)” / “类别 (type)” / “讲者 (speaker)”.



8. You can also sort your “语句 (keyword in context)” by L1-L5 and R1-R5 (e.g. L1 is the first word to the left of the keyword, and R1 is the first word to its right).



9. Click on any entry of interest from step 8 to view “语句详细出处 (keyword in expanded context)”.



10. Click the “下载语句 (download keyword in context)” button to export your data to an Excel document.





Collocation Function

After step 6, the collocation function is available in two forms: “首20词频 (top 20 words)” and “单匹配词 (single collocate)”.


1. 首20词频 (Top 20 Words)

a. Check the “匹配词语 (collocation)” box, select “首20词频 (top 20 words)”, and then choose a number to specify the distance and position relative to the keyword, e.g. “1 & before”.



b. View the advanced search results with No. of Occurrence and Graphical Representation, and then choose the collocate you need.



c. Output your results to Excel by clicking “下载语句 (download keyword in context)”.




2. 单匹配词 (Single Collocate)

a. Check the “匹配词语 (collocation)” box, enter a single collocate with its part of speech specified, e.g. “治安 (security, n.)”, and then specify its distance and position relative to the keyword (e.g. “1 & After”).



b. Click “继续搜索 (continue to search)” and view “语句 (keyword in context)” for further results.



c. Output your results to Excel by clicking “下载语句 (download keyword in context)”.
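
Because the single collocate in the Chinese corpus is specified together with its part of speech, a match has to agree on both the word and its tag. The Python sketch below illustrates that check on a list of (word, tag) pairs. It is a hypothetical illustration, not the website's implementation: the function name find_with_collocate, the sample sentence, and its tags are all assumptions.

```python
def find_with_collocate(pairs, keyword, collocate, tag, distance=1, side="after"):
    """Return indices of keyword hits whose neighbour matches collocate and tag.

    `pairs` is a list of (word, tag) tuples for one segmented, tagged text.
    """
    offset = distance if side == "after" else -distance
    hits = []
    for i, (word, _) in enumerate(pairs):
        j = i + offset
        if word == keyword and 0 <= j < len(pairs) and pairs[j] == (collocate, tag):
            hits.append(i)
    return hits


# Hypothetical tagged sample; the tags are illustrative only.
pairs = [
    ("维护", "VV"), ("社会", "NN"), ("治安", "NN"),
    ("和", "CC"), ("社会", "NN"), ("稳定", "NN"),
]

# Keyword 社会 with the noun 治安 one position after it.
print(find_with_collocate(pairs, "社会", "治安", "NN", distance=1, side="after"))
```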



 


Frequently Asked Questions



1. What is a corpus?

A corpus is a compiled collection of written texts, often focused on the works of particular authors or on a specific subject. This HKBU Corpus of Political Speeches is an important resource for all those interested in the study of political rhetoric. The data can be searched within a particular time frame to facilitate analysis of diachronic language change. The Corpus provides empirical language data that can serve as a reference for language learning, especially for those who are keen to understand how politicians structure their speeches to win public support.


2. How do I search for a term in the corpus website?

User manuals are provided to guide users through keyword searches in both the English and the Chinese corpora.


3. Is the full written text of sources available for any results of a specific term?

If a URL is provided in the “Keyword in Expanded Context” of the result page, you can use it to access the full written text of the relevant source.


4. Can I download the search results?

Yes, you can always download your search results as an Excel document by clicking the Download “Keyword in Context” button for the English corpus, or the 下载语句 button for the Chinese corpus.


5. What is the difference between the traditional Chinese and the simplified Chinese corpus?

The sources of the traditional Chinese corpus originate from Hong Kong and Taiwan; those of the simplified Chinese corpus are from the People’s Republic of China (PRC).


6. What is the collocation function used for?

The collocation function lets you see the collocation patterns of the keyword, either by specifying a distance or by specifying a particular collocate.


7. What should I do if I do not want to use the collocation function?

You can just skip the collocation box by not checking it.


8. Where does the Word Frequency Data come from?

Word Frequency Data is generated with the “WordList” function of WordSmith Tools 6.0, which is used to extract the 100 most frequently used words. A detailed explanation can be found at: http://lexically.net/wordsmith/step_by_step_English6/index.html?makingawordlist.htm
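
To give a feel for what a word frequency list is, here is a small Python sketch that counts words and returns the most frequent ones. It is not WordSmith Tools and will not reproduce its results exactly: the function name word_list and the simple regular-expression tokenisation (English letters with an optional internal apostrophe) are assumptions made for illustration.

```python
import re
from collections import Counter


def word_list(text, top_n=100):
    """Return the top_n most frequent words in text as (word, count) pairs."""
    # Lowercase the text and keep runs of letters, allowing one internal
    # apostrophe (e.g. "nation's"). This tokenisation is an assumption.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common(top_n)


# Hypothetical sample text.
sample = "The people and the nation. The nation's future!"
for word, count in word_list(sample, top_n=3):
    print(word, count)
```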


9. Why is it not recommended to use this website for commercial purposes?

This corpus is intended mainly for educational purposes, to enhance students’ and researchers’ interest in and knowledge of political discourse.


10. How do I cite the website?

Ahrens, Kathleen. (2015). Corpus of Political Speeches. Hong Kong Baptist University Library, Retrieved Date Accessed, from https://digital.lib.hkbu.edu.hk/corpus/