Notes for the Chinese Corpus

The parts of speech in the Chinese corpus were tagged initially by the Stanford Log-linear Part-Of-Speech Tagger and then carefully proofread. The following is a brief explanation of the part-of-speech tagging guidelines according to Xia (2000a, 2000b).

1. VA: Predicative adjective
VA roughly corresponds to adjectives in English. It includes predicates that have no object and can be modified by 很[very], as well as predicates derived from that type via reduplication or the pattern N + A (e.g. 雪白[snow white]).

2. VC: Copula
The words 是[be] and 为[be] are tagged as VC. 非 is also tagged as VC if it means 不[not] 是[be] and there is no other verb in the sentence.

3. VE: you3 as the main verb
Only 有[have], 没[not]{有[have]}, and 无[not have] are tagged as VE when they are the main verbs (including the possessive you3, existential you3, etc.).

4. VV: Other verbs
This includes the rest of the verbs.

5. NR: Proper noun
An NR is the name of a particular person, politically or geographically defined location, or organization.

6. NT: Temporal noun
Temporal nouns can be the objects of prepositions such as 在[at], 从[since], 到[until], or 等到[until]. They can be referred to by 这个时候[at this moment] and questioned by 什么时候[when].

7. NN: Other noun
NN includes all other nouns.

8. LC: Localizer
Localizers are attached to the preceding NP/S to denote direction, location and so on. 为止[until], 开始[starting from], 来[ever since], 以来[since], 起[since], and 在内[inside] are also tagged as LCs.

9. PN: Pronoun
Pronouns function as substitutes for noun phrases. They include personal pronouns (e.g. 我[I], 你[you]), demonstratives used alone as NPs (e.g. 这[this]), possessive pronouns, and reflexives.

10. DT: Determiner
This includes demonstratives (e.g. 这[this], 那[that], 该[the]) and words such as 每[every], 各[each], 前[the preceding], 后[the following].

11. CD: Cardinal number
This includes cardinal numbers (optionally followed by 概数词[approximate number indicators]) and words such as 好些[some], 若干[several], 半[half], 许多[many], 很多[many].

12. OD: Ordinal number
Ordinal numbers are tagged as ODs. 第 + CD is treated as one word and tagged as OD.

13. M: Measure word
Measure words include classifiers (e.g. 个), group measure words (e.g. 群[group]), and words such as 公里[kilometre] and 升[litre].

14. AD: Adverb
This includes manner adverbs, frequency adverbs, degree adverbs, and conjunctive adverbs.

15. P: Preposition
A preposition can take a noun phrase or a clause as its argument.

16. CC: Coordinating conjunction
A coordinating conjunction (CC) conjoins two constituents with the same function.

17. CS: Subordinating conjunction
Words that join two clauses, one subordinate to the other, are tagged as subordinating conjunctions (CS).

18. DEC: de5 as a complementizer or a nominalizer
This only includes 的 and 之 when they function as a complementizer or a nominalizer (e.g. 吃[eat] 的/DEC).

19. DEG: de5 as a genitive marker and an associative marker
This only includes 的 and 之 when they function as a genitive marker or an associative marker.

20. DER: Resultative de5
de5(得) is tagged as DER in the potential form V-得-R and in the V-de construction (他[he] 跑[run] 得/DER 很[very] 快[fast] <He runs very fast>).

21. DEV: Manner de5
This only includes 地 when it occurs in "XP 地 VP", where XP modifies the VP. 的 is sometimes also used in this pattern.

22. AS: Aspect particle
Verbal particles that indicate aspect are tagged as aspect particles (AS). This only includes 了, 着, 过, and 的.

23. SP: Sentence final particle
SP often appears at the end of a sentence. For example, 他[he], 好[good] 吧[SP]?

24. ETC
This tag is used for the words 等 and 等等.

25. MSP
This includes particles such as 所, 以, 来, and 而 when they appear before a VP.

26. IJ: Interjection
Interjections appear in the sentence-initial position (e.g. 啊[Ah]).

27. ON: Onomatopoeia
ON is used for 象声词[onomatopoeia], words that imitate sounds (e.g. 哗哗[ON]).

28. LB: bei4 in long bei-construction
This only includes 被, 叫, 给, and wei2(为) when they occur in the long bei-construction (e.g. 他[he] 被/LB 我[I] 训[scold] 了/AS 一[one] 顿/M).

29. SB: bei4 in short bei-construction
This only includes 被 and 给 when they occur in the short bei-construction (e.g. 他[he] 被/SB 训[scold] 了/AS 一[one] 顿/M).

30. BA
This only includes 把 and 将 when they occur in the ba-construction (e.g. 他[he] 把/BA 你[you] 骗[cheat] 了/AS).

31. JJ
JJ includes two types of adjectives:
a. 区别词[distinguishing words], which modify nouns in the pattern JJ + 的 + {N} or JJ + N;
b. hyphenated compounds, i.e. two-syllable shortened forms of relative clauses (e.g. 留美[having studied in the US]/JJ scholar/NN).

32. FW: Foreign word
FW is used to tag foreign words.

33. PU: Punctuation
Punctuation marks are tagged as PU.

For detailed information, please refer to the following references on the software and the guidelines.

References

The Stanford Natural Language Processing Group (2015). Stanford Word Segmenter (Version 3.6.0) [Software]. Available from http://nlp.stanford.edu/software/segmenter.html

The Stanford Natural Language Processing Group (2015). Stanford Log-linear Part-Of-Speech Tagger (Version 3.6.0) [Software]. Available from http://nlp.stanford.edu/software/tagger.html

Xia, Fei. (2000a). The segmentation guidelines for the Penn Chinese Treebank (3.0). IRCS Technical Reports Series. Retrieved from: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1038&context=ircs_reports

Xia, Fei. (2000b). The part-of-speech tagging guidelines for the Penn Chinese Treebank (3.0). IRCS Technical Reports Series. Retrieved from: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports

Summary of the Treebank Part-of-Speech Tagset

Table 1: POS tagset in alphabetical order, from Xia (2000b, p. 37)
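The word/TAG notation used in the examples above (e.g. 被/SB, 了/AS, 顿/M) is straightforward to process once text is exported. The following is a minimal Python sketch, not part of the corpus website: the function name `parse_tagged` and the dictionary `TAG_NAMES` are hypothetical, the dictionary holds only a subset of the tags listed above, and the fully tagged sample sentence is an assumption built from the short bei-construction example by applying the tag definitions in these notes.

```python
# Subset of the tagset described in the notes above (labels copied from the list).
TAG_NAMES = {
    "PN": "Pronoun",
    "SB": "bei4 in short bei-construction",
    "VV": "Other verbs",
    "AS": "Aspect particle",
    "CD": "Cardinal number",
    "M": "Measure word",
}


def parse_tagged(text):
    """Split a whitespace-separated 'word/TAG' string into (word, tag) pairs.

    Tokens without a slash are returned with tag None.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


# Assumed full tagging of the example 他 被/SB 训 了/AS 一 顿/M.
sentence = "他/PN 被/SB 训/VV 了/AS 一/CD 顿/M"
tagged = parse_tagged(sentence)

for word, tag in tagged:
    # Fall back to a placeholder for tags outside the illustrative subset.
    print(word, tag, TAG_NAMES.get(tag, "(not in subset)"))
```

Splitting on the last slash keeps the sketch safe for words that themselves contain a slash; whether such tokens occur in the corpus export is not stated in the notes.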
User Manual for the English Corpus

This manual explains how to search the corpus. Users are free to select specific functions among the following steps according to their research needs.

Keyword Search

1. Input the word or phrase you want to search for in the corpus, e.g. "freedom".
2. Click the Source needed, either Hong Kong or United States. Then check the relevant boxes to specify the sources you want to search.
3. Click the Speaker needed.
4. Choose the time range, either by selecting the boxes or by dragging the green button, to limit the retrieved data to a particular timeframe.
5. Press the Search button to get the result page.
6. View your search results, which show both the No. of Occurrence in the relevant corpora and a Graphical Representation of the relative percentage among all the search results.
7. You can sort the Keyword in Context by Year/Region/Type/Speaker.
8. You can also sort the Keyword in Context by L1-L5 and R1-R5 (e.g. L1 is the first word to the left of the keyword).
9. Click any entry of interest from step 8 to view the Keyword in Expanded Context.
10. Click the button Download "Keyword in Context" to export your data to an Excel document.

Collocation Function

Two approaches are available for the collocation function: Top 50 words and single collocate.

1. Top 50 words
a. Check the Collocation box, select "Top 50 words", and then choose a number to specify the distance and position relative to the keyword, e.g. "1 & Before".
b. Similarly, choose and specify the source, speaker, and time to retrieve data according to your research needs, e.g. "the SOU corpus".
c. View your search results with both No. of Occurrence and Graphical Representation, and then choose the collocate you need.
d. Output your results to Excel by clicking Download "Keyword in Context".

2. Single collocate
a. Check the Collocation box, input a single collocate, e.g. "for", and then choose a number to specify its distance and position relative to the keyword, e.g. "1 & After".
b. As in the previous steps, choose and specify the source, speaker, and time to retrieve data according to your research needs, e.g. "the SOU corpus".
c. View your search results with both No. of Occurrence and Graphical Representation.
d. Output your results to Excel by clicking Download "Keyword in Context".
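To make the L1-L5 and R1-R5 sorting in steps 7 and 8 of the Keyword Search concrete, the following Python sketch builds keyword-in-context records from a token list and sorts them by a chosen context position. It is an illustration only and does not reflect how the website is implemented: the function names (`kwic`, `context_word`, `sort_hits`), the whitespace tokenisation, and the sample sentence are all assumptions.

```python
def kwic(tokens, keyword, window=5):
    """Return one record per keyword occurrence, with up to `window` words per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            hits.append({
                "left": tokens[max(0, i - window):i],
                "keyword": tok,
                "right": tokens[i + 1:i + 1 + window],
            })
    return hits


def context_word(hit, position):
    """Return the word at a position such as 'L1' or 'R2' ('' if the context is too short)."""
    side, n = position[0].upper(), int(position[1:])
    if side == "L":
        words = hit["left"]
        # L1 is the word immediately to the left, i.e. the last item of the left context.
        return words[-n].lower() if n <= len(words) else ""
    words = hit["right"]
    # R1 is the word immediately to the right, i.e. the first item of the right context.
    return words[n - 1].lower() if n <= len(words) else ""


def sort_hits(hits, position):
    """Sort keyword-in-context records alphabetically by the word at `position`."""
    return sorted(hits, key=lambda h: context_word(h, position))


# Invented sample text, for illustration only.
text = ("we defend freedom for all and we cherish freedom of speech "
        "and true freedom at home")
hits = sort_hits(kwic(text.split(), "freedom"), "L1")

for h in hits:
    print(" ".join(h["left"]), "[" + h["keyword"] + "]", " ".join(h["right"]))
```

Sorting by L1 groups lines by the word that immediately precedes the keyword, which is why this view is useful for spotting recurring modifiers; passing "R1" instead groups them by the word that follows.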
User Manual for the Chinese Corpus

Keyword Search

1. Input the word or phrase you want to search for in the corpus, e.g. "社会(society)".
2. Click the "语料出处(source)" needed.
3. Click "讲者(speaker)", and then check the relevant boxes to specify the speakers you want to search.
4. Specify the time range, either by selecting the boxes or by dragging the green button, to limit the retrieved data to a particular timeframe.
5. Press the "搜索(search)" button to get the result page.
6. View your search results, which show both the No. of Occurrence in the relevant corpora and a Graphical Representation of the relative percentage among all the search results.
7. You can sort "语句 (keyword in context)" by "年份(year)" / "地区(region)" / "类别(type)" / "讲者(speaker)".
8. You can also sort "语句 (keyword in context)" by L1-L5 and R1-R5 (e.g. L1 is the first word to the left of the keyword).
9. Click any entry of interest from step 8 to view "语句详细出处 (keyword in expanded context)".
10. Click the button "下载语句 (download keyword in context)" to export your data to an Excel document.

Collocation Function

After step 6, the collocation function becomes available in two approaches: "首20词频 (top 20 words)" or "单匹配词 (single collocate)".

1. 首20词频 (Top 20 Words)
a. Check the "匹配词语(collocation)" box, select "首20词频 (top 20 words)", and then choose a number to specify the distance and position relative to the keyword, e.g. "1 & before".
b. View the advanced search results with No. of Occurrence and Graphical Representation, and then choose the collocate you need.
c. Output your results to Excel by clicking "下载语句 (download keyword in context)".

2. 单匹配词 (Single Collocate)
a. Check the "匹配词语(collocation)" box, input a single collocate with its part of speech specified, e.g. "治安(security, n.)", and then specify its distance and position relative to the keyword (e.g. "1 & After").
b. Click "继续搜索 (continue to search)" and view "语句 (keyword in context)" for further results.
c. Output your results to Excel by clicking "下载语句 (download keyword in context)".
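The two collocation approaches above can be illustrated with a short Python sketch. It assumes that a setting such as "1 & After" means the word exactly one token after the keyword; the website's exact definition of distance is not documented here, so treat that as an assumption. The function name `collocates` and the segmented sample tokens are hypothetical.

```python
from collections import Counter


def collocates(tokens, keyword, distance, position):
    """Count words found exactly `distance` tokens before or after each keyword occurrence."""
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    offset = -distance if position == "before" else distance
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            j = i + offset
            # Skip positions that fall outside the text.
            if 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts


# Invented, pre-segmented sample text (one token per whitespace-separated word).
tokens = "维护 社会 治安 和 社会 稳定 , 改善 社会 治安".split()

# "Top 20 words" style query: most frequent words one position after the keyword.
after_counts = collocates(tokens, "社会", 1, "after")
print(after_counts.most_common(20))

# "Single collocate" style query: occurrences of one chosen collocate at that position.
print(after_counts["治安"])
```

Chinese text must be word-segmented before a position-based count like this is meaningful, which is the role the Stanford Word Segmenter plays for this corpus.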
Frequently Asked Questions

1. What is a corpus?
A corpus is a compilation of written texts or vocabulary, typically focusing on works by particular authors on a specific subject. This HKBU Corpus of Political Speeches is an important resource for all those interested in the study of political rhetoric. The data can be searched within a particular time frame to facilitate analysis of diachronic language change. The Corpus provides empirical language data, which can serve as a reference for language learning, especially for those who are keen to understand how politicians structure their speeches to win support from the public.

2. How do I search for a term in the corpus website?
User manuals are provided to guide users through keyword searches in both the English and Chinese corpora.

3. Is the full written text of sources available for any results of a specific term?
If a URL link is provided in the "Keyword in Expanded Context" of the result page, you can access the full written text of the relevant source.

4. Can I download the search results?
Yes, you can always download your search results as an Excel document by clicking the button Download "Keyword in Context" for the English corpus, or 下载语句 for the Chinese corpora.

5. What is the difference between the traditional Chinese and the simplified Chinese corpus?
The sources of the traditional Chinese corpus originate from Hong Kong and Taiwan; the sources of the simplified Chinese corpus are from the People's Republic of China (PRC).

6. What is the use of the collocation?
The collocation function allows you to see the collocation patterns of the keyword, either by specifying a distance or by specifying a collocate.

7. What should I do if I do not want to use the collocation function?
Simply leave the collocation box unchecked.

8. Where is the information of the Word Frequency Data from?
The Word Frequency Data is generated with the "WordList" function of WordSmith Tools 6.0, which is used to extract the top 100 most frequently used words. A detailed explanation can be found via this link: http://lexically.net/wordsmith/step_by_step_English6/index.html?makingawordlist.htm

9. Why is it not recommended to use this website for commercial purposes?
This corpus is mainly for educational purposes, to enhance students' and researchers' interest in and knowledge of political discourse.

10. How do I cite the website?
Ahrens, Kathleen. (2015). Corpus of Political Speeches. Hong Kong Baptist University Library. Retrieved [Date Accessed], from https://digital.lib.hkbu.edu.hk/corpus/