About the CEPIC


Description

The Chinese/English Political Interpreting Corpus (CEPIC), about 6.5 million word tokens in size, is designed for the study of Chinese/English political interpreting and translation. It consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington DC and London, as well as their translated/interpreted texts.

The main speech types of CEPIC include the reading of government reports such as policy addresses and budget speeches, Q&A at press conferences, parliamentary debates, and remarks delivered at bilateral meetings (for details, please refer to the section Basic Statistics). In particular, speeches in the Hong Kong subset were mostly interpreted from Cantonese into Putonghua and English, and those in the Beijing subset from Putonghua into English. The other two subsets, i.e., Washington DC and London, mainly include English speeches delivered in similar settings and can be regarded as reference subsets for the interpreted English speeches.

Data of CEPIC were collected in two ways: 1) speech transcripts and their translations collected from government websites (“Raw”); and 2) a revised or newly transcribed version (when no transcripts were readily available) of these speeches and their interpreted texts, based on audio/video recordings collected from government websites and TV programme archives (“Annotated”).

The corpus features a parallel display of up to six versions of the same speech segment, aligned at paragraph level. Apart from POS tagging, the corpus is also annotated with different prosodic and paralinguistic features that are of concern to the study of spoken language as well as interpreting.

The CEPIC can be used to investigate matters relating to Chinese/English political translation/interpreting and political discourse at large. It can also serve students, teachers, as well as people working in political settings, in aspects of political speech delivery and translation/interpreting production. Users can also download search results from the corpus for their own teaching/research purposes.

The CEPIC consists of parallel representation of speech segments in Cantonese, Putonghua and English. The following table shows the number of words (word token) and unique words (type) in each language.

Table 1. The composition of the CEPIC by language

                 Word (Word Token)   Unique Word (Type)
Chinese          2,578,911           83,312
   Cantonese     1,072,368           61,837
   Putonghua     1,506,541           30,320
English          3,815,083           32,748
Total            6,393,994           116,060
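
The distinction between word tokens and unique words (types) in the table above can be illustrated with a minimal Python sketch. It assumes the text has already been segmented into words; the function name `token_type_counts` and the sample sentence are hypothetical and are not taken from the CEPIC tools.

```python
from collections import Counter


def token_type_counts(tokens):
    """Return (number of word tokens, number of unique word types)."""
    counts = Counter(tokens)
    # Tokens: every running word. Types: every distinct word form.
    return sum(counts.values()), len(counts)


# Hypothetical, already-segmented sample (not taken from the corpus).
sample = ["we", "will", "support", "the", "economy", "and", "we", "will", "grow"]
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)  # 9 7
```

Because a word form can occur in more than one subset, type counts of subsets do not simply add up to the type count of the whole.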

The main speech types of CEPIC include the reading of government reports such as policy addresses and budget speeches, Q&A at press conferences, parliamentary debates, as well as remarks delivered at bilateral meetings. The following table shows the current composition of the corpus and some basic statistics of each subset.

Table 2. The composition of the CEPIC by speech types

Speech Type Word (Word Token)
1 HK SAR Policy Addresses (HKPA) 1,290,774
2 Press Conferences of HK SAR Policy Addresses (HKPAPC) 326,194
3 HK SAR Budget Speeches (HKBS) 1,167,530
4 Press Conferences of HK SAR Budget Speeches (HKBSPC) 419,236
5 PRC Reports on the Work of the Government (PRCWoG) 782,794
6 Press Conferences of PRC Reports on the Work of the Government (PRCWoGPC) 448,111
7 US State of the Union Addresses (USSoUA) 275,018
8 Press Conferences of US State of the Union Addresses (USSoUAPC) 266,639
9 US Budget Speeches (USBS) 73,115
10 Press Conferences of US Budget Speeches (USBSPC) 328,850
11 UK State Opening Addresses of Parliament (UKSOoP) 31,006
12 Debates on the UK State Opening Addresses of Parliament (UKSOoPD) 53,941
13 UK Budget Speeches (UKBS) 469,452
14 Debates on the UK Budget Speeches (UKBSD) 376,721
15 Bilateral Meetings between PRC Key Politicians and their Counterparts in US (BMPRCUS) 70,138
16 Bilateral Meetings between PRC Key Politicians and their Counterparts in UK (BMPRCUK) 14,473
Total 6,393,994

The following is a list of words that have specific meaning in the CEPIC.

  1. Annotated: It refers to a revised or newly transcribed version (when there are no readily available transcripts) of political speeches and their interpreted texts based on audios/videos collected from government websites and TV programme archives (also see Raw). The texts were annotated with prosodic and paralinguistic features that are of concern to the study of spoken language as well as interpreting.
  2. CI: It refers to Consecutive Interpreting, a mode of interpreting employed in political settings when the interpreter speaks after the speaker finishes (part of) his/her speech (also see Interpreting Mode and SI).
  3. Delivery Mode: It refers to the way in which a speech is delivered and mainly includes Monologue and Dialogue.
  4. Dialogue: It refers to the Delivery Mode in which one or more political figures debate with or answer questions from the audience or each other.
  5. Interpreter Language: It refers to the language that the interpreter interprets into.
  6. Interpreting Mode: There are usually two modes of interpreting employed in political settings, i.e., CI and SI.
  7. Location: It refers to the (capital) city where the political speech/interpreting is delivered.
  8. Monologue: It refers to the Delivery Mode in which a political figure delivers the speech to the audience, with almost no interactions with the audience or other political figures.
  9. Part of Speech (POS): It refers to the grammatical category a word belongs to, which is also called a word class. The CEPIC is POS tagged, i.e., each word is tagged with its part of speech (also refer to the section POS tagging for a full list of POS tags used in the CEPIC).
  10. Raw: It refers to the speech transcripts and their translations collected from government websites, with no annotations of prosodic or paralinguistic features (also see Annotated).
  11. SI: It refers to Simultaneous Interpreting, a mode of interpreting employed in political settings when the interpreter speaks (usually in an interpreting booth) while the speaker delivers his/her speech (also see CI and Interpreting Mode).
  12. Speaker Language: It refers to the language that the speaker uses.
  13. Speaker Role: It refers to the position of the speaker in a government.

The CEPIC is POS tagged with the assistance of Stanford CoreNLP 3.9.2 (Manning et al. 2014).

A semi-automatic process was employed to improve the accuracy of machine tagging, in which all tags were checked and revised based on subsets of manually checked testing data. A detailed account of the semi-automatic process employed in the POS tagging of CEPIC will be available soon.

The following table lists the POS tags that appear in the English subset of CEPIC, based on the Part-of-Speech Tagging Guidelines for the Penn Treebank Project (Santorini 1990: 6-7).

Table 1. POS tags that appear in the English subset of CEPIC (based on Santorini 1990: 6-7)

POS tag Description
/CC Coordinating conjunction
/CD Cardinal number
/DT Determiner
/EX Existential there
/FW Foreign word
/IN Preposition or subordinating conjunction
/JJ Adjective
/JJR Adjective, comparative
/JJS Adjective, superlative
/LRB Open parenthesis
/LS List item marker
/MD Modal
/NN Noun, singular or mass
/NNP Proper noun, singular
/NNPS Proper noun, plural
/NNS Noun, plural
/PDT Predeterminer
/POS Possessive ending
/PRP Personal pronoun
/PRP$ Possessive pronoun
/PU Punctuation
/RB Adverb
/RBR Adverb, comparative
/RBS Adverb, superlative
/RP Particle
/RRB Close parenthesis
/SYM Symbol
/TO to
/UH Interjection
/VB Verb, base form
/VBD Verb, past tense
/VBG Verb, gerund or present participle
/VBN Verb, past participle
/VBP Verb, non-3rd person singular present
/VBZ Verb, 3rd person singular present
/WDT Wh-determiner
/WP Wh-pronoun
/WP$ Possessive wh-pronoun
/WRB Wh-adverb
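
As an illustration of how such tags might be used, the following Python sketch splits tagged text into (word, tag) pairs and selects the nouns. It assumes each word is written as `word/TAG`, which is inferred only from the leading slash in the tag list above; the actual export format of CEPIC may differ. The function name `split_tagged` and the sample sentence are hypothetical.

```python
def split_tagged(text):
    """Split whitespace-separated 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits at the last slash, so a word that itself
        # contains a slash keeps everything before the final one.
        word, sep, tag = item.rpartition("/")
        if sep:  # ignore items that carry no tag at all
            pairs.append((word, tag))
    return pairs


# Hypothetical tagged sentence (not taken from the corpus).
tagged = "We/PRP will/MD increase/VB public/JJ expenditure/NN"
pairs = split_tagged(tagged)

# All four noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)  # ['expenditure']
```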

The following table lists the POS tags that appear in the Chinese subset of CEPIC, based on the Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0) (Xia 2000: 37).

Table 2. POS tags that appear in the Chinese subset of CEPIC (based on Xia 2000: 37)

POS tag Description
/AD Adverb
/AS Aspect Particle
/BA 把 in ba-construction
/CC Coordinating conjunction
/CD Cardinal number
/CS Subordinating conjunction
/DEC 的 as a complementizer or a nominalizer
/DEG 的 as a genitive marker and an associative marker
/DER Resultative 得
/DEV Manner 地 (before VP)
/DT Determiner
/ETC For words 等, 等等
/FW Foreign words
/IJ Interjection
/JJ Other noun-modifier
/LB 被 in long bei-construction
/LC Localizer
/LRB Open parenthesis
/M Measure word
/MSP Other particle
/NN Common noun
/NR Proper noun
/NT Temporal noun
/OD Ordinal number
/P Preposition
/PN Pronoun
/PU Punctuation
/RRB Close parenthesis
/SB 被 in short bei-construction
/SP Sentence-final particle
/VA Predicative adjective
/VC Copula
/VE 有 as the main verb
/VV Other verb

References:

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

Santorini, B. (1990). Part-Of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revision, 2nd printing). Department of Linguistics, University of Pennsylvania.

Xia, Fei. (2000). The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0). IRCS Technical Reports Series. 38.


Data of CEPIC were collected following specific steps and protocols. In particular, the “annotated” version of the CEPIC corpus was transcribed and annotated in a way that reflects features of spoken language data.

Speeches of CEPIC were manually revised or transcribed based on audio/video recordings of the speeches and their interpreting, if any. Following a standardised process, the transcription of CEPIC aims to represent the spoken text as closely as possible to how it was delivered. In addition, all Cantonese texts were transcribed in a way that captures features of spoken Cantonese. Text and audio/video links were also included for those who may be interested in the sources of the speeches.

The following table shows the differences between the “raw” and “annotated” data.

Cantonese
   Raw: 我們要重新認識八十年代關於香港發展的一些重要概念,放棄二元對立分析方法。(HKSAR Policy Address, 2008-10-15)
   Annotated: [...][我要]我哋要重新認識八十年代關於香港發展嘅一啲嘅重要嘅概念,係放棄二元對立嘅分析嘅方法。(HKSAR Policy Address, 2008-10-15)
Putonghua
   Raw: 在元朝有一位画家叫黄公望,他画了一幅著名的《富春江居图》。(Press Conference of PRC Report on the Work of the Government, 2010-03-14)
   Annotated: 在元朝 [...]有一位画家[...]叫黄公望,他画了一幅[...]著名的 [...]啊 [...]《富春江居图》。(Press Conference of PRC Report on the Work of the Government, 2010-03-14)
English
   Raw: So that is the big difference in our approach and the approach that I think might have been debated about. (Press Conference of US Budget Speech, 1997-02-06)
   Annotated: [er] So [that] that is the big difference [er] in our approach and the approach [er] that [er] I think [er] might have been debated about. (Press Conference of US Budget Speech, 1997-02-06)

As the table above shows, the “annotated” version features annotations of prosodic and paralinguistic features (e.g., pauses, fillers, repetitions and self-repairs) that are of concern to the study of spoken language as well as interpreting.
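
To show how the bracketed annotations relate to the running text, the following Python sketch strips every square-bracketed span from an annotated English segment. It assumes that all annotations are enclosed in square brackets, which is inferred only from the three examples above; the full annotation scheme may use other markers. The function name `strip_annotations` is hypothetical.

```python
import re

# Assumption: every annotation is a span enclosed in square brackets.
BRACKETED = re.compile(r"\[[^\]]*\]")


def strip_annotations(text):
    """Remove bracketed annotations and collapse the leftover whitespace."""
    without_marks = BRACKETED.sub(" ", text)
    return re.sub(r"\s+", " ", without_marks).strip()


# The annotated English example from the table above.
annotated = (
    "[er] So [that] that is the big difference [er] in our approach "
    "and the approach [er] that [er] I think [er] might have been debated about."
)
print(strip_annotations(annotated))
```

For this English example the result coincides with the “raw” version. The same does not hold for the Cantonese example, where the annotated text also differs from the raw text in wording, and text written without spaces would need the replacement string adjusted.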

A detailed account of the steps and protocols used in the transcription and annotation of CEPIC will be available soon.


Word Frequency Data

This section shows the most frequently used words in CEPIC and in its subsets by language and speech type. The data were extracted with the lexical analysis software WordSmith Tools 7.0 (Scott 2016).

To accommodate a wide range of research/teaching purposes, the word frequency data were generated with no stop word list.
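
A frequency list of this kind, built without a stop word list, can be sketched in Python as follows. This is only an illustration for English text and does not reproduce how WordSmith tokenises or counts words; the function name `frequency_list`, the letters-only tokenisation and the sample sentence are assumptions.

```python
import re
from collections import Counter


def frequency_list(text, top_n=100):
    """Return the top_n (WORD, frequency) pairs; no stop words are removed."""
    # Simplification: a word is any run of ASCII letters, upper-cased so
    # that "The" and "the" are counted together.
    words = re.findall(r"[A-Za-z]+", text.upper())
    return Counter(words).most_common(top_n)


# Hypothetical sample text (not taken from the corpus).
sample = "The government will support the economy, and the economy will grow."
for rank, (word, freq) in enumerate(frequency_list(sample, top_n=3), start=1):
    print(rank, word, freq)
```

Since nothing is filtered out, function words such as THE, AND and TO are expected to head the list, as they do in the English tables in this section.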

References:
Scott, M. (2016). WordSmith Tools 7.0. Available at https://lexically.net/wordsmith/ (accessed 10 September 2017).

N Word Freq.
1123,613
232,017
326,392
426,347
523,998
620,961
7政府20,632
8香港20,575
919,748
10我们19,147
1115,344
1214,758
13发展12,239
14#12,205
1511,938
1611,446
1710,896
1810,741
1910,234
20经济9,863
219,769
228,965
238,830
248,740
258,604
267,742
27方面7,565
287,494
29可以7,437
307,252
317,242
327,189
33我們7,167
34工作7,118
357,020
36提供6,957
376,896
386,867
39增加6,851
406,819
41措施6,761
42社会6,473
436,206
446,205
456,146
46市民6,072
47我哋5,862
485,807
49同埋5,803
50需要5,615
51發展5,592
525,479
53政策5,345
545,294
555,182
56金融5,061
574,955
58问题4,950
59教育4,798
604,768
61加强4,686
62服务4,513
63今年4,446
644,438
654,405
664,405
67經濟4,368
684,328
69以及4,230
704,172
71计划4,168
72包括4,138
734,124
744,087
754,079
76改革4,059
77研究4,031
783,961
793,867
80市场3,860
813,808
82收入3,784
83建设3,739
843,686
853,659
86企业3,641
873,636
88去年3,610
89改善3,605
90合作3,592
91服務3,573
92这个3,563
933,524
94支持3,504
95人士3,444
96中国3,444
97土地3,375
983,368
993,365
1003,362

N Word Freq.
135,334
214,674
313,775
412,787
5香港10,179
69,683
79,480
8政府9,370
98,843
108,643
118,508
127,710
137,414
14我們7,094
156,989
166,312
17我哋5,859
18同埋5,802
19發展5,587
205,100
21#4,812
224,781
23經濟4,360
244,322
254,058
26可以3,882
273,857
283,730
293,713
30服務3,567
31方面3,520
323,515
333,492
34提供3,446
353,354
36市民3,105
37措施3,087
383,077
392,958
40增加2,939
41計劃2,908
422,886
43社會2,847
442,797
452,780
46需要2,707
472,673
482,672
49工作2,525
502,375
512,342
522,252
532,228
542,102
552,072
562,059
57市場2,056
58金融2,033
592,001
60包括1,990
61以及1,964
621,945
63問題1,944
641,918
65已經1,916
661,912
67研究1,894
681,890
691,869
701,815
71政策1,790
72今年1,781
731,777
74教育1,774
751,757
761,754
771,751
781,722
79人士1,722
801,720
811,718
821,714
831,672
84收入1,640
851,610
861,593
871,564
88建議1,555
891,550
90土地1,548
91改善1,512
921,495
931,492
94加強1,480
95施政1,467
96去年1,460
971,436
981,434
99其實1,422
100報告1,417

N Word Freq.
188,279
224,603
3我们19,147
417,884
517,686
612,572
7发展12,239
811,938
911,278
10政府11,262
11香港10,396
1210,268
13经济9,863
149,769
157,982
167,733
17#7,393
18社会6,473
196,206
206,157
216,146
225,960
235,873
244,955
25问题4,950
264,874
27加强4,679
28工作4,593
29服务4,513
304,195
314,172
32计划4,168
334,078
34方面4,045
35增加3,912
363,907
373,861
38市场3,860
393,819
40建设3,739
41措施3,674
42企业3,641
433,637
44改革3,587
45这个3,563
46可以3,555
47政策3,555
48提供3,511
493,505
50中国3,444
513,431
523,425
533,408
543,332
55继续3,324
563,270
573,131
58金融3,028
59教育3,024
60市民2,967
612,921
62需要2,908
63财政2,863
64国家2,819
652,718
66今年2,665
672,653
682,641
692,628
702,557
71增长2,547
72支持2,531
73国际2,501
74推动2,387
75提高2,357
762,346
77合作2,341
78推进2,325
79他们2,309
80以及2,266
81投资2,264
822,258
83已经2,254
84环境2,233
852,230
86积极2,219
87就业2,217
88制度2,216
892,195
90去年2,150
91包括2,148
92收入2,144
932,142
94研究2,137
952,129
96改善2,093
97促进2,020
982,011
99重要2,010
100地区1,999

N Word Freq.
1THE240,433
2AND138,508
3TO126,832
4OF110,864
5IN80,540
6WE62,892
7A61,145
8THAT59,563
9FOR46,229
10WILL44,349
11IS40,311
12I33,912
13#33,206
14ON28,155
15OUR28,063
16THIS26,071
17HAVE25,731
18S25,425
19IT24,264
20WITH22,882
21BE22,386
22ARE21,821
23AS21,081
24GOVERNMENT16,094
25BY15,707
26YOU15,659
27NOT14,718
28YEAR14,322
29HAS14,222
30ER13,640
31FROM13,366
32PEOPLE13,155
33MORE12,984
34AT12,005
35SO11,243
36BUT11,050
37CAN10,678
38AN10,583
39THEIR10,539
40ALL10,466
41DEVELOPMENT10,430
42HONG10,425
43KONG10,339
44THEY10,162
45NEW9,828
46DO9,441
47ALSO9,271
48TAX9,258
49THERE8,827
50ABOUT8,747
51ECONOMIC8,496
52HE8,464
53WAS8,146
54WHAT8,041
55OR8,028
56YEARS7,938
57UP7,698
58WORK7,339
59BUDGET7,289
60PUBLIC7,118
61BEEN7,010
62NOW6,930
63WOULD6,915
64WHICH6,874
65ONE6,822
66ECONOMY6,816
67OVER6,752
68SHOULD6,704
69UM6,549
70PER6,460
71THESE6,407
72THAN5,926
73MAKE5,918
74MY5,863
75OUT5,759
76SUPPORT5,747
77LAST5,745
78GROWTH5,663
79T5,630
80N5,580
81SOME5,575
82CENT5,567
83NIL5,548
84WHO5,499
85CHINA5,495
86NEED5,481
87IF5,419
88OTHER5,371
89SERVICES5,354
90PRESIDENT5,318
91FINANCIAL5,207
92THOSE5,172
93TIME5,049
94US5,024
95SYSTEM4,990
96TWO4,941
97THINK4,908
98BILLION4,862
99WHEN4,847
100COUNTRY4,785

N Word Freq.
113,374
2政府4,661
34,588
4香港4,579
53,765
63,114
7發展3,023
82,973
9我們2,887
10同埋2,841
112,524
122,488
132,328
142,099
151,941
16服務1,838
171,794
18提供1,751
191,683
20我哋1,655
21#1,621
221,607
23計劃1,558
24經濟1,537
251,529
261,504
27社會1,485
281,430
291,414
301,309
31市民1,299
32工作1,241
331,163
341,156
35研究1,090
36教育1,070
37需要1,058
381,054
391,015
40增加1,013
41969
42930
43908
44加強903
45包括897
46891
47方面882
48879
49以及879
50877
51868
52861
53842
54措施840
55839
56政策837
57土地822
58人士784
59推動756
60已經743
61支援743
62可以739
63合作735
64內地733
65改善716
66715
67713
68705
69中心679
70環境659
71市場653
72650
73649
74641
75問題632
76繼續625
77金融624
78文化622
79616
80今年614
81提升609
82家庭605
83604
84602
85602
86590
87587
88特區577
89574
90推行572
91建議569
92564
93房屋563
94國際557
95委員會548
96546
97546
98545
99540
100地區539

N Word Freq.
123,620
26,685
35,151
4政府4,862
5香港4,725
6我们4,698
74,307
8发展3,628
93,438
102,606
112,359
122,196
132,139
142,109
15计划2,090
162,050
17服务2,010
18经济1,986
19社会1,903
201,869
211,828
22提供1,811
23#1,661
241,604
251,560
261,540
271,470
281,454
29市民1,402
301,384
311,355
32工作1,299
33加强1,258
341,231
35教育1,177
36研究1,145
371,121
38需要1,105
391,067
40增加1,062
411,044
42以及1,020
43999
44内地962
45961
46继续945
47方面936
48包括926
49支持924
50推动915
51政策911
52措施871
53人士841
54土地832
55环境820
56问题817
57816
58市场796
59786
60合作767
61改善754
62委员会741
63736
64733
65729
66资助729
67建议727
68712
69长者712
70国际709
71703
72中心685
73积极684
74680
75671
76668
77特区667
78661
79654
80646
81643
82金融641
83641
84提升640
85638
86未来638
87可以632
88文化630
89629
90家庭624
91地区620
92国家620
93619
94今年613
95603
96同时601
97推行598
98596
99595
100医疗590

N Word Freq.
1THE37,091
2AND20,046
3TO18,499
4OF16,115
5IN11,292
6WILL8,362
7A7,790
8FOR7,329
9WE6,655
10HONG4,572
11OUR4,567
12KONG4,515
13GOVERNMENT4,053
14WITH4,041
15IS3,646
16AS3,517
17ON3,509
18#3,473
19DEVELOPMENT3,065
20HAVE2,993
21THAT2,651
22THIS2,447
23BE2,439
24BY2,355
25HAS2,347
26ARE2,113
27S2,047
28YEAR1,933
29PUBLIC1,867
30SERVICES1,841
31FROM1,764
32I1,764
33PEOPLE1,692
34MORE1,690
35THEIR1,643
36NEW1,631
37ALSO1,629
38AN1,620
39COMMUNITY1,410
40UP1,353
41SUPPORT1,322
42AT1,312
43PROVIDE1,283
44ECONOMIC1,260
45IT1,142
46EDUCATION1,139
47YEARS1,135
48MAINLAND1,090
49CAN1,040
50FINANCIAL1,035
51SCHEME1,031
52THESE1,027
53BEEN931
54WHICH926
55ABOUT896
56ALL879
57LAND874
58CARE872
59SOCIAL846
60WORK837
61CONTINUE831
62NOT824
63OR824
64MEASURES817
65ENHANCE797
66HOUSING796
67ELDERLY774
68POLICY771
69ITS770
70TWO765
71OVER763
72QUALITY756
73INTO731
74PROMOTE725
75MARKET703
76UNDER703
77OTHER701
78NEED679
79THROUGH673
80SHOULD671
81SERVICE667
82PROJECTS661
83SUCH657
84BUSINESS652
85HELP648
86FURTHER628
87SET625
88ECONOMY624
89SYSTEM622
90NEXT616
91LAST612
92USE598
93COUNCIL597
94INTERNATIONAL587
95IMPROVE577
96INDUSTRIES576
97ENVIRONMENT575
98AREAS570
99OPPORTUNITIES569
100MUST561

N Word Freq.
14,001
23,660
32,916
42,504
52,268
62,160
71,921
8我哋1,780
91,621
101,520
111,481
12我們1,365
131,348
141,119
15944
16香港930
17875
18850
19施政848
20報告847
21可以769
22703
23問題701
24政府694
25671
26654
27648
28640
29613
30呢個609
31方面545
32529
33470
34465
35大家445
36其實444
37438
38需要435
39就係429
40同埋416
41市民415
42能夠406
43401
44政策398
45所以365
46353
47發展347
48345
49即係335
50已經324
51這個323
52工作316
53社會314
54311
55307
56302
57299
58如果294
59希望292
60291
61286
62267
63宜家267
64266
65264
66一啲263
67259
68覺得256
69254
70251
71246
72241
73239
74裏面238
75呢啲231
76提出229
77226
78一些226
79但係224
80224
81自己224
82221
83重要220
84包括218
85經濟217
86房屋215
87一定215
88212
89他們204
90這些202
91好多201
92201
93197
94190
95措施189
96185
97金融185
98相信185
99#182
100或者177

N Word Freq.
18,494
23,053
3我们2,902
42,220
51,820
61,770
71,769
81,695
91,487
101,245
111,191
121,064
13945
14这个945
15香港857
16841
17815
18报告809
19问题792
20施政783
21763
22752
23698
24646
25可以640
26政府633
27620
28607
29方面571
30487
31他们481
32就是472
33一些458
34413
35需要406
36市民404
37396
38发展388
39政策366
40社会365
41所以361
42工作336
43很多334
44329
45大家315
46经济314
47311
48这些302
49299
50已经299
51299
52没有298
53希望298
54297
55现在297
56282
57280
58甚么271
59能够267
60263
61如果261
62259
63但是257
64253
65253
66244
67其实244
68觉得237
69234
70时候223
71221
72还有220
73措施208
74时间208
75提出205
76203
77200
78重要198
79196
80#195
81这样193
82187
83186
84186
85包括185
86185
87应该184
88房屋183
89因为183
90181
91里面181
92一定180
93金融178
94177
95市场172
96其他171
97169
98167
99比较163
100土地163

N Word Freq.
1THE4,932
2ER3,701
3AND2,383
4TO2,329
5WE1,902
6IN1,777
7OF1,752
8UM1,512
9THAT1,415
10I1,386
11IS1,174
12A1,172
13HAVE1,100
14YOU989
15FOR824
16IT815
17WILL790
18BE770
19ARE755
20THIS749
21ON612
22SO600
23NOT583
24S574
25POLICY522
26THERE502
27HONG492
28KONG485
29DO445
30AS444
31WITH418
32ADDRESS405
33BUT394
34NOW371
35ALSO365
36CAN360
37OUR352
38PEOPLE342
39THEY339
40WHAT324
41WOULD305
42FROM283
43T262
44ALL260
45ABOUT258
46MY258
47N257
48GOVERNMENT245
49YOUR232
50WELL225
51OR219
52THESE215
53ONE199
54THEN198
55SAID190
56TIME189
57MORE188
58PUBLIC188
59IF187
60SOME187
61VERY183
62BEEN182
63SHOULD180
64AT174
65THINK173
66DEVELOPMENT171
67BY170
68HOUSING168
69HAS164
70NEW163
71AN162
72YEAR158
73NEED155
74TERM154
75M150
76WHICH150
77LIKE148
78FINANCIAL146
79WHEN144
80JUST143
81IMPORTANT142
82HOW138
83YEARS138
84CHIEF135
85ANY134
86TWO134
87COMMUNITY133
88ECONOMIC132
89VE130
90EXECUTIVE128
91GOING128
92THEM128
93THEIR124
94UP124
95BECAUSE120
96MR120
97WORK117
98MARKET113
99WANT113
100MAKE111

N Word Freq.
114,216
2香港4,192
33,647
43,556
5政府3,365
63,305
73,052
82,857
9我們2,810
102,608
112,606
122,472
13#2,407
14經濟2,268
15發展2,097
16同埋2,093
172,045
182,020
191,937
201,885
211,659
22我哋1,644
231,624
24服務1,584
25提供1,535
261,503
271,495
28增加1,471
291,422
301,391
31措施1,375
321,334
331,302
341,240
35計劃1,150
36市場1,138
37金融1,116
38方面1,069
39收入1,046
40開支1,039
41市民1,021
42去年949
43以及945
44可以942
45本地940
46940
47929
48929
49920
50社會916
51912
52財政880
53建議872
54868
55需要842
56包括814
57799
58增長793
59792
60企業790
61779
62773
63764
64工作760
65741
66737
67730
68730
69713
70基金712
71702
72人士682
73改善673
74研究649
75632
76625
77616
78今年615
79已經611
80596
81594
82內地591
83588
84提升587
85582
86推出577
87565
88預計562
89支援560
90557
91中心556
92555
93投資551
94548
95547
96年度533
97推動533
98532
99國際528
100加強527

N Word Freq.
119,306
24,772
34,736
43,748
5香港3,605
6我们3,490
7政府2,877
82,755
9经济2,715
102,404
11发展2,361
122,250
132,229
141,827
151,789
16#1,779
171,598
181,557
19计划1,430
20服务1,341
21提供1,322
22开支1,315
23增加1,268
24市场1,206
251,136
26措施1,127
271,118
28社会1,084
291,028
301,021
31财政1,005
32金融922
33906
34876
35以及872
36867
37867
38方面861
39建议819
40去年819
41本地817
42市民808
43800
44798
45收入787
46782
47企业769
48766
49759
50733
51727
52可以726
53720
54720
55增长717
56包括713
57705
58内地693
59需要693
60继续677
61666
62预计664
63加强653
64推动637
65国际616
66工作611
67基金605
68602
69601
70投资596
71584
72未来573
73就业570
74569
75推出554
76549
77提升548
78改善547
79研究546
80超过541
81528
82今年526
83人士526
84523
85资助523
86521
87支援519
88有关513
89502
90项目487
91486
92中心486
93485
94481
95已经472
96这些467
97465
98464
99进一步461
100环境454

N Word Freq.
1THE33,049
2OF16,463
3TO16,456
4AND15,673
5IN11,548
6#7,712
7A7,189
8FOR7,138
9WILL6,307
10WE4,723
11OUR4,097
12HONG3,991
13KONG3,960
14IS3,459
15WITH3,387
16AS3,231
17I3,148
18ON3,079
19THIS3,049
20THAT3,017
21GOVERNMENT2,852
22HAVE2,639
23YEAR2,491
24BY2,407
25BE2,351
26PER2,262
27CENT2,082
28FROM2,041
29DEVELOPMENT1,969
30S1,951
31HAS1,908
32ARE1,906
33AN1,899
34ECONOMIC1,803
35FINANCIAL1,678
36MORE1,601
37SERVICES1,538
38EXPENDITURE1,453
39ALSO1,436
40THEIR1,323
41AT1,305
42TAX1,293
43DOLLARS1,266
44BILLION1,258
45OVER1,238
46PUBLIC1,237
47YEARS1,229
48MARKET1,198
49NEW1,160
50ECONOMY1,132
51THESE1,117
52IT1,100
53MEASURES1,089
54PEOPLE1,064
55INCREASE1,055
56UP1,026
57LAST1,016
58GROWTH1,014
59PROVIDE941
60OR939
61MAINLAND924
62NOT923
63WHICH900
64BUSINESS897
65ABOUT885
66SUPPORT866
67COMMUNITY858
68REVENUE852
69CAN840
70INDUSTRY835
71BEEN819
72THAN778
73ITS765
74SCHEME761
75INTO759
76CONTINUE727
77FURTHER715
78FISCAL686
79MILLION671
80OTHER663
81SHALL661
82INTERNATIONAL659
83ADDITIONAL654
84RATE654
85TWO654
86ALL652
87SUCH636
88HELP626
89SHOULD613
90INDUSTRIES612
91FUND609
92UNDER598
93TOTAL594
94LAND589
95THROUGH572
96CARE562
97FIRST556
98ENHANCE552
99ENTERPRISES548
100PROJECTS545

N Word Freq.
16,805
26,172
35,261
44,279
54,084
62,986
72,759
82,367
92,366
102,321
112,095
122,044
131,757
141,623
151,466
16可以1,432
171,307
181,285
191,146
201,129
211,063
221,053
23方面1,024
241,022
25967
26其實962
27956
28911
29893
30892
31867
32所以847
33我哋780
34758
35750
36742
37724
38措施683
39666
40政府650
41646
42645
43608
44#602
45586
46如果569
47547
48一個535
49530
50521
51大家519
52因為488
53483
54香港478
55同埋452
56434
57今年428
58421
59416
60404
61395
62393
63386
64需要372
65市民370
66364
67問題363
68收入355
69或者353
70希望352
71346
72宜家345
73司長341
74經濟338
75336
76332
77好多329
78覺得322
79預算案317
80一些316
81315
82311
83增加306
84302
85296
86290
87285
88CODE283
89SWITCH283
90很多283
91282
92279
93可能273
94相信273
95273
96271
97另外267
98這個262
99260
100259

N Word Freq.
18,732
23,955
3我们3,375
42,224
52,193
62,064
71,832
81,784
91,589
101,570
111,363
121,289
131,221
141,128
151,088
16这个1,062
17991
18可以960
19方面905
20851
21740
22措施703
23695
24所以656
25648
26642
27预算案625
28经济601
29583
30一些525
31523
32政府514
33问题506
34其实504
35他们488
36很多460
37445
38如果442
39425
40因为420
41没有406
42香港393
43383
44379
45379
46#374
47374
48就是372
49372
50司长371
51367
52今年360
53现在357
54350
55350
56财政346
57346
58大家342
59市民336
60328
61觉得323
62希望321
63增加320
64318
65312
66但是308
67已经306
68开支301
69需要297
70比较286
71收入285
72什么282
73是否269
74未来257
75256
76这些247
77245
78还有244
79242
80情况240
81这样240
82时候239
83236
84工作234
85222
86刚才220
87220
88219
89那么217
90推出217
91207
92206
93205
94198
95可能197
96197
97去年195
98社会189
99相当188
100187

N Word Freq.
1THE6,593
2TO3,263
3UM3,032
4AND2,606
5WE2,277
6OF2,227
7THAT2,167
8IN2,108
9A1,841
10YOU1,825
11I1,719
12IS1,613
13HAVE1,525
14ARE1,155
15FOR1,143
16IT1,137
17BE1,086
18THIS1,076
19SO1,041
20WILL962
21NOT815
22ER782
23S675
24DO645
25OUR644
26AS611
27BUDGET599
28ON581
29WOULD542
30TAX540
31THERE536
32YEAR512
33BUT476
34#475
35NOW444
36CAN435
37ABOUT430
38AT425
39WITH422
40N411
41T411
42THINK401
43WELL393
44FROM392
45MEASURES389
46PEOPLE386
47DOLLARS380
48BILLION377
49WHAT376
50BY370
51THEY367
52ALSO362
53IF348
54OR342
55MORE332
56HONG324
57KONG323
58THEN323
59VERY320
60FINANCIAL310
61TIME304
62BECAUSE300
63SOME293
64EXPENDITURE286
65HAS283
66GOVERNMENT276
67SAID263
68YOUR261
69ALL256
70ONE249
71WAS249
72PERCENT243
73OUT242
74AN234
75YEARS229
76QUESTION226
77UP226
78THESE225
79ECONOMIC221
80WHEN217
81ECONOMY210
82HOW208
83SECRETARY206
84NEED202
85MY201
86JUST199
87M196
88SHOULD195
89LIKE188
90LAST187
91REVENUE185
92GOING184
93PUBLIC184
94WHY183
95BEEN181
96ANY176
97FIRST176
98SAY176
99MANY172
100US171

N Word Freq.
112,110
29,571
3发展4,533
4经济3,040
5建设3,014
62,812
7社会2,480
8#2,435
9加强2,425
10改革2,382
112,010
12推进1,887
13企业1,766
14工作1,700
151,692
161,622
171,507
18提高1,460
19政策1,440
20制度1,379
21农村1,349
22加快1,331
23政府1,322
241,308
251,304
261,294
27继续1,219
281,125
29教育1,116
30市场1,115
31实施1,108
32我们1,108
33人民1,102
34促进1,097
35积极1,087
36完善1,082
37坚持1,072
38增长1,072
39管理1,047
40国家1,046
41全面1,008
42财政980
43增加957
44944
45支持934
46扩大932
47服务930
48重点916
49914
50基本901
51投资892
52863
53保障861
54体制857
55结构832
56问题831
57829
58生产825
59地区817
60产业796
61文化792
62稳定791
63就业789
64农业787
65创新784
66761
67体系739
68金融736
69基础732
70技术729
71国际726
72群众715
73进一步711
74705
75安全697
76中央687
77深化686
78机制684
79672
80666
81环境663
82今年661
83收入653
84事业647
85水平646
86重大646
87调整641
88638
89634
90大力632
91保护631
92617
93科技615
94重要613
95建立602
96599
97实现597
98能力593
99合作590
100推动590

N Word Freq.
1AND31,201
2THE26,505
3OF17,418
4TO13,787
5WE11,901
6IN9,742
7WILL8,065
8FOR5,766
9A4,873
10DEVELOPMENT3,662
11ON2,878
12#2,747
13WITH2,647
14PEOPLE2,164
15S2,146
16THAT2,144
17SYSTEM2,133
18ECONOMIC2,060
19GOVERNMENT2,058
20IMPROVE2,024
21OUR1,982
22BE1,944
23WORK1,901
24REFORM1,890
25ALL1,856
26RURAL1,852
27SHOULD1,811
28IS1,721
29AS1,678
30BY1,635
31AREAS1,605
32CHINA1,599
33MORE1,572
34THEIR1,568
35ARE1,465
36NEW1,461
37SOCIAL1,418
38INCREASE1,275
39THIS1,272
40YEAR1,236
41ENTERPRISES1,218
42MUST1,172
43UP1,143
44FROM1,141
45AN1,139
46WAS1,116
47HAVE1,110
48PUBLIC1,096
49PROMOTE1,077
50STRENGTHEN1,052
51MAKE1,050
52NEED1,036
53AT1,023
54CONTINUE1,022
55URBAN991
56DEVELOP960
57MADE934
58CENTRAL919
59SUPPORT919
60ENSURE899
61EDUCATION896
62MAJOR893
63INVESTMENT871
64MARKET871
65EFFORTS858
66GROWTH851
67WERE830
68OUT829
69SERVICES811
70YUAN811
71NATIONAL784
72IMPLEMENT778
73FINANCIAL772
74BASIC767
75POLICY767
76PROJECTS756
77LAW755
78CHINESE744
79OTHER739
80POLICIES731
81INDUSTRIES730
82ECONOMY726
83PROGRESS678
84OVER675
85PRODUCTION669
86MANAGEMENT654
87IT652
88OR596
89STATE595
90MEASURES584
91CULTURAL572
92ENERGY572
93AGRICULTURAL569
94ENCOURAGE567
95KEY567
96USE558
97PROBLEMS556
98COUNTRY554
99ITS553
100HAS549

N Word Freq.
114,441
23,956
33,211
4我们2,972
52,532
6中国2,357
72,290
82,253
92,187
101,826
111,795
121,622
13问题1,619
141,590
151,504
161,465
171,435
18经济1,163
191,117
20发展1,034
21政府1,030
22985
23965
24928
25这个900
26#835
27谢谢822
28815
29总理780
30779
31改革748
32684
33664
34647
35643
36643
37关系627
38但是620
39国家612
40香港611
41人民597
42记者580
43565
44551
45现在546
46已经544
47ER539
48535
49526
50524
51504
52可以502
53台湾481
54469
55就是469
56金融441
57441
58社会430
59424
60企业423
61RAISE411
62VOICE411
63什么403
64市场401
65399
66工作390
67今年380
68一些379
69没有376
70363
71358
72357
73357
74政策351
75世界350
76进行346
77343
78343
79认为340
80解决337
81方面335
82大家333
83美国332
84能够326
85去年326
86322
87使322
88国际321
89而且320
90320
91314
92314
93309
94推进309
95合作307
96303
97希望297
98中央290
99还是289
100他们282

N Word Freq.
1THE17,052
2AND9,107
3TO8,063
4OF7,848
5IN5,623
6WE4,036
7THAT3,795
8A3,774
9IS3,401
10I3,142
11WILL2,669
12CHINA2,489
13HAVE2,368
14THIS2,346
15FOR2,204
16YOU1,884
17S1,859
18ON1,716
19BE1,633
20ARE1,584
21PEOPLE1,563
22WITH1,473
23AS1,458
24GOVERNMENT1,423
25OUR1,413
26HAS1,376
27IT1,373
28ER1,251
29ALSO1,088
30NOT1,051
31YEAR925
32WHAT896
33SO889
34#876
35BY847
36CHINESE829
37CAN818
38AT814
39DEVELOPMENT813
40THERE810
41ALL803
42FROM795
43KONG757
44HONG750
45REFORM740
46TWO728
47BUT727
48DO726
49TAIWAN723
50ECONOMIC717
51ABOUT675
52BEEN663
53ONE663
54WOULD635
55MORE619
56MY618
57SOME613
58THEIR589
59VERY577
60NEED573
61YOUR572
62AN560
63BETWEEN558
64QUESTION557
65THEY545
66COUNTRY521
67YEARS518
68TIME515
69US511
70WORK510
71COUNTRIES509
72SHOULD506
73OR502
74THESE497
75ECONOMY481
76ITS475
77GROWTH463
78WAS452
79MAKE433
80LIKE432
81NEW427
82NOW426
83MARKET424
84LAST417
85FINANCIAL412
86MUST412
87SUCH399
88TAKE393
89UP388
90THINK385
91PREMIER384
92BELIEVE366
93WORLD363
94NIL362
95THANK358
96ME357
97IF355
98SYSTEM354
99OTHER353
100STILL351

Table 8. The 100 Most Frequent Words in the Subset of US State of the Union Addresses

N Word Freq.
1THE12,441
2AND10,176
3TO9,625
4OF7,142
5WE5,903
6A5,136
7IN4,832
8OUR4,717
9THAT4,337
10FOR2,873
11I2,692
12IS2,679
13WILL2,293
14S2,159
15THIS2,077
16IT1,965
17HAVE1,875
18APPLAUSE1,861
19ARE1,727
20ON1,658
21WITH1,627
22YOU1,584
23INTERRUPTION1,410
24AMERICA1,392
25MORE1,355
26CHEERS1,339
27NOT1,332
28THEIR1,264
29THEY1,202
30ALL1,190
31BY1,187
32BE1,144
33AS1,130
34CAN1,117
35BUT1,102
36FROM1,038
37PEOPLE1,034
38NEW1,031
39DO1,015
40#990
41WHO939
42MUST920
43SO919
44US885
45OR852
46NOW850
47HAS832
48AMERICAN786
49AT760
50YEARS753
51EVERY739
52WORLD736
53AMERICANS725
54N709
55T697
56YEAR682
57MAKE667
58WORK665
59AN651
60ONE635
61THAN635
62THEM623
63THESE622
64SHOULD607
65HELP592
66CONGRESS583
67COUNTRY575
68TONIGHT572
69VE563
70WHAT560
71WHEN555
72IF529
73NATION505
74JOBS499
75MY496
76NO492
77TIME490
78LET486
79KNOW485
80BECAUSE481
81NEED478
82CHILDREN470
83SECURITY470
84ECONOMY463
85ALSO457
86RE453
87UP444
88LAST438
89WAS426
90JUST417
91TAX413
92THERE411
93GOVERNMENT398
94THOSE398
95LIKE388
96FIRST386
97HEALTH386
98OVER382
99CARE378
100ASK376

Table 9. The 100 Most Frequent Words in the Subset of Press Conferences of US State of the Union Addresses

N Word Freq.
1THE15,801
2TO9,141
3THAT9,102
4AND7,122
5OF6,288
6A4,740
7I4,619
8IN4,610
9IS3,838
10S3,683
11ER3,654
12WE3,543
13IT3,290
14YOU3,145
15PRESIDENT2,926
16ON2,601
17THIS2,325
18HAVE2,169
19HE1,956
20FOR1,913
21ARE1,653
22WITH1,647
23BE1,626
24ABOUT1,604
25THINK1,596
26DO1,551
27AS1,498
28WHAT1,495
29NOT1,467
30N1,436
31WILL1,427
32T1,422
33BUT1,388
34THERE1,331
35THEY1,310
36WAS1,227
37HAS1,190
38SO1,107
39WOULD1,060
40AT991
41GOING987
42OUR981
43OR968
44CAN952
45RE944
46WELL883
47AN866
48HIS774
49PEOPLE768
50IF721
51KNOW711
52UM699
53ONE677
54SOME676
55BEEN673
56JUST652
57FROM646
58THEIR634
59OUT622
60ALL620
61MORE619
62BY611
63THOSE601
64SAID591
65ANY541
66THESE535
67NO528
68WHO505
69WHEN489
70GET487
71VE487
72HOW481
73MAKE478
74BECAUSE473
75M469
76DOES468
77WERE467
78VERY460
79ALSO449
80AGAIN443
81OTHER441
82#440
83UP440
84NOW436
85HOUSE423
86HAD422
87LL409
88LIKE406
89SAY399
90QUESTION389
91WAY385
92DID380
93LAST380
94STATES378
95GO377
96TAKE371
97IMPORTANT370
98FORWARD362
99CONGRESS358
100THEM358

Table 10. The 100 Most Frequent Words in the Subset of US Budget Speeches

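N | Word | Freq.
1 | THE | 3,454
2 | TO | 2,996
3 | AND | 2,254
4 | THAT | 1,703
5 | WE | 1,686
6 | OF | 1,669
7 | A | 1,418
8 | IN | 1,313
9 | I | 1,101
10 | IT | 962
11 | IS | 881
12 | OUR | 881
13 | S | 867
14 | FOR | 808
15 | YOU | 715
16 | THIS | 535
17 | BUDGET | 520
18 | ARE | 493
19 | HAVE | 491
20 | ON | 464
21 | DO | 434
22 | AS | 405
23 | RE | 399
24 | BE | 380
25 | WILL | 368
26 | WITH | 354
27 | THEY | 342
28 | NOT | 325
29 | N | 322
30 | T | 322
31 | CAN | 319
32 | BUT | 306
33 | SO | 304
34 | # | 303
35 | VE | 298
36 | BY | 297
37 | MORE | 292
38 | PEOPLE | 288
39 | MAKE | 286
40 | ABOUT | 280
41 | WHAT | 266
42 | GOING | 253
43 | AMERICA | 245
44 | IF | 240
45 | AT | 238
46 | ALL | 228
47 | THEIR | 228
48 | WHO | 218
49 | THERE | 217
50 | US | 205
51 | NEW | 200
52 | NOW | 198
53 | ECONOMY | 194
54 | WHEN | 190
55 | AN | 188
56 | FROM | 188
57 | ONE | 185
58 | APPLAUSE | 182
59 | WANT | 182
60 | OR | 173
61 | SECURITY | 173
62 | NEED | 172
63 | TAX | 172
64 | CONGRESS | 170
65 | GOT | 170
66 | YEARS | 168
67 | HERE | 166
68 | ALSO | 162
69 | MY | 162
70 | UP | 162
71 | HAS | 160
72 | M | 155
73 | THANK | 153
74 | SURE | 152
75 | THEM | 152
76 | YOUR | 150
77 | GET | 144
78 | ER | 143
79 | AMERICAN | 142
80 | BEEN | 142
81 | SOME | 141
82 | WORK | 140
83 | JOBS | 138
84 | SPENDING | 138
85 | WHY | 137
86 | THESE | 136
87 | MONEY | 134
88 | COUNTRY | 132
89 | KEEP | 132
90 | WAY | 132
91 | KNOW | 131
92 | JUST | 130
93 | LIKE | 130
94 | WELL | 130
95 | OUT | 129
96 | BECAUSE | 127
97 | GOVERNMENT | 126
98 | WAS | 126
99 | TIME | 124
100 | WHICH | 122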

Table 11. The 100 Most Frequent Words in the Subset of Press Conferences of US Budget Speeches

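N | Word | Freq.
1 | THE | 19,885
2 | THAT | 10,029
3 | TO | 9,722
4 | AND | 8,374
5 | OF | 8,277
6 | IN | 7,435
7 | WE | 6,505
8 | A | 6,276
9 | IS | 5,065
10 | YOU | 4,049
11 | IT | 4,019
12 | S | 3,732
13 | I | 3,582
14 | FOR | 3,319
15 | ON | 2,868
16 | THIS | 2,690
17 | ARE | 2,628
18 | ER | 2,623
19 | BUDGET | 2,518
20 | # | 2,334
21 | BE | 2,297
22 | HAVE | 2,277
23 | AS | 2,074
24 | SO | 1,850
25 | DO | 1,749
26 | OUR | 1,748
27 | WHAT | 1,736
28 | WITH | 1,691
29 | THERE | 1,672
30 | NOT | 1,610
31 | BUT | 1,563
32 | WILL | 1,475
33 | RE | 1,424
34 | ABOUT | 1,423
35 | PRESIDENT | 1,417
36 | THINK | 1,415
37 | AT | 1,408
38 | TAX | 1,359
39 | WOULD | 1,326
40 | YEAR | 1,316
41 | PERCENT | 1,253
42 | T | 1,229
43 | N | 1,215
44 | WAS | 1,104
45 | IF | 1,088
46 | UM | 1,083
47 | HAS | 1,045
48 | FROM | 1,024
49 | THEY | 1,024
50 | OR | 1,013
51 | BY | 1,006
52 | MORE | 992
53 | CAN | 973
54 | WHICH | 894
55 | SPENDING | 889
56 | THOSE | 889
57 | JUST | 881
58 | ONE | 870
59 | OVER | 858
60 | YEARS | 843
61 | AN | 834
62 | SOME | 830
63 | DEFICIT | 829
64 | ALL | 793
65 | VE | 782
66 | VERY | 777
67 | GROWTH | 773
68 | GOING | 765
69 | BEEN | 715
70 | DOLLARS | 709
71 | ALSO | 695
72 | OUT | 662
73 | LAST | 660
74 | WELL | 654
75 | ECONOMY | 643
76 | HOW | 611
77 | GET | 604
78 | THAN | 596
79 | BILLION | 592
80 | KNOW | 590
81 | NOW | 577
82 | UP | 577
83 | WHEN | 577
84 | PEOPLE | 575
85 | CONGRESS | 567
86 | BECAUSE | 564
87 | SECURITY | 557
88 | THESE | 554
89 | WERE | 540
90 | HE | 521
91 | WHO | 517
92 | OTHER | 512
93 | PROGRAMS | 506
94 | TIME | 504
95 | DOES | 502
96 | MAKE | 495
97 | FIRST | 490
98 | WHERE | 476
99 | ECONOMIC | 470
100 | SEE | 451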

Table 12. The 100 Most Frequent Words in the Subset of UK State Opening Addresses of Parliament

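N | Word | Freq.
1 | THE | 1,964
2 | TO | 1,865
3 | AND | 1,393
4 | WILL | 1,368
5 | OF | 1,197
6 | MY | 776
7 | GOVERNMENT | 704
8 | A | 628
9 | BE | 595
10 | IN | 536
11 | FOR | 454
12 | LEGISLATION | 281
13 | FORWARD | 280
14 | BILL | 231
15 | ON | 228
16 | CONTINUE | 221
17 | WORK | 214
18 | INTRODUCED | 210
19 | S | 173
20 | WITH | 170
21 | THAT | 169
22 | REFORM | 166
23 | HOUSE | 150
24 | PUBLIC | 147
25 | NEW | 142
26 | MEMBERS | 133
27 | BROUGHT | 127
28 | COMMONS | 127
29 | MEASURES | 121
30 | PEOPLE | 121
31 | ENSURE | 117
32 | UNITED | 113
33 | IMPROVE | 111
34 | ALSO | 107
35 | NATIONAL | 106
36 | BY | 104
37 | MORE | 104
38 | LORDS | 100
39 | SERVICES | 100
40 | SUPPORT | 100
41 | IS | 99
42 | SECURITY | 91
43 | INTRODUCE | 89
44 | SYSTEM | 89
45 | THEY | 82
46 | KINGDOM | 81
47 | ECONOMIC | 79
48 | THEIR | 79
49 | BRING | 72
50 | HELP | 71
51 | HEALTH | 70
52 | IT | 70
53 | PROVIDE | 69
54 | ITS | 68
55 | REDUCE | 68
56 | PROMOTE | 67
57 | # | 64
58 | BEFORE | 64
59 | FROM | 64
60 | INCLUDING | 63
61 | OUR | 63
62 | ARE | 61
63 | EUROPEAN | 61
64 | I | 61
65 | SERVICE | 61
66 | LAID | 60
67 | YOU | 60
68 | POWERS | 59
69 | THIS | 59
70 | AN | 58
71 | AS | 58
72 | INTERNATIONAL | 57
73 | STATE | 57
74 | TACKLE | 57
75 | AT | 56
76 | MAKE | 54
77 | MINISTERS | 54
78 | UNION | 54
79 | CRIME | 53
80 | ALL | 52
81 | DRAFT | 52
82 | FURTHER | 52
83 | OTHER | 52
84 | ECONOMY | 51
85 | LOOK | 51
86 | PROPOSALS | 51
87 | STRENGTHEN | 51
88 | TAKE | 51
89 | COMMITTED | 50
90 | LAW | 50
91 | SECURE | 50
92 | CREATE | 47
93 | WALES | 45
94 | CHILDREN | 44
95 | EDUCATION | 44
96 | VISIT | 44
97 | DEVELOPMENT | 42
98 | FINANCIAL | 42
99 | GREATER | 41
100 | PUBLISHED | 41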

Table 13. The 100 Most Frequent Words in the Subset of Debates on UK State Opening Addresses of Parliament

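N | Word | Freq.
1 | THE | 3,332
2 | TO | 1,754
3 | OF | 1,660
4 | AND | 1,468
5 | IN | 1,124
6 | I | 1,095
7 | A | 1,043
8 | THAT | 1,010
9 | IS | 668
10 | FOR | 594
11 | IT | 483
12 | WE | 481
13 | MY | 473
14 | WAS | 427
15 | AS | 423
16 | ON | 410
17 | BE | 409
18 | THIS | 396
19 | HAVE | 374
20 | OUR | 340
21 | NOT | 333
22 | ARE | 317
23 | WITH | 294
24 | BUT | 284
25 | HE | 277
26 | S | 266
27 | # | 246
28 | WILL | 246
29 | HAS | 242
30 | HOUSE | 240
31 | ALL | 218
32 | WHO | 214
33 | SPEECH | 205
34 | INTERRUPTION | 201
35 | WHICH | 200
36 | AT | 198
37 | BY | 198
38 | YOUR | 198
39 | AN | 195
40 | ONE | 188
41 | GRACIOUS | 185
42 | MAJESTY | 185
43 | MOST | 183
44 | THEY | 175
45 | FROM | 165
46 | HIS | 165
47 | SO | 164
48 | HEAR | 161
49 | THERE | 161
50 | CAN | 157
51 | MORE | 155
52 | PARLIAMENT | 149
53 | HER | 144
54 | BEEN | 143
55 | WHEN | 141
56 | PEOPLE | 140
57 | DO | 139
58 | WHAT | 139
59 | ME | 135
60 | GOVERNMENT | 132
61 | YOU | 132
62 | AM | 125
63 | LORDS | 125
64 | THEIR | 125
65 | HAD | 117
66 | NOBLE | 115
67 | YEARS | 115
68 | FIRST | 113
69 | ADDRESS | 112
70 | ABOUT | 109
71 | US | 108
72 | THOSE | 105
73 | NOW | 104
74 | OR | 104
75 | GREAT | 100
76 | FRIEND | 99
77 | MAY | 99
78 | TIME | 99
79 | WOULD | 99
80 | ONLY | 95
81 | LAUGH | 94
82 | NO | 93
83 | LORD | 91
84 | BOTH | 90
85 | SHOULD | 88
86 | KNOW | 87
87 | SHE | 87
88 | SAY | 85
89 | MR | 83
90 | BEG | 80
91 | MANY | 79
92 | RIGHT | 79
93 | IF | 78
94 | LOYAL | 78
95 | OTHER | 78
96 | VERY | 76
97 | MUST | 75
98 | THEM | 75
99 | LIKE | 74
100 | MEMBER | 74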

Table 14. The 100 Most Frequent Words in the Subset of UK Budget Speeches

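N | Word | Freq.
1 | THE | 27,247
2 | TO | 16,377
3 | AND | 14,716
4 | OF | 11,902
5 | IN | 10,286
6 | # | 9,572
7 | A | 7,959
8 | WE | 7,824
9 | FOR | 7,080
10 | THAT | 6,826
11 | WILL | 6,752
12 | IS | 4,877
13 | I | 4,840
14 | OUR | 4,291
15 | THIS | 4,203
16 | ON | 3,472
17 | YEAR | 3,266
18 | HAVE | 3,065
19 | BE | 3,040
20 | IT | 2,946
21 | BY | 2,818
22 | TAX | 2,763
23 | ARE | 2,751
24 | PER | 2,728
25 | WITH | 2,615
26 | CENT | 2,545
27 | FROM | 2,516
28 | Â | 2,423
29 | AS | 2,247
30 | NEW | 2,105
31 | HEAR | 2,033
32 | S | 2,008
33 | CAN | 2,000
34 | TODAY | 1,985
35 | MORE | 1,977
36 | BUT | 1,863
37 | AT | 1,836
38 | SO | 1,770
39 | POUNDS | 1,628
40 | NOT | 1,571
41 | NOW | 1,527
42 | BRITAIN | 1,478
43 | NEXT | 1,464
44 | HAS | 1,458
45 | THEIR | 1,454
46 | INTERRUPTION | 1,430
47 | MR | 1,405
48 | ALSO | 1,395
49 | SPEAKER | 1,359
50 | THAN | 1,354
51 | PEOPLE | 1,353
52 | ALL | 1,349
53 | GOVERNMENT | 1,329
54 | YEARS | 1,237
55 | ECONOMY | 1,225
56 | OVER | 1,217
57 | BUDGET | 1,189
58 | BILLION | 1,163
59 | HELP | 1,154
60 | THEY | 1,129
61 | AN | 1,123
62 | WORK | 1,038
63 | SUPPORT | 1,035
64 | INVESTMENT | 1,025
65 | DEPUTY | 1,022
66 | COUNTRY | 994
67 | WHICH | 963
68 | PUBLIC | 928
69 | THOSE | 909
70 | UP | 907
71 | GROWTH | 905
72 | MILLION | 894
73 | WHO | 880
74 | SPENDING | 878
75 | FIRST | 874
76 | ONE | 872
77 | BUSINESS | 862
78 | DO | 845
79 | LAST | 831
80 | RATE | 828
81 | WORLD | 800
82 | EVERY | 790
83 | FURTHER | 777
84 | OUT | 776
85 | BEEN | 762
86 | PAY | 755
87 | THERE | 746
88 | NATIONAL | 745
89 | MAKE | 724
90 | ECONOMIC | 719
91 | AM | 707
92 | INCOME | 703
93 | FORECAST | 702
94 | TIME | 682
95 | WOULD | 679
96 | FUTURE | 664
97 | THESE | 664
98 | BUSINESSES | 663
99 | DEBT | 661
100 | WAS | 647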

Table 15. The 100 Most Frequent Words in the Subset of Debates on UK Budget Speeches

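N | Word | Freq.
1 | THE | 27,280
2 | TO | 10,580
3 | THAT | 10,089
4 | OF | 9,164
5 | AND | 8,830
6 | IN | 7,124
7 | IS | 6,525
8 | A | 6,298
9 | HE | 4,719
10 | FOR | 4,276
11 | WE | 4,087
12 | IT | 3,895
13 | I | 3,824
14 | # | 3,697
15 | ON | 3,505
16 | NOT | 3,219
17 | HAVE | 3,107
18 | CHANCELLOR | 3,059
19 | WILL | 2,896
20 | ARE | 2,797
21 | S | 2,628
22 | BE | 2,469
23 | HAS | 2,453
24 | THIS | 2,446
25 | THEY | 2,149
26 | AS | 1,903
27 | BUT | 1,846
28 | TAX | 1,784
29 | GOVERNMENT | 1,757
30 | WITH | 1,701
31 | HIS | 1,693
32 | BY | 1,540
33 | WHAT | 1,482
34 | WAS | 1,455
35 | PEOPLE | 1,438
36 | MORE | 1,287
37 | AT | 1,280
38 | ABOUT | 1,269
39 | YEAR | 1,250
40 | HON | 1,223
41 | OUR | 1,213
42 | FROM | 1,162
43 | WHICH | 1,152
44 | WOULD | 1,147
45 | ALL | 1,108
46 | DO | 1,095
47 | CAN | 1,087
48 | BUDGET | 1,086
49 | THERE | 1,079
50 | BEEN | 1,034
51 | TODAY | 1,001
52 | SO | 998
53 | RIGHT | 992
54 | NOW | 977
55 | INTERRUPTION | 970
56 | THEIR | 937
57 | Â | 932
58 | WHO | 932
59 | THAN | 930
60 | MY | 913
61 | IF | 904
62 | US | 900
63 | AN | 877
64 | WHEN | 872
65 | ER | 857
66 | PER | 856
67 | YEARS | 836
68 | PUBLIC | 831
69 | HEAR | 817
70 | ONE | 795
71 | SAID | 791
72 | CENT | 788
73 | GROWTH | 786
74 | UP | 772
75 | MR | 770
76 | ECONOMY | 755
77 | OUT | 754
78 | SHOULD | 745
79 | THOSE | 726
80 | NO | 708
81 | COUNTRY | 701
82 | OR | 683
83 | BECAUSE | 676
84 | JUST | 645
85 | LABOUR | 634
86 | HAD | 622
87 | SPENDING | 619
88 | GENTLEMAN | 614
89 | OVER | 611
90 | DOES | 606
91 | SPEAKER | 600
92 | LAST | 599
93 | MINISTER | 576
94 | T | 570
95 | DID | 566
96 | THEM | 565
97 | FRIEND | 561
98 | MONEY | 561
99 | BILLION | 553
100 | WHY | 552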
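N | Word | Freq.
1 |  | 1,201
2 | 我们 | 476
3 |  | 294
4 |  | 283
5 |  | 217
6 |  | 214
7 |  | 201
8 |  | 198
9 |  | 196
10 | 中国 | 186
11 |  | 181
12 |  | 165
13 |  | 164
14 |  | 155
15 |  | 140
16 | 合作 | 138
17 | 美国 | 133
18 | 问题 | 112
19 | 关系 | 109
20 | 人民 | 106
21 |  | 89
22 |  | 86
23 |  | 84
24 |  | 81
25 |  | 80
26 | 国家 | 75
27 | 发展 | 73
28 | 共同 | 67
29 |  | 65
30 |  | 63
31 | 双方 | 61
32 | 进行 | 58
33 | 和平 | 56
34 |  | 54
35 | 总统 | 54
36 |  | 53
37 | 世界 | 53
38 |  | 52
39 | 努力 | 52
40 |  | 50
41 |  | 48
42 |  | 47
43 |  | 46
44 |  | 46
45 |  | 45
46 |  | 44
47 | 能够 | 43
48 | 重要 | 43
49 | 主席 | 43
50 | 达成 | 42
51 |  | 41
52 |  | 41
53 | 国际 | 40
54 | 奥巴马 | 39
55 | 今天 | 39
56 | 领域 | 39
57 |  | 39
58 | # | 38
59 |  | 38
60 | 解决 | 38
61 |  | 38
62 | 坚持 | 37
63 |  | 37
64 | 同意 | 37
65 | 这些 | 37
66 |  | 36
67 | 取得 | 36
68 | 认为 | 36
69 |  | 36
70 | 继续 | 35
71 | 应该 | 35
72 |  | 34
73 | 加强 | 34
74 | 通过 | 34
75 |  | 34
76 | 以及 | 34
77 | 访问 | 33
78 |  | 33
79 |  | 32
80 |  | 31
81 | 欢迎 | 31
82 |  | 31
83 | 网络 | 31
84 |  | 31
85 |  | 30
86 | 实现 | 30
87 | 讨论 | 30
88 | 一些 | 30
89 |  | 29
90 | 稳定 | 29
91 | 之间 | 29
92 | 变化 | 28
93 | 他们 | 28
94 | 已经 | 28
95 |  | 27
96 | 气候 | 27
97 | 相互 | 27
98 | 地区 | 26
99 | 分歧 | 26
100 | 夫人 | 26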
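N | Word | Freq.
1 | THE | 3,244
2 | AND | 2,697
3 | TO | 2,081
4 | OF | 1,496
5 | WE | 1,162
6 | THAT | 1,033
7 | IN | 1,014
8 | A | 837
9 | I | 666
10 | CHINA | 637
11 | OUR | 636
12 | S | 595
13 | IS | 585
14 | HAVE | 537
15 | ON | 520
16 | FOR | 410
17 | IT | 371
18 | ARE | 362
19 | WITH | 351
20 | AS | 346
21 | WILL | 331
22 | YOU | 330
23 | THIS | 329
24 | UNITED | 323
25 | STATES | 299
26 | PRESIDENT | 259
27 | TWO | 236
28 | COOPERATION | 232
29 | COUNTRIES | 213
30 | NOT | 208
31 | OTHER | 194
32 | WORK | 189
33 | CAN | 186
34 | ALL | 184
35 | PEOPLE | 183
36 | BE | 179
37 | SO | 173
38 | AT | 171
39 | WORLD | 167
40 | HAS | 163
41 | MORE | 161
42 | THERE | 160
43 | TOGETHER | 153
44 | OR | 152
45 | # | 151
46 | BETWEEN | 151
47 | BY | 144
48 | AN | 143
49 | ER | 140
50 | CHINESE | 139
51 | BUT | 138
52 | WHAT | 138
53 | ALSO | 136
54 | XI | 133
55 | ISSUES | 130
56 | NEW | 128
57 | FROM | 123
58 | RELATIONS | 121
59 | DO | 119
60 | THANK | 117
61 | ISSUE | 114
62 | SECURITY | 113
63 | U | 111
64 | BOTH | 109
65 | NUCLEAR | 109
66 | VE | 109
67 | THEY | 108
68 | IMPORTANT | 106
69 | BEEN | 105
70 | MAKE | 105
71 | INTERNATIONAL | 104
72 | MY | 104
73 | RELATIONSHIP | 103
74 | ABOUT | 102
75 | CHINA-U | 100
76 | THINK | 98
77 | VERY | 98
78 | RE | 94
79 | CONTINUE | 93
80 | DEVELOPMENT | 92
81 | PROGRESS | 92
82 | NATIONS | 90
83 | ONE | 90
84 | WHICH | 90
85 | ME | 89
86 | SHOULD | 89
87 | IF | 88
88 | SIDES | 88
89 | SOME | 87
90 | WHEN | 87
91 | AGREEMENT | 85
92 | GLOBAL | 85
93 | N | 85
94 | TODAY | 85
95 | NOW | 84
96 | AMERICAN | 83
97 | WANT | 82
98 | HAD | 81
99 | SAID | 81
100 | AGREED | 80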
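N | Word | Freq.
1 |  | 375
2 | 我们 | 126
3 |  | 118
4 |  | 116
5 | 中国 | 100
6 | 英国 | 92
7 |  | 89
8 |  | 87
9 | 关系 | 83
10 |  | 78
11 | # | 76
12 |  | 75
13 |  | 70
14 |  | 66
15 |  | 62
16 |  | 59
17 |  | 48
18 | 合作 | 47
19 |  | 46
20 |  | 42
21 | 发展 | 40
22 |  | 39
23 | 世界 | 35
24 |  | 34
25 |  | 31
26 | 共同 | 31
27 |  | 30
28 | 人民 | 30
29 |  | 29
30 | 访问 | 26
31 |  | 26
32 |  | 25
33 | 经济 | 24
34 | 今天 | 23
35 |  | 23
36 |  | 23
37 |  | 22
38 | 国家 | 22
39 | 贸易 | 21
40 | 问题 | 21
41 | 中英 | 21
42 |  | 20
43 |  | 20
44 |  | 20
45 |  | 20
46 | 作为 | 20
47 |  | 19
48 | 增长 | 19
49 | 双方 | 18
50 |  | 17
51 |  | 17
52 | 国际 | 16
53 | 进行 | 16
54 | 女王 | 16
55 | 首相 | 16
56 | 讨论 | 16
57 | 投资 | 16
58 | 重要 | 16
59 | 使 | 15
60 | 成为 | 14
61 | 伙伴 | 14
62 | 机遇 | 14
63 |  | 14
64 | 卡梅伦 | 14
65 | 政府 | 14
66 | 总理 | 14
67 | 第二 | 13
68 |  | 13
69 |  | 13
70 |  | 13
71 | 建立 | 13
72 | 能够 | 13
73 | 陛下 | 12
74 |  | 12
75 |  | 12
76 | 欢迎 | 12
77 | 可以 | 12
78 | 不仅 | 11
79 | 达成 | 11
80 | 各位 | 11
81 | 过去 | 11
82 | 联合国 | 11
83 | 认为 | 11
84 | 以来 | 11
85 | 之间 | 11
86 | 表示 | 10
87 |  | 10
88 | 方面 | 10
89 | 国事 | 10
90 |  | 10
91 | 今年 | 10
92 | 朋友们 | 10
93 | 全球 | 10
94 | 相互 | 10
95 | 已经 | 10
96 | 战略 | 10
97 |  | 10
98 | 尊敬 | 10
99 | 安理会 | 9
100 |  | 9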
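N | Word | Freq.
1 | THE | 563
2 | AND | 462
3 | TO | 294
4 | OF | 246
5 | WE | 182
6 | IN | 176
7 | A | 171
8 | CHINA | 145
9 | OUR | 138
10 | FOR | 98
11 | I | 87
12 | IS | 80
13 | # | 76
14 | THAT | 76
15 | HAVE | 72
16 | THIS | 72
17 | UK | 72
18 | S | 62
19 | ALSO | 60
20 | AS | 58
21 | ARE | 54
22 | ON | 54
23 | BETWEEN | 53
24 | COUNTRIES | 50
25 | RELATIONSHIP | 47
26 | CHINESE | 46
27 | IT | 44
28 | PEOPLE | 44
29 | WITH | 44
30 | MORE | 40
31 | BOTH | 38
32 | VISIT | 38
33 | WILL | 38
34 | WHICH | 36
35 | AN | 34
36 | BUT | 34
37 | WORLD | 34
38 | COOPERATION | 32
39 | NOT | 32
40 | TODAY | 32
41 | TWO | 32
42 | BRITAIN | 30
43 | ECONOMIC | 30
44 | VE | 30
45 | CAN | 28
46 | HAS | 28
47 | TRADE | 28
48 | YEARS | 28
49 | DEVELOPMENT | 26
50 | FIRST | 26
51 | SHOULD | 26
52 | SO | 26
53 | UNITED | 26
54 | WAS | 26
55 | NEW | 25
56 | BE | 24
57 | BRITISH | 24
58 | FROM | 24
59 | GLOBAL | 24
60 | INVESTMENT | 24
61 | PARTNERSHIP | 24
62 | TOGETHER | 24
63 | PRESIDENT | 22
64 | TIME | 22
65 | YEAR | 22
66 | KINGDOM | 20
67 | ONE | 20
68 | RELATIONS | 20
69 | ALL | 18
70 | BILATERAL | 18
71 | BILLION | 18
72 | BY | 18
73 | GROWTH | 18
74 | ISSUES | 18
75 | SEIZE | 18
76 | THEY | 18
77 | VERY | 18
78 | WELCOME | 18
79 | WELL | 18
80 | WORK | 18
81 | COUNTRY | 16
82 | FUTURE | 16
83 | GOOD | 16
84 | LAST | 16
85 | LIKE | 16
86 | M | 16
87 | MARKS | 16
88 | MR | 16
89 | MY | 16
90 | NOW | 16
91 | PRIME | 16
92 | PRINCE | 16
93 | QUEEN | 16
94 | SECURITY | 16
95 | TIES | 16
96 | YOU | 16
97 | US | 15
98 | AT | 14
99 | DISCUSSED | 14
100 | HE | 14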

Publications & Useful Links

This section lists publications on and related to the CEPIC, together with useful links to corpora and to conferences, seminars and workshops relevant to political discourse and its translation/interpreting. The information on this page will be updated periodically.

More updates on the CEPIC are available at https://sites.google.com/a/hkbu.edu.hk/cepic-the-chinese-english-political-interpreting-corpus/, or by following our Facebook or Twitter account.

We would appreciate it if you could send us information about your publications or other works based on or related to the CEPIC via this link: https://hkbuhk.ca1.qualtrics.com/jfe/form/SV_a97lXo4AKTh0hbT. A selected list of publications and links will be included on this webpage.

Publications on and related to CEPIC

Journal Articles

Pan, J. (forthcoming). The pragmatics of political discourse: An analytical framework and a comparative study of policy speeches in the United Kingdom and Hong Kong. Bandung: Journal of the Global South.

Pan, J., & Wong, T. M. (forthcoming). Developing Pragmatic Competence in Chinese–English Political Retour Interpreting: A Corpus-Driven Exploratory Study of Pragmatic Markers. inTRAlinea.

Pan, J., & Wong, T. M. (2018). A corpus-driven study of contrastive markers in Cantonese‒English political interpreting. BRAIN – Broad Research in Artificial Intelligence and Neuroscience, pp. 168-176.

Pan, J., & Wang, H. H. (2008). Communication between the speaker and the interpreter. Journal of Jiangsu University (Social Science Edition), 10(4), 77–80.

Pan, J. (2007). Two styles of interpretation: Reflection on the influence of oriental and western thought patterns on the relationship between the speaker and the interpreter. Foreign Language and Culture Studies, 6, 677–688.

Conference Papers

Pan, J. (2018, 12-14 September). The use of contrastive markers in English policy speeches: A corpus-based cross-modality comparison of but and however in interpreted and non-interpreted language. Paper presented at the fifth edition of the Using Corpora in Contrastive and Translation Studies conference (UCCTS 2018), Université catholique de Louvain, Belgium. In Sylviane Granger, Marie-Aude Lefer and Laura Aguiar de Souza Penha Marion (eds), Book of Abstracts, pp. 138-139.

Pan, J., & Wong, T. M. (2018, 3-6 July). Pragmatic strategies in political interpreting: A study of pragmatic markers in interpreted political speeches. Paper presented at the IATIS (International Association for Translation and Intercultural Studies) 6th International Conference, Hong Kong Baptist University, Hong Kong.

Pan, J. (2018, 22-24 June). A corpus-based study of the rendition of contrastive markers in Chinese‒English political interpreting. Paper presented at the Corpora and Discourse International Conference, Lancaster University, UK.

Pan, J. (2018, 18-20 June). Pragmatic strategies applied in institutional translation: A case study of the translation of two contrastive markers in Hong Kong’s policy addresses. Poster presentation at the TRANSIUS Conference, University of Geneva, Switzerland.

Pan, J. (2017, December). Developing pragmatic competence in Chinese-English political interpreting. Paper presented at the First National Forum on Diplomatic Discourse and Translation, Henan, PRC.

Pan, J., & Wong, T. M. (2017, September). Developing pragmatic competence in political retour interpreting: A corpus-driven study on the use of pragmatic markers. Paper presented at Teaching Translation and Interpreting 5, University of Łódź, Łódź, Poland.

Pan, J., & Wong, T. M. (2017, September). A Corpus-driven Study of Contrastive Markers in Cantonese‒English Political Interpreting. Paper presented at SMART 2017 – Scientific Methods in Academic Research and Teaching, Timisoara, Romania.

Pan, J., & Wong, T. M. (2015, December). Pragmatic markers in interpreted political discourse: A corpus-driven study. Paper presented at the International Conference on Corpus Linguistics and Technology Advancement (CoLTA), Hong Kong.

Pan, J., & Wong, T. M. (2015, September). Investigating pragmatic markers in interpreted political speeches from Chinese to English. Paper presented at the International Conference “Found in translation – translations are the children of their times”, Bucharest, Romania.

Useful Links

Corpora/datasets relevant to political discourse and its translation/interpreting (in alphabetical order)

The Corpus of Political Speeches

The Digital Corpus of the European Parliament

The European Comparable and Parallel Corpora

The European Parliament Interpreting Corpus

European Parliament Translation and Interpreting Corpus

The UN Parallel Text

The WAW corpus

中国外交话语语料库

Conferences/seminars/workshops relevant to political discourse and its translation/interpreting (newest on top)

Translating and Interpreting Political Discourse (TIPD 2019) (19-20 June 2019)

Translation as Political Act (9-11 May 2019)


Project Team & Acknowledgements

Project Team

Principal Investigator: Dr. Jun PAN (Associate Professor, Translation Programme, Hong Kong Baptist University)

Co-Investigator & Special Consultant to the CEPIC: Dr. Billy Tak Ming WONG (Research Coordinator, University Research Centre, Open University of Hong Kong)

Senior Adviser to the CEPIC: Ms. Rebekah WONG (Head of Digital and Multimedia Services Section, Library, Hong Kong Baptist University)

We would like to thank the following research assistants and student helpers for their contributions to data preparation, and the following library colleagues for their support with the technical aspects of the CEPIC.

Research Assistants:
Mr. Fernando GABARRON BARRIOS
Mr. Steven Haoshen HE
Miss Chris Chencheng KUANG
Miss Hannah Qiuhan LIN
Mr. William Dongpeng PAN
Miss Jennifer Lok Man WONG
Miss Grace Jing ZHANG

Student Helpers:
Mr. Antonio Yijiao GUO
Mr. Hank Lin HAN
Miss Rigel Chung Ting PAK
Miss Gladys Hiu Man SHIU
Miss Jess Lin Wing SZE
Miss Tammy Cho Ying TANG
Miss Janny Chi Wai WONG
Miss Alice Yuxin YANG
Mr. Niko Donghuan ZHANG

Library Colleagues:
Mr. Wing Chung YIP
Mr. Timothy Sit YEUNG
Miss Sharon Suk Man YU
Miss Katie Kee Yee CHENG

From left to right: Tammy TANG, Janny WONG, Dr. Jun PAN, Antonio GUO, Fernando GABARRON BARRIOS & William PAN


From left to right, front row: Fernando GABARRON BARRIOS, Dr. Jun PAN & William PAN
From left to right, back row: Janny WONG, Alice YANG, Rigel PAK, Tammy TANG & Gladys SHIU


Meeting with research assistants and student helpers


Acknowledgements

The CEPIC was developed with funding and support from:

  • The Early Career Scheme (ECS) of the Research Grants Council (Project Title: Interpreting into the B language: A corpus-oriented study of pragmatic markers in interpreted political speeches from Chinese to English, Project No.: 22608716);
  • The Digital Scholarship Grant of the Hong Kong Baptist University (Project Title: The Chinese-English Political Interpreting Corpus (CEPIC): An Online Corpus for the Study of Interpreted Political Speeches); and
  • The Faculty Research Grant of the Hong Kong Baptist University (Project Title: The Use of Pragmatic Markers in Chinese-English Political Interpreting: A Corpus-oriented Study, Project No.: FRG2/17-18/046).

We would like to express our gratitude to the funding bodies for making the work on this corpus possible.

Our appreciation also goes to the Hong Kong Baptist University Library for advising on data structure, uploading the corpora, and helping to design the website and its search functions.