Select an English corpus (330.6 million words in all):
Europarl (ca. 25.7 mill. words, no password)
Wikipedia A (ca. 35.3 mill. words, no password)
Wikipedia B (ca. 40.7 mill. words, no password)
Wikipedia C (ca. 39.1 mill. words, no password)
BNC-written (47.2 mill. words, 55.8 mill. tokens, password)
BNC-spoken (20.2 mill. words, 23 mill. tokens, password)
Chat corpus (ca. 23.5 mill. words, no password)
UCLA CSA television news 2005-2009 (ca. 12.4 mill. words, 13.2 mill. tokens)
UCLA CSA television news 2010-2012 (ca. 10.6 mill. words, 11.2 mill. tokens)
Wikipedia Talkpages (ca. 10.2 mill. words, 12.4 mill. tokens)
Supreme Court Dialogues (ca. 2.05 mill. words, 2.48 mill. tokens)
KEMPE (8.9 mill. words, 10.7 mill. tokens, no password)
Enron e-mails A (ca. 27.5 mill. words, 32.1 mill. tokens)
Enron e-mails B (ca. 27.5 mill. words, 32.1 mill. tokens)
Enron e-mails C (ca. 27.5 mill. words, 32.1 mill. tokens)
E-mail corpus (2.7 mill. words, 3.3 mill. tokens, password, SDU only)
E-mail openings (110,000 words, 127,000 tokens, password, SDU only)
Beauty blog (304,000 words, 323,000 tokens)
Case insensitive
Diacritics insensitive