
Corpus sources and copyright

We are grateful to all organisations and individuals who have provided/licensed corpus texts for use at the Institute of Language and Communication (ISK) at the University of Southern Denmark.

Credits: For your publications or other references, please use the text and provider details listed below for the individual corpora. For annotation and site credits, see also our work credits page.

Please note that corpus search engines are meant to provide researchers with language data and statistics, not running text. Thus, ordinary copyright still holds. This implies for instance that you mustn't try to extract larger, contiguous text portions from any of the corpora.


Danish corpus sources:

  • LOKE, an online news and literature magazine, copyright Arne Herløv Petersen
  • Parliamentary debates, from the Danish Folketing, kindly provided by webmaster Benny Høyer
  • Udklipsbureauet, prose fiction by Ole Dalgaard
  • Bar el Gazel, prose fiction by Ole Dalgaard
  • Litteraturvidenskaben siden nykritikken, by Ole Sauerberg (2000)
  • Ret og pligt i det 17. århundrede, by Knud E. Korff (1996)
  • Skalk is a Danish journal of archaeology (ISSN 0560-1894)
  • Munk-korpus, by Ulrik Petersen (2006)
  • Europarl is the Danish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the Danish part of a 9-language, Google-style snapshot of Wikipedia, The Free Encyclopedia (originally 2005, renewed for Danish 2018). What CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • Korpus 90/2000, a mixed genre "quote corpus", was compiled by DSL (Det Danske Sprog- og Litteraturselskab), and grammatically annotated with VISL tools in a joint venture framework. The text corpus, Den Danske Ordbogs Citatkorpus (DDOC-korpus), is a subset of Den Danske Ordbog (DDO) korpus. In the construction of the DDOC corpus the following steps were used: 1. automatic orthographic sentence chunking (though with some errors, due in particular to ambiguous full stops), 2. removal of a random third of the sentences, 3. randomised ordering of the remaining sentences. Another subset of the DDO corpus is Korpus 90, which contains all DDO texts from 1988-92 (25 million running word forms). Korpus 90/2000 is accessible at the website of the Korpus 2000 project.
  • Korpus 2010 is the newest and largest DSL corpus, compiled for the Danish CLARIN-project (2008-2011). Apart from news data, it contains a substantial section with texts from blogs and internet fora.
  • The Leipzig corpus is the Danish section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.
  • Information is a newspaper corpus, consisting of 14,780 articles from the publicly searchable archive of the Danish daily newspaper Information (1996-2008). The corpus contains about 92 million words and was kindly made available by Johannes Wehner in the context of a proposed joint semantic web project involving Information, VISL and GrammarSoft.
  • Facebook: This corpus was compiled for ISK's Velux-funded hate speech project (XPEROHS) and contains comment threads from political and mass media FB pages between late 2017 and mid 2018. There are subcorpora filtered for posts concerning minorities in Denmark (FBmin) and for posts on the public service media sites (FBmin DR/TV2).
  • Twitter: This corpus was compiled for the Velux-funded XPEROHS project (2018-2022) and contains tweets harvested during the project period, using the Twitter query API. The overall corpus is an unabridged monitor corpus statistically representative of Danish Twitter as a whole, but there are subcorpora filtered for tweets involving minorities in Denmark. It is also possible to search the Corona period (2020/21) in isolation.
  • VIMU contains data from a didactical site on Danish and German border history. The corpus was used in the Eured project exploring "Constructions of European and National Identities in Educational Media".
  • Danish Literature 1800-1940 consists of 509 classical/older literary texts, with expired copyright, from two sources, both freely accessible online: (a) the Project Gutenberg collection and (b) Det Kongelige Bibliotek (The Royal Library).
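
The three construction steps listed for the DDOC corpus (sentence chunking, removal of a random third, randomised ordering) can be illustrated with a minimal Python sketch. This is not DSL's actual tooling: the function names, the regular expression used as a sentence splitter and the fixed random seed are all assumptions made for illustration only. The naive splitter deliberately shows the kind of error that ambiguous full stops cause.

```python
import random
import re


def chunk_sentences(text):
    """Naive orthographic chunking (assumed rule, not DSL's):
    split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_quote_corpus(texts, seed=0):
    """Hypothetical helper mirroring the three steps described above."""
    rng = random.Random(seed)
    # Step 1: chunk every text into sentences.
    sentences = [s for t in texts for s in chunk_sentences(t)]
    # Step 2: drop a random third, i.e. keep a random two thirds.
    n_keep = len(sentences) - len(sentences) // 3
    kept = rng.sample(sentences, n_keep)
    # Step 3: randomise the order of the remaining sentences.
    rng.shuffle(kept)
    return kept


if __name__ == "__main__":
    demo = ["Hr. Jensen kom. Han gik.", "Det regner. Solen skinner. Vi venter."]
    for sentence in build_quote_corpus(demo):
        print(sentence)
```

Because the abbreviation "Hr." ends in a full stop, the splitter wrongly treats it as a sentence of its own, which is the type of chunking error the corpus description mentions. A result built this way contains isolated sentences in random order, so running text cannot be reconstructed from it.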

    Portuguese corpus sources:

  • Europarl is the Portuguese part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the Portuguese part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • Público: A large corpus of European Portuguese (1991-1998), containing articles and other material from the Público newspaper (180 million words). The corpus was compiled by Linguateca and is freely available online. The corpus was morphosyntactically annotated with the PALAVRAS parser as part of the AC/DC project, a joint venture between VISL and the Processamento computacional do português initiative.
  • Folha de São Paulo: A corpus of Brazilian Portuguese, containing one year's collection of the Folha de São Paulo newspaper (1994), about 25 million words. Like the CETEMPúblico, this is a Linguateca corpus, PALAVRAS-annotated within the AC/DC project, and available online.
  • COLONIA: A corpus of historical Brazilian Portuguese (100 texts, ~ 5 million words), covering the entire period of colonial Brazil. The corpus was developed at the University of Cologne and provided by Marcos Zampieri. For more information, see the project's homepage. The CorpusEye annotation retains original wordforms but internally attempts orthographical normalisation to allow POS and syntactic tagging.
  • NETLANG: A CMC corpus for hate speech research consisting of reader/viewer comments (mostly YouTube, but also Público and Sol newspaper sites, ~ 7 million words). The corpus was compiled at the University of Minho (Braga, Portugal) and has a larger, English sister corpus. For more information, see the project's homepage. The 2022 CorpusEye annotation with PALAVRAS includes morphosyntax, dependency trees, semantic roles and verb frames. The parser was genre-adapted for this project and handles social media jargon, slang and some regionalisms, as well as emoji classification and automatic spellchecking/normalization.
  • Portuguese Literature: Classical Portuguese literature, both European and Brazilian, comprising about 450 novels from the ELTeC and Gutenberg collections.
  • Portuguese Blogs: A large corpus of Portuguese blogs collected between 2013 and 2017, named BlogSet-BR and described in Santos, Woloszyn & Vieira, 2018.

    Some other Portuguese texts at this site are corpus samples that have been tagged with the PALAVRAS parser for testing and evaluation purposes, in cooperation with the following research teams:

  • Speech data: Annotated data from the C-ORAL-Brasil project, the NURC Digital project and the CORDIAL-SIN project.
  • Historical texts: The TYCHO BRAHE Corpus of Historical Portuguese
  • Modern texts: The NILC project
  • Ad corpus: 580 advertisements from the 2005 and 2006 editions of the Portuguese journals Activa, Lux Woman, GQ, Visão and Caras Decoração, collected by Alexandra Pinto at FLUP, comprising texts of at least 15 content words.

    German corpus sources:
  • Europarl is the German part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the German part of a 9-language, Google-style snapshot of Wikipedia, The Free Encyclopedia (originally 2005, renewed for German 2018). However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • bzk: Bonner Zeitungskorpus. Password protected for ISK-researchers only.
  • mak: Mannheimer Korpora. Password protected for ISK-researchers only.
  • ecide3: Frankfurter Rundschau newspaper text (ca. 1992), as compiled for the multilingual ECI-collection by the European chapter of the Association of Computational Linguistics (EACL). Password protected for ISK-researchers only.
  • Leipzig Internet Corpus contains German Internet text data (primarily news) compiled at the University of Leipzig (Leipzig Corpora Collection).
  • Facebook: This corpus was compiled for ISK's Velux-funded hate speech project (XPEROHS) and contains comment threads from political and mass media FB pages between late 2017 and mid 2018. There are subcorpora filtered for posts concerning minorities in Germany (FBmin).
  • Twitter: This corpus was compiled for the Velux-funded XPEROHS project (2018-2022) and contains tweets harvested during the project period, using the Twitter query API. The overall corpus is an unabridged monitor corpus statistically representative of German Twitter as a whole, but for the first couple of years there are also smaller subcorpora filtered for tweets involving minorities in Germany. It is also possible to search the Corona period (2020/21) separately.

    English corpus sources:
  • Europarl is the English part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the English part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • BNC: British National Corpus, containing both written and spoken data. Password protected for ISK-researchers only.
  • KEMPE: 'Korpus of Early Modern Playtexts in English', was initially compiled by Lene B. Petersen and Marcus X. Dahl, in association with VISL, SDU, 2001-2003. The fully searchable version of the corpus was prepared by Lene B. Petersen and Eckhard Bick, July 2004, and may be freely accessed online without a password. Please report any mis-tagged word forms to lene.petersen@uwe.ac.uk.
  • Chat is a corpus of 4 different chat logs from Project JJ (http://www.projectjj.com), administered by Tino Didriksen. The logs were collected between August 2002 and August 2004, and cover the topics (a) Harry Potter, (b) Goth Chat, (c) X Underground and (d) Amarantus: War in New York.
  • UCLA CSA television news contains transcripts of English-language television news (2005-2012), compiled by the Red Hen lab. The corpus was morphosyntactically and semantically annotated with the EngGram parser. Among other things, the annotation covers semantic roles and EFN verb frames (http://framenet.dk).
  • Enron e-mails is a corpus of corporate e-mails, called the Enron Email Dataset, and made available for research by William Cohen on his website. The data was originally made public, and posted to the web, by the (US) Federal Energy Regulatory Commission during its investigation (history and credits).
  • Wikipedia Talkpages is a "speech-like" corpus of Wikipedia author discussions, called the Wikipedia Talk Page Conversations, and made available for research by Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg at their corpus download page, together with their article Echoes of power: Language effects and power differences in social interaction.
  • Supreme Court Dialogues is a speech corpus of about 50,000 utterances by 300 participants in 204 lawsuits. The corpus was made available by Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg at their corpus download page, together with their article Echoes of power: Language effects and power differences in social interaction, building on earlier work by Timothy W. Hawes in his M.A. thesis.

    French corpus sources:
  • Europarl is the French part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the French part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • ecifr1: Le Monde newspaper text (1989/1990), as compiled for the multilingual ECI-collection by the European chapter of the Association of Computational Linguistics (EACL). Password protected for ISK-researchers only.
  • Ananas, also password-protected, consists of part of ecifr1 as well as other news excerpts from the Ananas project. Joint venture with Susanne Salmon-Alt (ATILF - Loria-LED

    Spanish corpus sources:
  • Europarl is the Spanish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the Spanish part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • ecies2: El Diario Sur newspaper text (April & September 1991), as compiled for the multilingual ECI-collection by the European chapter of the Association of Computational Linguistics (EACL). Password protected for ISK-researchers only.
  • CAMTIE is a news corpus with Spanish data from the 1990s, covering the Cambio and Tiempo magazines.
  • The Spanish Internet corpus contains data from around 2009 and was compiled using a crawl engine. The purpose of the corpus was to provide a broad lexical base for a project on semantic role annotation.

    Esperanto corpus sources:
  • TTT Internet corpus: This corpus was compiled from a random crawl of Esperanto pages on the internet, performed once in 2004 and once in 2009. Other-language and binary sections were filtered out, and dozens of encoding conventions unified into iso-latin-1, but for better or worse, the Internet corpus is different from the literary, wiki and news corpora, containing a larger portion of non-standard language usage, typing errors etc.
  • Wikipedia: Wikipedia-2005 is the Esperanto part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances. Wikipedia-2010 is an equivalent, later (and hence larger) snapshot, using the raw version of Apertium's Esperanto Wikipedia corpus, compiled by Jacob Norfalk.
  • Eventoj: Newspaper text from the internet version of the 2-weekly Eventoj magazine (1992-2002), published by the Budapest based LINGVO studio. Together with other material and Esperanto-services, Eventoj archives are accessible at the Eventoj Esperanto Center. Corpus use was kindly permitted by László Szilvási.
  • Monato: Monthly news magazine with international topics. Monato, a kind of Esperanto "Newsweek", is published by Flandra Esperanto-Ligo and has a 25-year history. Files with compiled back issue articles are available online at Edmund Grimley Evans' Tekstaro page.
  • Zamenhof classics: Esperanto texts from the Zamenhof period (Biblio, Andersen's and Grimm's Fabeloj, La Faraono, Proverbaro, Revizoro and Marta). The texts are electronically available on CD-ROM, kindly provided by Wolfram Diestel.
  • Esperanto literature: Esperanto literature on the internet, both original and translated, the main sources being Tekstaro de Esperanto and eLibrejo.
  • The E-mail corpus contains private communications and has restricted access (contributors only).
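
As a hedged illustration of what unifying Esperanto encoding conventions into iso-latin-1 can involve: the six accented Esperanto letters do not exist in iso-latin-1, so some ASCII convention is needed for them. The sketch below assumes the common "x-system" (ĉ written as cx, and so on) as the target; the text above does not say which convention CorpusEye actually uses, and the function name is hypothetical.

```python
# Unicode Esperanto letters mapped to the x-system (assumed target convention).
X_SYSTEM = {
    "ĉ": "cx", "ĝ": "gx", "ĥ": "hx", "ĵ": "jx", "ŝ": "sx", "ŭ": "ux",
    "Ĉ": "Cx", "Ĝ": "Gx", "Ĥ": "Hx", "Ĵ": "Jx", "Ŝ": "Sx", "Ŭ": "Ux",
}


def to_latin1_x_system(text):
    """Rewrite accented Esperanto letters in x-system form, then
    encode as iso-latin-1 (unencodable characters become '?')."""
    converted = "".join(X_SYSTEM.get(ch, ch) for ch in text)
    return converted.encode("iso-8859-1", errors="replace")


if __name__ == "__main__":
    print(to_latin1_x_system("ĉiuj ŝanĝoj en Eŭropo"))
```

Pages that already use another convention (for example the h-system, or a legacy 8-bit encoding) would each need their own mapping into the same target form, which is presumably why dozens of conventions had to be handled.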

    Italian corpus sources:
  • Europarl is the Italian part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • Wikipedia is the Italian part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • Italian Literature: Classical Italian literature, comprising about 70 novels from the ELTeC collection.

    Romanian corpus sources:
  • The Business corpus was compiled by Arina Greavu from Revista Capital (1998-2005) and Adevarul Economic (1999-2004). Only out-of-context concordance quotes are provided, not entire articles.

    Swedish corpus sources:
  • GöteborgsPosten is a Swedish newspaper corpus compiled by Leif Grönqvist from 12 annual collections (1992-2003) of Göteborgs-Posten. In un-annotated form, the corpus is also searchable at Språkbanken's website. The CorpusEye search interface does allow grammatical/syntactic searches, but will only show single-sentence concordances, without context.
  • Europarl is the Swedish part (both original and translated) of the European Parliament Proceedings Parallel Corpus 1996-2003, which is the brainchild of Philipp Koehn and available without copyright at his website.
  • The Leipzig corpus is the Swedish section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.

    Norwegian corpus sources:
  • Wikipedia is the Norwegian part of a 9-language, Google-style snapshot (December 2005) of Wikipedia, The Free Encyclopedia. However, what CorpusEye presents is a linguistic resource, not a knowledge resource. We provide neither excerpts nor source articles, but statistics from grammatically annotated sentence concordances.
  • The Leipzig corpus is the Norwegian section of the Leipzig Corpora Collection, compiled from Internet sources at the University of Leipzig.
  • Norwegian Literature: Classical Norwegian literature, comprising about 60-70 novels from the ELTeC collection.

    Other sources:
    Some further material, for some of the languages, was provided by ISK members and integrated into this site to allow easier internal access for statistical and distributional research. Please feel free to contact us if you have corpus material yourself that you would like to have annotated and made searchable through this site.