Raw text corpora were harvested from the internet and through provider APIs (Facebook, Twitter), downloaded from existing repositories (Leipzig Wortschatz corpora, Wikipedia dumps), licensed (ECI, DSL), or kindly provided by project partners (Oxford University, Linguateca, ATILF, NILC, Red Hen Lab, the Danish parliament and others). Some sources were scanned and OCR-converted at the ISK (Skalk) or acquired by ISK employees through private channels. For a full list of corpus credits and references, see our copyright page, which is also linked from the individual corpus pages.
Grammatical corpus annotation, both morphosyntactic tags (CG) and tree structures (PSG), was performed with Eckhard Bick's VISL parsers: PALAVRAS (Portuguese), PALAVRAS-HIS (Spanish), DanGram (Danish), GerGram (German), EngGram (English), SweGram (Swedish), NorGram (Norwegian Bokmål), EspGram (Esperanto), ItaGram (Italian) and FrAG (French), all of which are accessible online (including a file upload service). For Romanian, the morphological annotation was performed with Dan Tufis' probabilistic MSD tagger.
Treebank revision was supervised work involving, among others, the following VISL students: Susanna Afonso and Raquel Marchi (Portuguese), Ina Størner Rasmussen, Camilla Pedersen, Dorte Lønsmann and Kim Ebensgaard Jensen (Danish), and Ane Dybro Johansen (French). The treebank projects received funding support from Linguateca (Portuguese), The Nordic Council of Ministers (Danish) and ATILF (French).
More information on the VISL project, as well as live grammatical analysis and a number of grammar teaching tools, is available at the VISL main site or its research-oriented beta version.