CQP speak: Instead of normal text searches, or menu-based "refine" searches, you can type cqp-searches directly on the search line. Don't worry, you needn't - the interface will translate your menu choices into cqp for you - but if you are familiar with cqp, this may be a faster option. By the way, the cqp-translations of text or menu searches are shown at the top of every concordance page, so you can experiment using cut & paste, changing parts of an existing cqp-expression rather than writing one from scratch. The basic syntax for a cqp-search is the following:
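A sketch of the basic format: each token in the search is a bracketed attribute-value expression, and values are regular expressions. The attribute names below (word, lemma, pos) are the common defaults and may differ from corpus to corpus:

```
[word="house"]                 the word form "house"
[lemma="house"]                any inflected form of "house"
[pos="N.*"]                    any noun tag (a regex over the tag)
[pos="ADJ"] [word="car"]       a two-token sequence
[pos="ADJ"]* [word="car"]      "car", optionally preceded by adjectives
```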
Frequency lists: A frequency list is made up of the n highest-ranking words at the sort position - where n can be set in the freq items menu (the default is a maximum of 100). The sort position is defined as left/right context or the left/right edge of the highlighted hit strings. You can specify positions inside the string, or a more distant context, by setting the offset value, counting right (+) or left (-) from a standard edge or context position. The list header contains the corpus name and the overall number of hits in this corpus. If several corpora were searched at the same time, individual frequency lists will be shown for each.

The list itself shows the individual frequency of the words encountered at the sort position. num is the absolute number of occurrences, freq/conc the percentage out of all hits (i.e. the concordance), and freq/corp is the in-corpus frequency normalized to 1:100,000,000. The latter is shown together with a weighted frequency, freq/norm, if the rel button is used (relative frequency). The weighted frequency is used for ranking, so typical/interesting hits can be sorted to the top. This is achieved by dividing the actual frequency by a standard "lexical" frequency taken from a multi-genre mix of background corpora (itself normalized to 1:100,000). The precise metric is the square of the local frequency (freq/conc, as a fraction) divided by the overall lexical frequency (per 100,000), times 100,000.

A word with an in-concordance frequency of 1% and a norm frequency of 1:10,000 will receive a ranking value of 1. The same goes for a 10% word with a 1:100 norm frequency, or a 0.1% word with a 1:1,000,000 norm frequency. If a word w1 in a concordance list has a freq/norm value 900 times higher than another word w2 in the same list, this can either mean that w1 is 30 times more frequent than w2 in the search context (because local frequencies are squared), or that the standard frequency of w1 is 900 times lower than that of w2.
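The ranking metric can be sketched in a few lines of code - an illustration of the arithmetic described above, not the tool's actual implementation:

```python
def rank(freq_conc, freq_norm):
    """Weighted frequency used for ranking.

    freq_conc: in-concordance frequency as a fraction (1% -> 0.01)
    freq_norm: standard lexical frequency as a fraction (1:10,000 -> 1/10_000)
    """
    norm_per_100k = freq_norm * 100_000   # norm frequency scaled to 1:100,000
    return freq_conc ** 2 / norm_per_100k * 100_000

# The three worked examples from the text all get ranking value 1:
for conc, norm in [(0.01, 1/10_000), (0.10, 1/100), (0.001, 1/1_000_000)]:
    assert abs(rank(conc, norm) - 1) < 1e-9

# Squaring the local frequency means a word 30 times more frequent
# (at equal norm frequency) ranks 900 times higher:
assert abs(rank(0.30, 1/100) / rank(0.01, 1/100) - 900) < 1e-6
```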
Standard frequencies are set at a minimum value of 1:10,000,000 for rare words. In order to compensate for spelling errors and individual proper nouns, words with only one or two occurrences are frequency-punished by a factor of 0.01 and 0.02, respectively.
Info: The Info button leads to a full-sentence pop-up window, providing more context than the concordance (at least a sentence, for some corpora a paragraph), with all grammatical attributes shown (cf. Popup tag window). Also shown, at the bottom, is a reference id from the corpus in question. It may contain a chunk or sentence id and - for some corpora - source information like genre, date etc.
Phrases (Syntagmata): Only in treebanks will there be an explicit mark-up of phrase structure. CG corpora carry all information in a word-based fashion. A phrase's function is marked on its head, i.e. the noun in a noun phrase, the adjective in an adjective phrase. If you want to search for and highlight not only the head but a complete phrase, you have to specify its dependents in your search, too. Let's say you are looking for a direct object np immediately following a verb in the infinitive. A search with two fields, verb/infinitive and noun/object, will only find nouns without premodifiers, since a premodifier would separate the verb from its object. Therefore, you will have to allow for a third field (adnominal/prenominal) in the middle, marked with the *-operator for 'zero or more instances' - effectively allowing for np's rather than nouns only. In fact, if you want to allow for all np's, you need to check adverbial adject on the middle field, too, since a prenominal adjective may itself have a dependent, typically an adverb, like very in a very old car.
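In cqp terms, the three-field search could be sketched like this - the attribute names and tag values here are placeholders, since the actual tagset differs from corpus to corpus:

```
[pos="V" & morph=".*INF.*"]    the infinitive verb
[func="prenominal"]*           zero or more prenominal dependents
[pos="N" & func="object"]      the object noun, i.e. the head of the np
```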
Polylexicals (Multi-word expressions): Most CG parsers do their own tokenization, fusing certain expressions, like names or complex prepositions, into one "word", to facilitate syntactic analysis: Iorek_Byrnisson, i_stedet_for. Polylexicals will still be visible in concordance format, and will show up as a single context item in frequency sorting. If you want to search for a word you know is treated as a polylexical, add underscores (_) in the search string, as shown in the examples. In cqp-speak, spaces will suffice, e.g. [word="i stedet for"].
Popup tag window: Touching a word in the concordance with your mouse pointer will give you a popup with the following grammatical information for the word in question:
Regular expressions (regex): To work your magic on character strings. Regular expressions can be used instead of ordinary text anywhere in the interface. At the simplest level, you can use optionality and repetition operators, and a dot (.) as a dummy character:
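A few illustrations of these basic operators, tried out here with Python's re module - the operator syntax is shared by most regex dialects, including cqp's:

```python
import re

# "?" makes the preceding item optional, "*" allows zero or more repetitions,
# "+" demands one or more, and "." stands for any single character.
assert re.fullmatch(r"colou?r", "color")      # optional "u"
assert re.fullmatch(r"colou?r", "colour")
assert re.fullmatch(r"go*al", "gal")          # zero or more "o"s
assert re.fullmatch(r"go+al", "goooal")       # at least one "o"
assert re.fullmatch(r"h.t", "hut")            # "." matches any character
assert not re.fullmatch(r"go+al", "gal")      # "+" requires at least one "o"
```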
Another useful regex feature is sets of characters. Sets are enclosed in square brackets [], and ranges can be defined with a hyphen [c1-c2]. Negated sets start with a caret [^...]. Ordinary sets [] only hold single characters; for complex 'sets' of whole expressions, round brackets are used, with alternatives separated by the '|'-operator (logical OR).
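The same set and alternation syntax, demonstrated with Python's re module:

```python
import re

# Character sets, ranges, negation, and alternation over whole expressions.
assert re.fullmatch(r"[bc]at", "bat")                      # set: "b" or "c"
assert re.fullmatch(r"[a-f]+", "cafe")                     # range a..f
assert re.fullmatch(r"[^aeiou]+", "sky")                   # negated set: no vowels
assert not re.fullmatch(r"[^aeiou]+", "sea")               # "e" and "a" are vowels
assert re.fullmatch(r"(walk|talk)(s|ed|ing)?", "talking")  # "|" = logical OR
```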
A number of protected symbols let you mark the start and end of a string, or non-printing characters like tabs and line breaks. Also, some sets are pre-defined as protected symbols:
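The exact inventory of protected symbols depends on the interface; as an illustration, here are the backslash shorthands found in most regex dialects, tested with Python's re module:

```python
import re

# Common pre-defined sets:
#   \w  word character    \d  digit
#   \s  whitespace        \b  word boundary (zero-width)
assert re.fullmatch(r"\d+", "2024")
assert re.fullmatch(r"\w+\s\w+", "hello world")
assert re.search(r"\bcat\b", "the cat sat")       # matches the whole word only
assert not re.search(r"\bcat\b", "concatenate")   # "cat" inside a word: no hit
```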
At the most advanced level, regular expressions allow variables. In fact, any bracketed part () of a regular expression is regarded as a variable and can be referred to later in the same expression. Variables are named as back-slashed numbers, counting round brackets from left to right as \1, \2, \3 etc. However, cqp does not allow this particular feature. You can still use it in the old interface (rectangular flags), on running text corpora. \s([gc][aeiou])[^ ]+( \1\w+)+\s finds you "gaelic" alliteration rhymes in an English corpus. \s\w(\w+)-\w\1\s, surrounded by single spaces, finds you lots of "willy-nilly"-constructions as well as some "cut-out" and "four-hour" cases. \s[a-z]([a-z]+)\-[a-z]\1\s does the same, but avoids soccer results.
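The back-reference mechanism can be tried out with Python's re module, which uses the same \1 syntax:

```python
import re

# \1 refers back to whatever the first (...) group captured,
# so both halves of the hyphenated pair must share the same ending.
pattern = r"\s\w(\w+)-\w\1\s"        # the "willy-nilly" pattern from the text

assert re.search(pattern, " willy-nilly ").group(1) == "illy"
assert re.search(pattern, " cut-out ").group(1) == "ut"
assert re.search(pattern, " cast-iron ") is None   # endings differ: no match
```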
Start of sentence: In the newer corpora, each sentence has a start marker (¤) which is regarded as a "word form" and can be used to look for sentence-initial items, e.g. finite verbs in inverted Danish sentences.
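In cqp terms, such a search could be sketched as follows - the pos value here is a placeholder for the corpus's actual finite-verb tag:

```
[word="¤"] [pos="V.*FIN.*"]
```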