cqp help

HELP

Clauses can be searched for through their head verb, i.e. the finite verb in finite clauses, or the first non-finite verb in non-finite clauses. In the menu-based interface, choose a corresponding verb category (e.g. part of speech = verb, morphology = finite) or clause category under Predicator, and combine this with a function category for the clause, e.g. adnominal-postnominal for a relative clause, or direct/accusative object for an object clause. Note that in some older corpora, the original, complex Constraint Grammar tag is maintained (e.g. FS-N< for a relative clause), and only cqp-speak will make it searchable: [func=".*FS-N<.*"]. An expression like [extra="(mv|aux|rel|interr|.*cl)" & func=".*N<.*"] will work for all corpora.

CQP speak: Instead of normal text searches, or menu-based "refine" searches, you can type cqp-searches directly on the search line. Don't worry, you needn't: the interface will translate your menu choices into cqp. But if you are familiar with cqp, this may be a fast option. By the way, cqp-translations of text or menu searches are shown at the top of every concordance page, so you can experiment using cut & paste, changing parts of an existing cqp-expression rather than writing one from scratch. The basic syntax for a cqp-search is the following:

text words are written in quotes
attributes of fields are written in square brackets. Each token gets one bracket.
- word form: [word="children"]
- lexeme: [lex="child"]
- part of speech: [pos="N"] (noun)
- morphology/inflexion: [morph=".*P.*"] (plural)
- syntactic function: [syn="SUBJ>"] (subject left of verb)
- secondary tags: [extra=".*Hprof.*"] (human professional)
fields applying to the same word token are joined in the same bracket, and are linked with a boolean operator. For clarity, fields may be enclosed in round brackets (). Fields can be negated:
- '&'-operator (AND): [lex="child" & func=".*ACC.*"], same as [(lex="child") & (func=".*ACC.*")].
- '|'-operator (OR): [lex="child" | lex="daughter" | lex="son"].
- '!'-operator (NEGATION): [lex="child" & ! func=".*ACC.*"].
- An empty [] matches any token.

Both text words and attributes allow regular expressions (cf. Regular expressions). Thus, [word="child.*"] will give you all words starting with child. Note that you will have to fill in .* dummies if you are looking for only one of several tags in a field, as in [morph=".*P.*"] and [extra=".*Hprof.*"]. The morph field has more than one tag for most languages, and there is always the risk of there being more than one secondary tag. And, though rarely, syntactic function may be marked as ambiguous or unresolved. Only the word and lex fields are safely 1-item fields.
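The need for .* dummies can be tried out with a small Python sketch. It assumes that a field pattern has to cover the whole field value (start and end markers are presupposed in this interface), and it uses a made-up morph value "P NOM"; the actual tag separator in a given corpus may differ.

```python
import re

def field_matches(pattern: str, field_value: str) -> bool:
    """Mimic whole-field matching: the pattern must cover the entire value."""
    return re.fullmatch(pattern, field_value) is not None

# Hypothetical morph field with two tags (the space separator is an assumption).
morph = "P NOM"

print(field_matches("P", morph))               # the bare tag does not cover the whole field
print(field_matches(".*P.*", morph))           # with dummies, the tag is found anywhere in the field
print(field_matches("child.*", "children"))    # word-form search with a trailing dummy
```

Note that ".*P.*" will also accept any other tag that happens to contain the letter P, so a more specific pattern may be needed in tag sets where that matters.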

Frequency lists: A frequency list is made up of the n highest-ranking words at the sort position, where n can be set in the freq items menu (default is a maximum of 100). The sort position is defined as left/right context or left/right edge of the highlighted hit strings. You can specify positions inside the string or a more distant context by setting the offset value, counting right (+) or left (-) from a standard edge or context position.
The list header contains the corpus name and the overall number of hits in this corpus. If several corpora were searched at the same time, an individual frequency list will be shown for each. The list itself shows the individual frequency of the words encountered at the sort position: num is the absolute number of occurrences, freq/conc the percentage out of all hits (i.e. the concordance), and freq/corp is the in-corpus frequency normalized to 1:100,000,000. The latter is shown together with a weighted frequency, freq/norm, if the rel button is used (relative frequency).
The weighted frequency is used for ranking, so typical/interesting hits can be sorted to the top. This is achieved by dividing the actual frequency by a standard "lexical" frequency taken from a multi-genre mix of background corpora (itself normalized to 1:100,000). The precise metric is the square of the local frequency (freq/conc) divided by overall lexical frequency, times 100,000. A word with an in-concordance frequency of 1% and a norm frequency of 1:10,000 will receive a ranking value of 1. The same goes for a 10% word with a 1:100 norm frequency or a 0.1% word with a 1:1,000,000 norm frequency. If a word w1 in a concordance list has a freq/norm value 900 times higher than another word w2 in the same list, this can mean either that w1 is 30 times more frequent than w2 in the search context (because local frequencies are squared), or that the standard frequency of w1 is 900 times lower than that of w2.
Standard frequencies are set at a minimal value of 1:10,000,000 for rare words. In order to compensate for spelling errors and individual proper nouns, words with only one or two occurrences are frequency-punished with a factor of 0.01 and 0.02 respectively.
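The ranking arithmetic can be sketched in Python as follows. This is an illustration, not the interface's actual code: the function name and parameters are invented, both frequencies are expressed as plain fractions (so the 100,000 scaling cancels out), and the punishment factor is assumed to be applied to the final value. Under these assumptions the three worked examples above all come out at about 1.

```python
def rank_value(local_pct: float, norm_fraction: float, occurrences: int = 3) -> float:
    """Weighted frequency sketch.

    local_pct     -- freq/conc, in percent of all hits
    norm_fraction -- standard lexical frequency as a fraction (1:10,000 -> 1/10_000)
    occurrences   -- absolute number of hits for the word (num)
    """
    min_norm = 1 / 10_000_000            # floor for rare words, as stated in the text
    norm = max(norm_fraction, min_norm)
    local = local_pct / 100              # percent -> fraction
    value = local ** 2 / norm            # squared local frequency over lexical frequency
    # Assumption: the punishment for 1 or 2 occurrences scales the final value.
    penalty = {1: 0.01, 2: 0.02}.get(occurrences, 1.0)
    return value * penalty

print(rank_value(1, 1 / 10_000))        # about 1
print(rank_value(10, 1 / 100))          # about 1
print(rank_value(0.1, 1 / 1_000_000))   # about 1
```

Multiplying the local frequency by 30 multiplies the result by 900, which is the effect described for w1 and w2.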

Info: The Info button leads to a full-sentence pop-up window, providing more context than the concordance (at least a sentence, for some corpora a paragraph), with all grammatical attributes shown (cf. Popup tag window). Also shown, at the bottom, is a reference-id from the corpus in question. It may contain a chunk or sentence id, and - for some corpora - source information like genre, date etc.

Phrases (Syntagmata): Only in treebanks will there be an explicit mark-up of phrase structure. CG corpora carry all information in a word based fashion. A phrase's function will be marked on its head, i.e. the noun in a noun phrase, the adjective in an adjective phrase. If you want to search for and highlight not only the head, but a complete phrase, you have to specify its dependents in your search, too. Let's say you are looking for a direct object np immediately following a verb in the infinitive. A search with 2 fields, verb/infinitive and noun/object, will only find nouns without premodifiers, since these would isolate the verb from its object. Therefore, you will have to allow for a third field (adnominal/prenominal) in the middle, marked with the *-operator for 'zero or more instances' - effectively allowing for np's rather than nouns only. In fact, if you want to allow for all np's, you need to check adverbial adject on the middle field, too, since a prenominal adjective may itself have a dependent, typically an adverb, like very in a very old car.
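As a sketch, the three-field search described above could be assembled in Python like this. The helper is hypothetical, and the tag values are assumptions that will differ between corpora: INF for the infinitive, >N for adnominal/prenominal and >A for adverbial adject are taken from common CG conventions, while ACC and pos="N" follow the examples elsewhere on this page.

```python
def token(*conditions: str) -> str:
    """Build one cqp token bracket from attribute conditions joined with '&'."""
    return "[" + " & ".join(conditions) + "]"

# Field 1: a verb in the infinitive (INF is an assumed morphology tag).
verb = token('pos="V"', 'morph=".*INF.*"')
# Field 2: zero or more prenominal dependents or their adverbial adjects.
premodifiers = token('func=".*>[NA].*"') + "*"
# Field 3: the noun head carrying the direct object function.
head = token('pos="N"', 'func=".*ACC.*"')

query = " ".join([verb, premodifiers, head])
print(query)
# [pos="V" & morph=".*INF.*"] [func=".*>[NA].*"]* [pos="N" & func=".*ACC.*"]
```

The printed string can be pasted on the search line; check the cqp-translation of a menu search in your corpus to see which tag names it really uses.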

Polylexicals (Multi-word expressions): Most CG parsers do their own tokenization, fusing certain expressions, like names or complex prepositions, into one "word", to facilitate syntactic analysis: Iorek_Byrnisson, i_stedet_for. Polylexicals will still be visible in concordance format, and will show up as a single context item in frequency sorting. If you want to search for a word you know is treated as a polylexical, add underscores (_) in the search string, as shown in the examples. In cqp-speak, spaces will suffice, e.g. [word="i stedet for"].

Popup tag window: Touching a word in the concordance with your mouse pointer will give you a popup with grammatical information for the word in question, with the following information:

Word form
Lexeme
Secondary tags, if any (e.g. subcategory, semantic class, special attachment markers, clause type)
POS (part of speech)
Inflexion
Syntactic function

Regular expressions (regex): To work your magic on character strings. Regular expressions can be used instead of ordinary text anywhere in the interface. At the simplest level, you can use optionality and repetition operators, and a dot (.) as a dummy character:

.? = zero or one (optionality operator), e.g. "øve.?" (øve, øver, øvet, øves), or - with an optional 'l' - "l?øve.?" (øve ... + løve, løver, løves)
.* = zero or more (greedy, i.e. as long a match as possible), e.g. "hus.*" (hus, huse, husene, husar, husarrest)
.+ = one or more (greedy, i.e. as long a match as possible), e.g. "hus.+" (huse, husene etc., but not hus), or "oh+" (oh, ohh, ohhh, ...)
.*? = zero or more (non-greedy, i.e. as short a match as possible)
.+? = one or more (non-greedy, i.e. as short a match as possible)

Note that these are the same operators that are used in the menu-based part of the interface (refine search) as optionality and repetition operators for entire search fields.
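The operators above can be tried out in Python's re module, which uses the same notation. The sketch below uses re.fullmatch on a small invented word list to imitate whole-word matching; the word list and the helper name are only for illustration.

```python
import re

def hits(pattern, words):
    """Return the words that the pattern covers completely."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["hus", "huse", "husene", "øve", "øver", "løve", "oh", "ohhh"]

print(hits("hus.*", words))    # ['hus', 'huse', 'husene']
print(hits("hus.+", words))    # ['huse', 'husene']
print(hits("l?øve.?", words))  # ['øve', 'øver', 'løve']
print(hits("oh+", words))      # ['oh', 'ohhh']

# Greedy vs. non-greedy: same start, different length of match.
print(re.match("o.*h", "ohoh").group())   # ohoh
print(re.match("o.*?h", "ohoh").group())  # oh
```

The greedy/non-greedy difference only shows when the pattern may stop early, which is why it is demonstrated with re.match rather than with whole-word matching.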

Another useful regex feature is the set of characters. Sets are enclosed in square brackets [], and ranges can be defined with a hyphen [c1-c2]. Negated sets start with a caret [^...]. Ordinary sets [] are sets of characters. For complex 'sets' of expressions, round brackets are used, and set items are separated by the '|'-operator (logical OR).

ordinary set: [aeiou] (vowels)
range set: [a-z] (alphabet), also combined with individual letters [a-zæøå] or another range [a-zA-Z0-9]
negated set: [^aeiou], [^a-z], [^a-zA-ZæøåÆØÅ]
complex sets: (wine|beer|milk|tea|coffee), also with dummies and character sets, (.*i[sz]e|.*ate) (organise, organize, validate), equivalent to (.*(i[sz]e|ate)) or (.*(i[sz]|at)e)
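A short Python check of the set notation, again using re.fullmatch on an invented word list. It also confirms that the three ways of writing the organise/organize/validate pattern select the same words.

```python
import re

words = ["organise", "organize", "validate", "organ", "tea", "rhythm"]

# Three equivalent spellings of the same complex set.
variants = [r"(.*i[sz]e|.*ate)", r"(.*(i[sz]e|ate))", r"(.*(i[sz]|at)e)"]
results = [[w for w in words if re.fullmatch(p, w)] for p in variants]
print(results[0])                               # ['organise', 'organize', 'validate']
print(results[0] == results[1] == results[2])   # True

# Ordinary set of alternatives, and a negated character set.
print(bool(re.fullmatch(r"(wine|beer|milk|tea|coffee)", "tea")))  # True
print(bool(re.fullmatch(r"[^aeiou]+", "rhythm")))                 # True: no vowel letters
print(bool(re.fullmatch(r"[^aeiou]+", "tea")))                    # False
```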

A number of protected symbols let you define start and end of a string or non-printing characters like tabs and line breaks. Also, some sets are pre-defined as protected symbols:

\t = tab
\n = line break (newline)
\s = space character, tab or newline (plus form feed \f and carriage return \r)
\w = word character (i.e. an alphanumeric letter or underscore [a-zA-Z0-9_])
\d = digit (i.e. a number [0-9])

The \w symbol is a shortcut to circumvent letter sets. However, it is a little unsafe for non-English languages, since accented letters are not necessarily included, depending on system settings. For Danish, for instance, [a-zA-ZæøåÆØÅ] is safest. Another solution is to use 'inverted' symbols, in upper case, which negate the original set symbol. Thus, '\S' means anything but a space character, so it will include accented letters and æøå as well (together with parentheses and the like ...).
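The dependence on settings can be reproduced in Python, used here only as a stand-in for whatever regex engine a given system runs. In Python 3, \w is Unicode-aware by default, while the re.ASCII flag restricts it to [a-zA-Z0-9_], which imitates the "unsafe" setting.

```python
import re

word = "løve"

ascii_w = re.fullmatch(r"\w+", word, flags=re.ASCII)   # ASCII-only \w
unicode_w = re.fullmatch(r"\w+", word)                 # Python 3 default: Unicode-aware
explicit = re.fullmatch(r"[a-zA-ZæøåÆØÅ]+", word)      # explicit Danish letter set
non_space = re.fullmatch(r"\S+", "(løve)", flags=re.ASCII)  # \S also accepts brackets

print(ascii_w is None)        # True: ø is not an ASCII word character
print(unicode_w is not None)  # True
print(explicit is not None)   # True
print(non_space is not None)  # True
```

The explicit set gives the same answer under both settings, which is why it is the safest choice.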

^ = start of string(-line), used in initial position, e.g. /^anti/ (starting with 'anti-', not 'Chianti')
$ = end of string(-line), used in final position, e.g. /anti$/ ('Chianti', not 'anti-')

Start and end markers are presupposed in this interface, so you won't need them. In fact, if you want anti to match both Chianti and antidote, you have to use ".*anti.*".
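In Python terms, the interface behaves like re.fullmatch (markers presupposed), whereas ordinary regex tools behave like re.search and need explicit ^ and $. The word list is invented for illustration.

```python
import re

words = ["Chianti", "antidote", "anti"]

# Interface behaviour: the pattern must cover the whole word.
implicit = [w for w in words if re.fullmatch("anti", w)]
padded = [w for w in words if re.fullmatch(".*anti.*", w)]

# Ordinary regex search with explicit start and end markers.
starts = [w for w in words if re.search("^anti", w)]
ends = [w for w in words if re.search("anti$", w)]

print(implicit)  # ['anti']
print(padded)    # ['Chianti', 'antidote', 'anti']
print(starts)    # ['antidote', 'anti']
print(ends)      # ['Chianti', 'anti']
```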

At the most advanced level, regular expressions allow variables. In fact, any bracketed part () of a regular expression is regarded as a variable and can be referred to later in the same expression. Variables are named as back-slashed numbers, counting round brackets from left to right as \1, \2, \3 etc. However, cqp does not allow this particular feature. You can still use it in the old interface (rectangular flags), on running text corpora. \s([gc][aeiou])[^ ]+( \1\w+)+\s finds you "Gaelic" alliteration rhymes in an English corpus. \s\w(\w+)-\w\1\s, surrounded by single spaces, finds you lots of "willy-nilly"-constructions as well as some "cut-out" and "four-hour" cases. \s[a-z]([a-z]+)\-[a-z]\1\s does the same, but avoids soccer results.
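Since cqp itself does not support variables, the three expressions can be tested outside the interface, for instance in Python, whose re module uses the same \1 notation. The test strings below are made up.

```python
import re

alliteration = re.compile(r"\s([gc][aeiou])[^ ]+( \1\w+)+\s")
reduplication = re.compile(r"\s\w(\w+)-\w\1\s")
letters_only = re.compile(r"\s[a-z]([a-z]+)\-[a-z]\1\s")

# \1 repeats whatever the first bracket matched ("ca" in this case).
m = alliteration.search(" the cat caught mice ")
print(repr(m.group(0)))  # ' cat caught '

for sample in [" willy-nilly ", " four-hour ", " cut-out ", " 21-11 "]:
    print(sample.strip(),
          bool(reduplication.search(sample)),
          bool(letters_only.search(sample)))
# The \w version also accepts the score 21-11; the [a-z] version does not.
```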

Start of sentence: In the newer corpora, each sentence has a start-marker (¤) which is regarded as a "word-form", and can be used to look for sentence-initial items, e.g. finite verbs in inverted Danish sentences.