Weihnachtsgurke search language¶
This document describes the search language format for the Weihnachtsgurke tool.
General considerations¶
The conditions for a particular search are entered in a text file, which
can have any extension. Whitespace at line beginnings is ignored; this
means that you can use indentation to logically structure the file. Lines
beginning with # (optionally preceded by whitespace) are comments,
and are ignored completely.
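The two rules above can be sketched in Python as follows. This is only an illustration, not the tool's own code: the function name significant_lines is hypothetical, and the handling of blank lines and trailing whitespace (left untouched here) is an assumption, since this document does not specify it.

```python
def significant_lines(text):
    """Yield each line with leading whitespace removed, skipping comments."""
    for raw in text.splitlines():
        line = raw.lstrip()          # whitespace at line beginnings is ignored
        if line.startswith("#"):     # comment, possibly indented
            continue
        yield line


sample = """\
# find cats
  D
    # an indented comment
  N{cat}
"""

print(list(significant_lines(sample)))  # ['D', 'N{cat}']
```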
Searches¶
A Weihnachtsgurke file is divided into one or more searches. The text ==
on a line by itself separates consecutive searches in a file.
The search’s name should be on the first non-blank line, followed by a
colon. The name is followed by the search pattern, in the format
described below. Here is an example:
name1:
<searchterms>
==
name2:
<searchterms>
A double-equals at the beginning or end of the file (on a line before
name1 or after the second search terms) is optional, and ignored.
For convenience, a file containing a single search need not have a name;
in this case the name default will be applied.
Search names¶
Search names may consist (only) of uppercase and lowercase letters of the English alphabet, and numerals 0-9.
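As a rough sketch of the file layout described above, the following Python splits a file into named searches. The function name split_searches is hypothetical, and several details are assumptions rather than documented behaviour: blank lines are skipped, trailing whitespace is stripped, any unnamed block receives the name default, and a first line ending in a colon is always treated as the search name.

```python
import re

# Search names: English letters and the digits 0-9 only.
NAME_RE = re.compile(r"[A-Za-z0-9]+")


def split_searches(text):
    """Return a dict mapping each search name to its list of term lines."""
    searches = {}
    block = []
    lines = [ln.strip() for ln in text.splitlines()]
    for line in lines + ["=="]:          # a final == flushes the last block
        if line != "==":
            if line and not line.startswith("#"):
                block.append(line)
            continue
        if not block:                    # leading or trailing == is ignored
            continue
        if block[0].endswith(":"):
            name, terms = block[0][:-1], block[1:]
        else:
            name, terms = "default", block
        if not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid search name: {name!r}")
        searches[name] = terms
        block = []
    return searches


example = "==\nname1:\n  D\n  N{cat}\n==\nname2:\n  MD\n  VB\n==\n"
print(split_searches(example))
# {'name1': ['D', 'N{cat}'], 'name2': ['MD', 'VB']}
```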
Search terms¶
A Weihnachtsgurke search term consists of four parts: a tag matcher, a word matcher, a repeater, and a name. These together constitute a line of text (terminated with a newline character). Several lines of search terms can be combined: this requires all of them to match sequentially.
Matchers¶
Both of the matchers use Python regular expression syntax.
Either one or both of the matchers can be supplied. Neither can contain
a space character. Both matchers are anchored – they must match the
whole tag or word. If you wish to match only a prefix, include the text
.* at the end of the matcher (a regular expression snippet which
matches any number of characters). Thus, the following line matches
only a singular common noun:
N
The following line matches a singular or plural common or proper noun:
N|NS|NPR|NPRS
The following line matches any tag beginning with N (N, NS, NEG, ...):
N.*
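The effect of anchoring can be tried out directly in Python. The sketch below uses re.fullmatch, which is one way to require that a pattern cover the whole string; whether the tool itself uses this function is an assumption, and the helper name tag_matches is hypothetical.

```python
import re


def tag_matches(matcher, tag):
    """True if the matcher covers the entire tag (anchored at both ends)."""
    return re.fullmatch(matcher, tag) is not None


tags = ["N", "NS", "NPR", "NPRS", "NEG"]
for matcher in ["N", "N|NS|NPR|NPRS", "N.*"]:
    print(matcher, [t for t in tags if tag_matches(matcher, t)])
# N ['N']
# N|NS|NPR|NPRS ['N', 'NS', 'NPR', 'NPRS']
# N.* ['N', 'NS', 'NPR', 'NPRS', 'NEG']
```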
The tag matcher comes at the beginning of the line, and will typically
use uppercase letters to match tags in the corpus’s tagset. The word
matcher is appended to the tag matcher, and enclosed in curly bracket
characters {}. Thus, the following line matches the word “cat”:
{cat}
The following line matches “cat” only when it is a singular common noun:
N{cat}
Note that the matching is case sensitive. In order to match
case-insensitively, it is necessary to write each letter as a
regular expression character class containing both cases: [Cc][Aa][Tt]
will match the word “cat” case-insensitively.
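Writing such character classes by hand becomes tedious for longer words. The helper below, whose name case_insensitive is hypothetical and which is not part of the tool, generates the pattern for a given word so that it can be pasted into a word matcher.

```python
import re


def case_insensitive(word):
    """Build a pattern matching `word` regardless of letter case."""
    parts = []
    for ch in word:
        if ch.lower() != ch.upper():
            # A cased letter: list both forms in a character class.
            parts.append(f"[{ch.upper()}{ch.lower()}]")
        else:
            # Digits, punctuation, etc.: match literally.
            parts.append(re.escape(ch))
    return "".join(parts)


pattern = case_insensitive("cat")
print(pattern)                                # [Cc][Aa][Tt]
print(bool(re.fullmatch(pattern, "CAT")))     # True
print(bool(re.fullmatch(pattern, "cats")))    # False
```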
Repeater¶
The repeater specifies whether and how a match can repeat. Repeaters are inspired by regular expression syntax, but must be separated from the matcher(s) by a space character. There are three options:
- optional: the character ? indicates that the given term may match zero or one times.
- repeat: the character + indicates that the given term may match one or more times.
- optional-repeat: the character * indicates that the given term may match zero or more times.
Thus, the following matches an NP with a determiner, optionally a single adjective, and the noun “cat”. This includes “the cat,” “the fluffy cat,” “a cat,” etc.:
D
ADJ ?
N{cat}
The following requires there to be at least one adjective describing the cat, and permits multiple adjectives: “a fluffy cat,” “the fluffy orange cat,” etc.:
D
ADJ +
N{cat}
Finally, the following matches a modal followed by any number of adverbs (even 0) followed by an infinitive verb:
MD
ADV *
VB
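To make the sequential matching concrete, here is a small backtracking sketch in Python. It is not the tool's implementation. It assumes, hypothetically, that the corpus is available as a list of (tag, word) pairs and that each term is a (tag matcher, word matcher, repeater) triple; the names BOUNDS, token_matches and match_from are invented for this illustration, and the greedy strategy is likewise an assumption.

```python
import re

# Minimum and maximum repetitions for each repeater (None = unbounded).
BOUNDS = {None: (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}


def token_matches(tag_re, word_re, token):
    """Check one (tag, word) token against anchored tag and word matchers."""
    tag, word = token
    tag_ok = tag_re is None or re.fullmatch(tag_re, tag) is not None
    word_ok = word_re is None or re.fullmatch(word_re, word) is not None
    return tag_ok and word_ok


def match_from(terms, tokens, pos=0):
    """Return the end index of a match starting at tokens[pos], or None."""
    if not terms:
        return pos
    (tag_re, word_re, repeater), rest = terms[0], terms[1:]
    low, high = BOUNDS[repeater]
    # Count how many consecutive tokens this term is able to consume.
    count = 0
    while (pos + count < len(tokens)
           and (high is None or count < high)
           and token_matches(tag_re, word_re, tokens[pos + count])):
        count += 1
    # Try the longest repetition first, backing off down to the minimum.
    for used in range(count, low - 1, -1):
        end = match_from(rest, tokens, pos + used)
        if end is not None:
            return end
    return None


with_optional = [("D", None, None), ("ADJ", None, "?"), ("N", "cat", None)]
with_repeat = [("D", None, None), ("ADJ", None, "+"), ("N", "cat", None)]

the_cat = [("D", "the"), ("N", "cat")]
fluffy_orange = [("D", "the"), ("ADJ", "fluffy"),
                 ("ADJ", "orange"), ("N", "cat")]

print(match_from(with_optional, the_cat))        # 2
print(match_from(with_repeat, the_cat))          # None
print(match_from(with_repeat, fluffy_orange))    # 4
print(match_from(with_optional, fluffy_orange))  # None
```

The last result shows that the optional repeater admits at most one adjective, so “the fluffy orange cat” is matched only by the version using +.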
Note: it is important to include the space between the matchers and
the repeater. If this is not included, the repeater will be interpreted
as part of the matcher instead. ADV * matches optional adverbs, but
ADV* matches the tags AD, ADV, ADVV, ....
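The second reading can be checked in Python, where ADV* is an ordinary regular expression meaning “AD followed by any number of V”. The tag list below is made up for the illustration.

```python
import re

tags = ["AD", "ADV", "ADVV", "ADJ"]

# Without the space, the * belongs to the regular expression itself.
print([t for t in tags if re.fullmatch("ADV*", t)])
# ['AD', 'ADV', 'ADVV']
```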
The repeat (+) and optional-repeat (*) repeaters cannot be combined with a name; see the Name section below.
Name¶
The name is completely optional. If it is specified, the tag and word
matched by the term will be saved in the output in columns named
name_tag and name_word respectively (with name replaced by the name
given). The name is specified by the string " as " (the word as with a
space on either side) appended to the search term, followed by the name.
A name can consist (only) of uppercase and lowercase English letters and
digits 0-9. Thus, the following terms will match NPs referring to cats,
and will allow us to tabulate the kinds of adjectives used to describe
them (by examining the adj_word column in the output):
D
ADJ as adj
N{cat}
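Putting the four parts together, a search-term line could be taken apart roughly as follows. This parser is a sketch under stated assumptions, not the tool's own: the names Term and parse_term are hypothetical, the order matcher, repeater, name is assumed, and the word matcher is assumed to run from the first { to a closing } at the end of the matcher (so a tag matcher that itself contains braces would be misread).

```python
import re
from typing import NamedTuple, Optional


class Term(NamedTuple):
    tag: Optional[str]       # tag matcher (regular expression) or None
    word: Optional[str]      # word matcher (regular expression) or None
    repeater: Optional[str]  # "?", "+", "*" or None
    name: Optional[str]      # output column prefix or None


def parse_term(line):
    """Split one search-term line into its four parts."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty search term")
    matcher, rest = tokens[0], tokens[1:]
    tag, word = matcher, None
    if matcher.endswith("}") and "{" in matcher:
        tag, _, word = matcher[:-1].partition("{")
    repeater = None
    if rest and rest[0] in ("?", "+", "*"):
        repeater, rest = rest[0], rest[1:]
    name = None
    if rest:
        valid = (len(rest) == 2 and rest[0] == "as"
                 and re.fullmatch(r"[A-Za-z0-9]+", rest[1]) is not None)
        if not valid:
            raise ValueError(f"cannot parse: {line!r}")
        name = rest[1]
    if name is not None and repeater in ("+", "*"):
        raise ValueError("+ and * cannot be combined with a name")
    return Term(tag or None, word, repeater, name)


print(parse_term("N{cat}"))      # tag 'N', word 'cat'
print(parse_term("ADJ as adj"))  # tag 'ADJ', name 'adj'
print(parse_term("ADV *"))       # tag 'ADV', repeater '*'
print(parse_term("ADV*"))        # tag 'ADV*', no repeater
```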
Tips and tricks¶
Negative matches¶
Python has a facility for negative (lookahead) assertions in regular
expressions, which verifies that a certain expression does not match.
This is expressed by the syntax (?!regex). Note that this
construction does not advance the match window. Thus, in common usage,
it should be followed by .* outside of the negative assertion. For
an example of matching any word except “only” (and spelling variants),
see the Example section below.
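The behaviour of a negative assertion can be explored in Python before it is used in a search file. The sketch below again uses re.fullmatch to imitate anchoring, which is an assumption about how the tool applies its matchers; the word list is made up.

```python
import re

pattern = r"(?!only|onely).*"

for word in ["only", "onely", "to", "lonely", "onlyness"]:
    print(word, bool(re.fullmatch(pattern, word)))
# only False
# onely False
# to True
# lonely True
# onlyness False

# Without the trailing .* the assertion consumes nothing, so an anchored
# match against a non-empty word can never succeed.
print(bool(re.fullmatch(r"(?!only|onely)", "to")))  # False
```

Note that the pattern also rejects longer words that merely begin with “only”, such as “onlyness”. If that matters, one possible variant (not taken from this document) is (?!(?:only|onely)$).*, which rejects only the exact words.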
TODO¶
TODO: what else to include here?
Example¶
Here is an example search file which allows us to search for negative declarative sentences with a pronoun subject that either have or lack do support:
do:
PRO as subject
ADV *
DOD|DOP
ADV *
NEG
ADV *
VB as verb
==
simple:
PRO as subject
ADV *
VBP|VBD as verb
ADV|PRO *
NEG
{(?!only|onely).*} as foll1
.* as foll2
Adverbs are allowed to intervene freely; the simple case also
allows pronouns to intervene between the verb and the negation, as in
“I saw it not.” The output of this search allows the subject and verb
to be examined (for example, to eliminate tagging errors where
the subject is not actually a nominative case pronoun). The regular
expression associated with foll1 is a negative match, covered under
“Negative matches” above. It excludes cases like “I know not only Bob but
also his family.”
A complete use of this search would involve further filtering of foll1
and foll2 to eliminate cases like “He told me not to call after 8pm,”
which contains a string (“he told me not”) that, without this filtering,
would be counted as a failure of the do support rule to apply, whereas
it clearly is not.