Tokenizer for German
I have implemented a tokenizer for German in Perl, which can be used by anybody
who is interested. It optionally provides a rather detailed analysis of the
tokens (and whitespace) in the input text. Please read the license
terms before you download the software. By
downloading the software you agree to the terms stated there.
Any feedback
is heartily welcome.
Download
Usage
$ perl tokenize.perl [OPTIONS] <fileIn.text> <fileOut.tok>
If you call the script without any argument, you will get an overview of all
OPTIONS (also documented below).
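For example, assuming an input file input.txt and an output file output.xml (both file names are only placeholders), a call producing XML output with type information and an abbreviation list could look like this:
$ perl tokenize.perl -xml -type -abbrev abbrev.lex input.txt output.xml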
Options and Features
The tokenizer reads in plain text (and optionally a list of abbreviations) and
produces a tokenized version. The following sections describe the output formats
and the available options in more detail:
- Output formats
- Text format:
- Tokens are separated by spaces
- One sentence per line
- Multiple empty lines in the input file are interpreted as paragraph
boundaries and recorded as a single empty line in the output file
- XML format:
- Uses the XML tags <tok>, <sent_bound/>, <newline/> (and, optionally, <space>)
- Options
- Optional use of a list of abbreviations:
[-a|-abbrev <abbrev>]
<abbrev> specifies a file with a list of abbreviations
- List format: one abbreviation per line (like “etc.”)
- Abbreviations can also be given as regular expressions: “/regex/” (e.g. “/str./”); a sample abbreviation file is sketched after this list
- Optional XML output:
[-x|-xml]
- The XML output optionally records all white space:
[-s|-space]
- Simple linebreaks are ignored
- Multiple empty lines are squeezed
- Leading and trailing empty lines are deleted
- The output optionally records the “types” of words
(and spaces, with XML output):
[-t|-type]
- for words:
- unmarked default:
[a-zA-Z]+
- “alphanum”: if the word contains digits (among other characters)
- “mixed”: if the word contains characters like brackets, quotes, …
- “allCap”: if the word consists of capital letters only
- for numbers:
- “card”: cardinals
- “ord”: ordinals
- “year” (see below)
- for abbreviations: “abbrev”, with an additional “source” value indicating how the abbreviation was recognized:
- “listed”, i.e. the full abbreviation is listed in the file <abbrev>
- “regEx”, i.e. a matching regex is listed in the file <abbrev>
- “nextWordLC”, i.e. the next word is lower case
- for special characters:
- “specialChar_lead”: special characters preceding a word, like “(”
- “specialChar_trail”: special characters following a word, like “)”
- “punc”: punctuation marks
- for whitespace:
- unmarked default: single space
- “tab”: tabulator
- “carrRet”: carriage return
- “unknown”: anything else
NOTE: multiple types are possible (e.g. type="space,tab")
- Variants of “year recognizers”:
[-y|-yearRobust].
This triggers a simplified version of date tagging:
- Year expression candidates: four-digit numbers of the form:
(1|2)[0-9][0-9][0-9], i.e. covering the years 1000–2999
- The default recognizer carefully checks the preceding context of the
number (for expressions like ‘Januar’ or ‘Winter’ or ‘Jahr’) and will
therefore miss year expressions as in “1999 regnete es oft.”
- The “robust” recognizer ignores the context, i.e. any four-digit number
starting with 1 or 2 will be interpreted as a year expression. It
therefore incorrectly analyses the cardinal in “Es gibt 1999 Optionen.”
as a year expression.
- NOTE: the default year recognizer does not work if option
-s is chosen!
NOTE: The script contains a hard-wired list of German date expressions (if
you want to change them, you will have to edit the value of the variable
“$yearRegex” in the Perl script). A minimal sketch of both recognition
strategies is given after this list.
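An abbreviation file for the option -abbrev simply mixes plain entries and /regex/ entries, one per line. The following lines are only an illustration (the actual contents of the abbrev.lex used in the example below are not shown here):
etc.
u.a.
z.B.
/str./
The following minimal Perl sketch illustrates the difference between the default and the robust year recognition described above. It is not the code of tokenize.perl; the helper is_year and the keyword list are assumptions made for this illustration, with the keywords loosely mirroring the hard-wired “$yearRegex” mentioned in the note:
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative subset of German date keywords (the real, hard-wired list
# lives in the variable $yearRegex inside tokenize.perl).
my $yearRegex = qr/\b(?:Jahr(?:e|en)?|Januar|Februar|Winter|Sommer)\b/;

# Year expression candidates: four-digit numbers of the form (1|2)[0-9][0-9][0-9].
my $candidate = qr/^(1|2)[0-9][0-9][0-9]$/;

sub is_year {
    my ($number, $left_context, $robust) = @_;
    return 0 unless $number =~ $candidate;       # not a candidate at all
    return 1 if $robust;                         # robust: context is ignored
    return $left_context =~ $yearRegex ? 1 : 0;  # default: date keyword required
}

print is_year('1999', 'In den Jahren', 0), "\n"; # 1: keyword "Jahren" precedes the number
print is_year('1999', 'Es gibt',       0), "\n"; # 0: no date keyword in the preceding context
print is_year('1999', 'Es gibt',       1), "\n"; # 1: robust mode also tags the cardinal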
Example
- Sample input text:
Das hier ist tatsächlich ein Mini-Testtext. Er testet u.a. Abkürzungen wie
“Hauptstr. 3” und mit dem Ausdruck 4.9.2008 (oder auch 4. 9. 2008) einen
Datumsausdruck. In den Jahren 1999 und 2000 hat es 1999 Liter geregnet.
- Text output without any options (misses the abbreviations “u.a.” and “Hauptstr.”):
Das hier ist tatsächlich ein Mini-Testtext .
Er testet u.a .
Abkürzungen wie “ Hauptstr .
3 “ und mit dem Ausdruck 4.9.2008 ( oder auch 4. 9. 2008 ) einen Datumsausdruck .
In den Jahren 1999 und 2000 hat es 1999 Liter geregnet .
- Text output with option -abbrev abbrev.lex:
Das hier ist tatsächlich ein Mini-Testtext .
Er testet u.a. Abkürzungen wie “ Hauptstr. 3 “ und mit dem Ausdruck 4.9.2008 ( oder auch 4. 9. 2008 ) einen Datumsausdruck .
In den Jahren 1999 und 2000 hat es 1999 Liter geregnet .
- XML output with options -xml -type -abbrev abbrev.lex:
<?xml version="1.0" encoding="utf-8"?>
<text>
<tok>Das</tok>
<tok>hier</tok>
<tok>ist</tok>
<tok>tatsächlich</tok>
<tok>ein</tok>
<tok>Mini-Testtext</tok>
<tok type='punc'>.</tok>
<sent_bound/>
<tok>Er</tok>
<tok>testet</tok>
<tok type='abbrev' source='listed'>u.a.</tok>
<tok>Abkürzungen</tok>
<tok>wie</tok>
<tok type='specialChar_lead'>"</tok>
<tok type='abbrev' source='regEx'>Hauptstr.</tok>
<tok type='card'>3</tok>
<tok type='specialChar_trail'>"</tok>
<tok>und</tok>
<tok>mit</tok>
<tok>dem</tok>
<tok>Ausdruck</tok>
<tok type='alphanum,mixed'>4.9.2008</tok>
<tok type='specialChar_lead'>(</tok>
<tok>oder</tok>
<tok>auch</tok>
<tok type='ord'>4.</tok>
<tok type='ord'>9.</tok>
<tok type='year'>2008</tok>
<tok type='specialChar_trail'>)</tok>
<tok>einen</tok>
<tok>Datumsausdruck</tok>
<tok type='punc'>.</tok>
<sent_bound/>
<tok>In</tok>
<tok>den</tok>
<tok>Jahren</tok>
<tok type='year'>1999</tok>
<tok>und</tok>
<tok type='card'>2000</tok>
<tok>hat</tok>
<tok>es</tok>
<tok type='card'>1999</tok>
<tok>Liter</tok>
<tok>geregnet</tok>
<tok type='punc'>.</tok>
<sent_bound/>
</text>
Note that the year analyzer only recognizes the first of the two year expressions
in “In den Jahren 1999 und 2000” because it checks the preceding context for
selected keywords such as “Jahr”. With the option -yearRobust, both would be
marked as year expressions.
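Since every <tok> element in the XML output above occupies a line of its own, the tokens can, for instance, be extracted again with a simple Perl one-liner like the following (fileOut.xml is only a placeholder for the XML output file, and XML entities are not unescaped):
$ perl -ne 'print "$1\n" while m{<tok[^>]*>([^<]*)</tok>}g' fileOut.xml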