Guidelines for the Normalization of Historical Data

Historical data varies considerably compared with modern standard data. This variation is a major obstacle to the automatic analysis of such data (e.g. automatic part-of-speech tagging). We therefore developed a set of guidelines for mapping historical wordforms to modern wordforms, with the basic objective of facilitating the automatic analysis of historical data.

This page gives a brief overview of our guidelines; the full guidelines can be found in the following publication:

  • Julia Krasselt, Marcel Bollmann, Stefanie Dipper, and Florian Petran (2015). Guidelines für die Normalisierung historischer deutscher Texte / Guidelines for Normalizing Historical German Texts. Bochumer Linguistische Arbeitsberichte 15.

The underlying principle of our guidelines is the distinction between a normalization layer and a modernization layer. For annotation, the text is displayed vertically, one token per line, with separate columns for the normalized and the modernized form (and a column indicating the type of adjustment, see below).

On the normalization layer, a given historical wordform is mapped to a close modern cognate by applying sound and/or spelling changes. If no further adjustment is needed, the modernization layer and the type of adjustment remain empty, as in examples (1) and (2).

(1) Me1 Melk, 347,01

original  normalization  modernization  type of adjustment  engl. translation
Ain       Ein                                               A
höcher    hoher                                             high
lerer     Lehrer                                            teacher
hieſ      hieß                                              was named
Anſelm    Anselm                                            Anselm

‘A high teacher was named Anselm…’

(2) Me1 Melk, 347,08

original  normalization  modernization  type of adjustment  engl. translation
mein      mein                                              my
liebes    liebes                                            dear
chind     Kind                                              child
hat       hat                                               has
groſſe    große                                             huge
marter    Marter                                            torment
erliten   erlitten                                          suffered

‘My dear child suffered huge torment.’
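
One way to hold such an annotation in a program is one record per token, with the two optional columns left empty when only normalization applies. The following is a minimal Python sketch, not part of the guidelines; the class name `Token`, its field names, and the helper `normalized_text` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical layout, one object per table row)."""
    original: str                        # historical wordform as attested
    normalization: str                   # close modern cognate
    modernization: Optional[str] = None  # empty unless an adjustment is needed
    adjustment: Optional[str] = None     # type of adjustment, empty otherwise


# Example (2): every token is handled on the normalization layer alone.
example_2: List[Token] = [
    Token("mein", "mein"),
    Token("liebes", "liebes"),
    Token("chind", "Kind"),
    Token("hat", "hat"),
    Token("groſſe", "große"),
    Token("marter", "Marter"),
    Token("erliten", "erlitten"),
]


def normalized_text(tokens: List[Token]) -> str:
    """Join the normalization layer into a running text."""
    return " ".join(token.normalization for token in tokens)


print(normalized_text(example_2))
```

Keeping the empty columns as `None` rather than copying the normalized form makes it easy to see later which tokens actually received an adjustment.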

Because the normalization layer applies only sound and/or spelling changes, the normalized form sometimes does not meet modern syntactic or semantic requirements. Where necessary, the normalized form is therefore adjusted to an inflectionally or semantically appropriate modern equivalent on the modernization layer. Semantic and inflectional adjustments are marked as ‘sem’ and ‘infl’, respectively.

  • Changes in a word’s meaning (semantic widening or narrowing, amelioration or pejoration):

(3) B1 Berlin, 8v,16

original  normalization  modernization  type of adjustment  engl. translation
Petrus    Petrus                                            Peter
behilt    behielt        versteckte     sem                 kept > hid
sich      sich                                              himself
vnder     unter          hinter         sem                 under > behind
eyme      einem                                             a
steyne    Stein                                             stone

‘Peter hid behind a stone.’

  • Changes from an irregular past tense inflection to a regular past tense inflection:

(4) B1 Berlin, 17v,02

original  normalization  modernization  type of adjustment  engl. translation
Do        da                                                then
ſweyg     schweig(t)     schwieg        infl                remains silent > remained silent
myn       mein                                              my
lybes     liebes                                            dear
kynt      Kind                                              child

‘Then my dear child remained silent.’

Modernization also applies when a historical wordform has no close modern cognate. In this case, a virtual historical wordform (as found in historical dictionaries) serves as the normalized form, and a suitable translation is used for modernization. This kind of adjustment is marked with ‘ext’ (for “extinct”).

(5) Me1 Melk, 349,14

original  normalization  modernization  type of adjustment  engl. translation
do        da                                                then
chom      kam                                               came
czuhant   zehant         sofort         ext                 immediately
Iudas     Judas                                             Judas
In        in                                                in
den       den                                               the
Garten    Garten                                            garden

‘Then Judas immediately came into the garden…’
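
The interplay of the columns can be checked mechanically. The sketch below is illustrative only: it assumes that a modernized form and a type of adjustment always occur together and that ‘sem’, ‘infl’ and ‘ext’ are the only types in use, as in this overview (the full guidelines may define further cases). The names `Row`, `check_row`, `modern_form` and `modern_text` are hypothetical.

```python
from typing import List, NamedTuple, Optional

# Adjustment types mentioned in this overview (assumed to be exhaustive here).
ADJUSTMENT_TYPES = {"sem", "infl", "ext"}


class Row(NamedTuple):
    original: str
    normalization: str
    modernization: Optional[str] = None
    adjustment: Optional[str] = None


def check_row(row: Row) -> List[str]:
    """Return a list of consistency problems; an empty list means the row is fine."""
    problems = []
    if not row.normalization:
        problems.append("normalization is missing")
    # Assumption: both columns are filled together or both stay empty.
    if (row.modernization is None) != (row.adjustment is None):
        problems.append("modernization and type of adjustment must be set together")
    if row.adjustment is not None and row.adjustment not in ADJUSTMENT_TYPES:
        problems.append(f"unknown type of adjustment: {row.adjustment!r}")
    return problems


def modern_form(row: Row) -> str:
    """Use the modernized form if present, otherwise the normalized form."""
    return row.modernization if row.modernization is not None else row.normalization


def modern_text(rows: List[Row]) -> str:
    return " ".join(modern_form(row) for row in rows)


# Example (5): only 'czuhant' needs the modernization layer.
example_5 = [
    Row("do", "da"),
    Row("chom", "kam"),
    Row("czuhant", "zehant", "sofort", "ext"),
    Row("Iudas", "Judas"),
    Row("In", "in"),
    Row("den", "den"),
    Row("Garten", "Garten"),
]

for row in example_5:
    assert check_row(row) == []
print(modern_text(example_5))
```

Falling back to the normalized form in `modern_form` mirrors the convention that the modernization column stays empty whenever normalization alone already yields an acceptable modern wordform.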

Manually normalized data created according to these guidelines can be used as training data for Norma, our automatic normalization tool.
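
As a rough illustration of how annotated rows could be turned into training pairs, the sketch below extracts (original, target) pairs from example (3). It is not Norma's interface: the function names `training_pairs` and `to_tsv`, the choice of target layer, and the tab-separated layout are assumptions, and Norma's own documentation should be consulted for the format it actually expects.

```python
from typing import Iterable, Iterator, Optional, Tuple

# (original, normalization, modernization or None); hypothetical row layout.
Row = Tuple[str, str, Optional[str]]


def training_pairs(rows: Iterable[Row],
                   layer: str = "normalization") -> Iterator[Tuple[str, str]]:
    """Yield (original, target) pairs for the chosen target layer."""
    if layer not in ("normalization", "modernization"):
        raise ValueError(f"unknown layer: {layer!r}")
    for original, normalization, modernization in rows:
        if layer == "modernization" and modernization:
            yield original, modernization
        else:
            # An empty modernization column falls back to the normalized form.
            yield original, normalization


def to_tsv(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize pairs as one tab-separated line each (assumed layout)."""
    return "".join(f"{source}\t{target}\n" for source, target in pairs)


# Example (3), with the two semantic adjustments on the modernization layer.
example_3 = [
    ("Petrus", "Petrus", None),
    ("behilt", "behielt", "versteckte"),
    ("sich", "sich", None),
    ("vnder", "unter", "hinter"),
    ("eyme", "einem", None),
    ("steyne", "Stein", None),
]

print(to_tsv(training_pairs(example_3)), end="")
print(to_tsv(training_pairs(example_3, layer="modernization")), end="")
```

Which layer serves as the target is a design decision: the normalization layer stays close to the historical spelling, whereas the modernization layer also contains the semantic, inflectional and ‘ext’ replacements.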