Guidelines for the Normalization of Historical Data
Historical data varies considerably compared with modern standard data. This variation is a major obstacle to the automatic analysis of such data (e.g., automatic part-of-speech tagging). To address this, we developed a set of guidelines for mapping historical wordforms to modern wordforms. The basic objective is to facilitate the automatic analysis of historical data.
This page gives a brief overview of our guidelines; the full guidelines can be found in the following publication:
- Julia Krasselt, Marcel Bollmann, Stefanie Dipper, and Florian Petran (2015). Guidelines für die Normalisierung historischer deutscher Texte / Guidelines for Normalizing Historical German Texts. Bochumer Linguistische Arbeitsberichte: 15. [PDF]
The underlying principle of our guidelines is the distinction between a normalization layer and a modernization layer. For annotation, the text is displayed vertically with separate columns for the normalized and the modernized form (and a column indicating the type of adjustment, see below).
On the normalization layer, a given historical wordform is mapped to a close modern cognate by applying sound and/or spelling changes. If this mapping is sufficient, as in examples (1) and (2), the modernization layer and the type of adjustment remain empty.
(1) Me1 Melk, 347,01
original | normalization | modernization | type of adjustment | engl. translation |
---|---|---|---|---|
Ain | Ein | – | – | A |
höcher | hoher | – | – | high |
lerer | Lehrer | – | – | teacher |
hieſ | hieß | – | – | was named |
Anſelm | Anselm | – | – | Anselm |
‘A high teacher was named Anselm…’
(2) Me1 Melk, 347,08
original | normalization | modernization | type of adjustment | engl. translation |
---|---|---|---|---|
mein | mein | – | – | my |
liebes | liebes | – | – | dear |
chind | Kind | – | – | child |
hat | hat | – | – | has |
groſſe | große | – | – | great |
marter | Marter | – | – | torment |
erliten | erlitten | – | – | suffered |
‘My dear child suffered great torment.’
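The column layout used in examples (1) and (2) can be read into a small data structure. The following Python sketch is illustrative only: the `Token` class, the `parse_row` helper, and the treatment of ‘–’ as an empty cell are assumptions made for this example, not part of the guidelines or of any tool mentioned on this page.

```python
from dataclasses import dataclass
from typing import Optional

EMPTY = "–"  # assumption: the en dash marks an empty cell, as in the tables above


@dataclass
class Token:
    """One annotated wordform (hypothetical structure for illustration)."""
    original: str
    normalization: str
    modernization: Optional[str] = None
    adjustment: Optional[str] = None


def parse_row(row: str) -> Token:
    """Parse 'original | normalization | modernization | type | translation |'."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    original, normalization, modernization, adjustment = cells[:4]
    return Token(
        original=original,
        normalization=normalization,
        modernization=None if modernization == EMPTY else modernization,
        adjustment=None if adjustment == EMPTY else adjustment,
    )


# Rows of example (1)
rows = [
    "Ain | Ein | – | – | A |",
    "höcher | hoher | – | – | high |",
    "lerer | Lehrer | – | – | teacher |",
    "hieſ | hieß | – | – | was named |",
    "Anſelm | Anselm | – | – | Anselm |",
]
tokens = [parse_row(row) for row in rows]
normalized_text = " ".join(token.normalization for token in tokens)
print(normalized_text)
```

Since every row in example (1) needs only a normalization, `modernization` and `adjustment` stay `None` for all five tokens.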
Because normalization applies only sound and/or spelling changes, the normalized form sometimes does not meet modern syntactic or semantic requirements. In such cases, the normalized form is adjusted on the modernization layer to an inflectionally or semantically appropriate modern equivalent. Semantic and inflectional adjustments are marked as ‘sem’ and ‘infl’, respectively.
- Changes in a word’s meaning (semantic widening or narrowing, amelioration or pejoration):
(3) B1 Berlin, 8v,16
original | normalization | modernization | type of adjustment | engl. translation |
---|---|---|---|---|
Petrus | Petrus | – | – | Peter |
behilt | behielt | versteckte | sem | kept > hid |
sich | sich | – | – | himself |
vnder | unter | hinter | sem | under > behind |
eyme | einem | – | – | a |
steyne | Stein | – | – | stone |
‘Peter hid behind a stone.’
- Changes from an irregular past tense inflection to a regular past tense inflection:
(4) B1 Berlin, 17v,02
original | normalization | modernization | type of adjustment | engl. translation |
---|---|---|---|---|
Do | da | – | – | then |
ſweyg | schweig(t) | schwieg | infl | remains silent > remained silent |
myn | mein | – | – | my |
lybes | liebes | – | – | dear |
kynt | Kind | – | – | child |
‘Then my dear child remained silent.’
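One way to read the two layers together is to treat a filled modernization cell as an override of the normalized form. The sketch below illustrates this with the data of example (3); the `modern_form` helper and the tuple layout are hypothetical and chosen only for this illustration.

```python
from typing import List, Optional, Tuple

# (original, normalization, modernization or None); layout is an assumption
Row = Tuple[str, str, Optional[str]]


def modern_form(normalization: str, modernization: Optional[str]) -> str:
    """Return the modernization if one was annotated, else the normalization."""
    return modernization if modernization is not None else normalization


# Data of example (3)
example_3: List[Row] = [
    ("Petrus", "Petrus", None),
    ("behilt", "behielt", "versteckte"),
    ("sich", "sich", None),
    ("vnder", "unter", "hinter"),
    ("eyme", "einem", None),
    ("steyne", "Stein", None),
]

normalized = " ".join(norm for _, norm, _ in example_3)
modernized = " ".join(modern_form(norm, mod) for _, norm, mod in example_3)

print(normalized)  # normalization layer only
print(modernized)  # normalization with modernizations applied
```

The first line printed stays close to the historical wording, while the second reflects the semantic adjustments marked ‘sem’ in the table.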
Modernization also applies when a historical wordform has no close modern cognate. In this case, a virtual historical wordform (as found in historical dictionaries) is used as the normalized form, and a suitable translation is used for modernization. This kind of adjustment is marked as ‘ext’ (for “extinct”).
(5) Me1 Melk, 349,14
original | normalization | modernization | type of adjustment | engl. translation |
---|---|---|---|---|
do | da | – | – | then |
chom | kam | – | – | came |
czuhant | zehant | sofort | ext | immediately |
Iudas | Judas | – | – | Judas |
In | in | – | – | in |
den | den | – | – | the |
Garten | Garten | – | – | garden |
‘Then Judas immediately came into the garden…’
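The examples suggest a simple consistency rule for annotated rows: the modernization and the type of adjustment are either both empty or both filled, and the type is one of ‘sem’, ‘infl’, or ‘ext’. The following Python sketch checks this; the `check_row` function is hypothetical, and the rule itself is inferred from the examples on this page rather than quoted from the full guidelines.

```python
from typing import List, Optional

# Types of adjustment named on this page
ALLOWED_TYPES = {"sem", "infl", "ext"}


def check_row(modernization: Optional[str], adjustment: Optional[str]) -> List[str]:
    """Return a list of problems found in one annotated row (empty if none)."""
    problems = []
    # Assumed rule: both cells are filled together or left empty together.
    if (modernization is None) != (adjustment is None):
        problems.append("modernization and type of adjustment must be filled together")
    if adjustment is not None and adjustment not in ALLOWED_TYPES:
        problems.append(f"unknown type of adjustment: {adjustment!r}")
    return problems


ok_ext = check_row("sofort", "ext")      # 'czuhant' in example (5)
ok_plain = check_row(None, None)         # 'chom' in example (5)
missing_type = check_row("sofort", None)
bad_type = check_row("sofort", "syn")

for result in (ok_ext, ok_plain, missing_type, bad_type):
    print(result)
```

A check of this kind could be run before annotated data is passed on to further processing.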
Data that has been manually normalized according to these guidelines can be used as training data for our automatic normalization tool Norma.