Reference Corpora for Middle and Early New High German - Overview

Overview

This website presents information about reference corpora for Middle High German and Early New High German.

In the early 2000s, a group of German historical linguists started an initiative to create a diachronic reference corpus of German. To pursue this goal, several related projects successfully applied for funding from the Deutsche Forschungsgemeinschaft (DFG):

REA — Reference corpus Old German (750–1050) (2009–2013)
Karin Donhauser (Berlin), Jost Gippert (Frankfurt), Rosemarie Lühr (Jena)
REM — Reference corpus Middle High German (1050–1350) (2010–2014)
Klaus-Peter Wegera (Bochum), Claudia Wich-Reif (Bonn), Stefanie Dipper (Bochum), Thomas Klein (Bonn)
REF — Reference corpus Early New High German (1350–1650) (2011–)
Hans-Joachim Solms (Halle), Ulrike Demske (Potsdam), Stefanie Dipper (Bochum), Klaus-Peter Wegera (Bochum)
REN — Reference corpus Middle Low German and Low Rhenish (1200–1650) (2012–)
Ingrid Schröder (Hamburg), Robert Peters (Münster)

To allow for diachronic investigations, all projects collaborate closely on developing common annotation standards. The entire corpus will eventually be available and searchable via the search tool ANNIS.

Selection of texts
The texts included in both REM and REF are balanced for time period, dialect region, genre, and verse vs. prose.
Transcriptions
The transcriptions in REM and REF are diplomatic, i.e. they stay as close to the original manuscripts as possible.
Annotations
Annotations in REM and REF cover three levels: lemma, morphology, and part of speech (POS). REM distinguishes between POS annotations that relate to the wordform as such (lemma-related) and POS annotations that relate to the wordform in its current use (instance-related). We also developed a tagset, HiTS, specifically suited for historical German.
Tools for the annotation
For the annotation process of REF, the web-based annotation tool CorA was developed. It supports annotation on multiple levels, editing of the primary data, and modification of token boundaries during the annotation process.
Searching the corpus / Availability
The REA subcorpus is already available via the search tool ANNIS, which allows searching for metadata, wordforms, annotations, and any combination of these. The REM subcorpus will be made available in the near future; we are currently working on a TEI export of the REM data. The REF corpus is still in the annotation process.
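
To make the distinction between lemma-related and instance-related POS annotation described under Annotations more concrete, the following is a minimal Python sketch of how a single annotated token might be modelled. It is not the projects' actual data format: the class `Token`, its field names, the helper `pos_mismatches`, the tag labels `ADJ` and `NOUN` (placeholders, not HiTS tags), the morphology strings, and the two sample tokens are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical layout, not the REM/REF format)."""
    wordform: str      # diplomatic transcription of the token
    lemma: str         # lemma annotation
    morphology: str    # morphological annotation (placeholder notation)
    pos_lemma: str     # POS of the wordform as such (lemma-related)
    pos_instance: str  # POS of the wordform in its current use (instance-related)


def pos_mismatches(tokens):
    """Return the tokens whose lemma-related and instance-related POS differ."""
    return [t for t in tokens if t.pos_lemma != t.pos_instance]


# Invented sample data: an adjective used as a noun, and an ordinary noun.
# The tags "ADJ" and "NOUN" are placeholders, not tags from the HiTS tagset.
tokens = [
    Token("guote", "guot", "Nom.Pl", pos_lemma="ADJ", pos_instance="NOUN"),
    Token("man", "man", "Nom.Sg", pos_lemma="NOUN", pos_instance="NOUN"),
]

for token in pos_mismatches(tokens):
    print(token.wordform, token.pos_lemma, "->", token.pos_instance)
```

Keeping both POS values on the same token, rather than in separate layers, is only one possible design; it makes it easy to filter for cases where the use of a wordform in context departs from its lexical category.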

The research reported on this website was supported by: