Corpus Study

In this strand of the research project, we test the hypothesis that implicitly acquired morphosyntactic and orthographic skills are an essential foundation of strong reading and writing proficiency and should be reflected in solid knowledge of the statistical properties of orthography. Accordingly, we predict that the spelling errors of more proficient writers are more strongly correlated with the orthographic properties of German words than those of less proficient writers.

To create a database for testing these hypotheses and predictions, we compiled the Litkey Corpus, a corpus of words that primary school children produced when writing about picture stories featuring two children, Lea and Lars, and a dog, Dodo. Each child contributed up to ten stories between grades 2 and 4. The texts had been collected in a previous research project, which aimed at evaluating a method of fostering the acquisition of German as a second language in regular classroom teaching (Frieg et al., 2013; Frieg, 2014).
For each child, we digitized each word form they had written, noting its actual spelling as well as the intended spelling. We annotated the intended word forms with respect to part of speech, specific orthographic features such as consonant doubling (<Kanne>, <Bett>) or vowel-lengthening <h> (<fahren>), and surface features such as n-gram frequencies, orthographic consistency, and frequency, thereby creating the Litkey Corpus (see Laarmann-Quante et al., 2019).
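As a rough illustration of one such surface feature, the following Python sketch counts character bigrams over a list of intended word forms and averages them per word. It is not the annotation pipeline used for the Litkey Corpus: the function names, the `#` boundary marker, and the three-word toy list (taken from the examples above) are assumptions made purely for illustration.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, with '#' marking word boundaries."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_frequencies(words, n=2):
    """Count how often each character n-gram occurs across a list of word forms."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word, n))
    return counts


def mean_ngram_frequency(word, counts, n=2):
    """Average the corpus counts of a word's n-grams (unseen n-grams count as 0)."""
    grams = char_ngrams(word, n)
    return sum(counts[g] for g in grams) / len(grams)


# Hypothetical toy word list; a real analysis would use a large reference corpus.
target_words = ["Kanne", "Bett", "fahren"]
bigram_counts = ngram_frequencies(target_words)
```

With realistic counts from a large reference corpus, `mean_ngram_frequency` would give one possible per-word summary of how typical a word's letter sequences are.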
In her PhD thesis, Ronja Laarmann-Quante combined corpus-linguistic methods, natural language processing, and machine learning: she trained random forests on a subset of the Litkey Corpus to predict misspelt words in sections of the corpus not used for training. Given the nature of the data, the overall prediction accuracy on both training and unseen data was good. The random forests picked up on a number of word features, such as spelling consistency, that predict how likely a given child is to misspell a word (Laarmann-Quante, 2021). Her data indicate that these influences may be modulated by the spelling proficiency of the child who produced a word, as we had hypothesized.
In complementary analyses, we investigate influences on when individual children acquire specific orthographic phenomena of German, such as double consonant spellings. Here, we take specific graphotactic patterns into account, which we predict to play an important role in acquiring orthography implicitly.


Laarmann-Quante, R., Ortmann, K., Ehlert, A., Masloch, S., Scholz, D., Belke, E., & Dipper, S. (2019). The Litkey Corpus: A richly annotated longitudinal corpus of German texts written by primary school children. Behavior Research Methods, 51, 1889–1918.