Version 2.1
===========

- Corrected a transcription parser issue that caused some punctus elevatus
  characters not to be recognized as punctuation.

- Corrected outdated metadata in file headers.

Version 2.0
===========

- Numerous corrections and improvements to token forms and various linguistic
  annotations, especially to norm, lemma, and char_align annotations.

  - At least 1,410,681 lines changed since ReM v1.0 (2.72% of the current
    51,950,706 source lines), not counting new files.

  - New texts added to the corpus:
    - M079, M253, M254, M255, M548, M550, M551, and M552.

  - Some texts were split and others merged:
    - M121Y1 and M121Y2 -> M121Y.
    - M541H1 and M541H2 -> M541H.
    - M544 -> M544A, M544B, M544C, and M544D.

  - Removed:
    - M057.


- New conversion software that is provided along with the corpus as free
  software (GPLv3).

  - Conversion software includes a completely new transcription parser that
    better handles character-level annotations (such as those marking unclear
    or supplied characters) and re-tokenization.

  - Available at: https://gitlab.rub.de/comphist/rem_convert


- New data formats:

  - Data are made available in TEI format.
    - Note that the "lemma" attribute on <w> contains the normalized
      representation of a token ("norm" on ANNIS), not the lemma.  The "norm"
      attribute is used for the simplified representation ("tok_anno" on
      ANNIS).

  - Data are also available in Tabular JSON format, which is practically
    identical to the data structures that the conversion program uses
    internally, so this format includes all available annotations.

  - Data are also available in GraphML format for use with ANNIS 4.
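
The TEI attribute naming noted above is easy to get wrong when reading the
files programmatically.  The following is a minimal sketch, not part of the
conversion software, that reads <w> elements and re-keys the attributes under
their ANNIS names.  The sample fragment, its attribute values, the use of the
TEI namespace, and the function name read_tokens are assumptions made for
illustration only; the actual files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment with placeholder values; not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <w lemma="normalized-1" norm="simplified-1">form-1</w>
    <w lemma="normalized-2" norm="simplified-2">form-2</w>
  </p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return one dict per <w> element, keyed by the ANNIS annotation names."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("{%s}w" % TEI_NS):
        tokens.append({
            # TEI "lemma" holds the normalized form ("norm" on ANNIS).
            "norm": w.get("lemma"),
            # TEI "norm" holds the simplified form ("tok_anno" on ANNIS).
            "tok_anno": w.get("norm"),
        })
    return tokens


tokens = read_tokens(SAMPLE)
```

Re-keying the attributes once, at load time, keeps later processing from
confusing the TEI attribute names with the ANNIS annotation names.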


- Improved PDF reading texts:

  - The diplomatic versions (*_dipl.pdf) now mark unclear transcriptions (set
    in italics) and illegible text that was supplied either from an edition or
    during transcription (set in square brackets).  Rubrications are set in
    red.


- Simplification of tokens:

  - Version 2.0 uses fewer heuristic simplifications, as these sometimes
    resulted in misleading token forms.  Instead, the transcriptions are
    simplified in such a way that they are more easily searchable, but with a
    minimum of information loss.


- Tokenization of punctuation:

  - Transcribed punctuation is split off from the rest of the token in which
    it occurs, whether it stands at the beginning or at the end of the token.

  - We include a new "join" property which reflects where such tokenization of
    punctuation has taken place, in addition to cases of univerbation and
    multiverbation.  Note that "join" thus overlaps to a degree with the
    "token_type" tags ("tokenization" annotations in ANNIS), but it is distinct
    insofar as it also covers the tokenization of punctuation.
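
The splitting of punctuation at token edges described above can be pictured
with the following minimal sketch.  It is not the parser shipped with the
conversion software: the punctuation inventory PUNCT and the function name
split_punctuation are hypothetical, each punctuation character is assumed to
become a token of its own, token-internal punctuation is left untouched, and
the "join" property is not modeled.

```python
# Hypothetical punctuation inventory, for illustration only.
PUNCT = set(".,;:!?")


def split_punctuation(token):
    """Split leading and trailing punctuation characters off a token."""
    start = 0
    end = len(token)
    # Advance past punctuation at the beginning of the token.
    while start < end and token[start] in PUNCT:
        start += 1
    # Step back over punctuation at the end of the token.
    while end > start and token[end - 1] in PUNCT:
        end -= 1
    leading = list(token[:start])
    trailing = list(token[end:])
    # The core is empty when the token consists only of punctuation.
    core = [token[start:end]] if start < end else []
    return leading + core + trailing


parts = split_punctuation(".abc,")
```

In this sketch, a token such as ".abc," yields three tokens, while a token
without punctuation at either edge is returned unchanged as a single token.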
