Version 1.2.0
2024-05-21
Each document is represented by a single JSON object. The documents in a corpus or subcorpus may be represented as lines in a JSON Lines file or as a collection of JSON files in a directory. The JSON object representing a document must have these 3 properties:
name | description |
---|---|
id | A unique identifier for the document. |
metadata | Metadata pertaining to a document’s origin, contents, etc. |
token | The main sequence of tokens that is annotated in a document. |
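As a minimal sketch of this structure, the following Python snippet builds one document with the three required properties and serializes it as a single JSON Lines record. Only the three top-level property names come from this specification; the document content and the helper name `missing_required` are invented for illustration.

```python
import json

# Hypothetical example document; the content is invented.
doc = {
    "id": "doc1",
    "metadata": {"annotations": {}},
    "token": [
        {"id": "t1", "form": "Ein"},
        {"id": "t2", "form": "Beispiel"},
    ],
}

REQUIRED = ("id", "metadata", "token")


def missing_required(document):
    """Return the required top-level properties absent from a document."""
    return [key for key in REQUIRED if key not in document]


# In a JSON Lines file, each document occupies exactly one line.
line = json.dumps(doc, ensure_ascii=False)
restored = json.loads(line)
```

A check like this is no substitute for validating against the JSON Schema, but it can catch the most basic structural problems early.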
A “table” is an array of JSON objects where each of the objects may only contain certain property names, such that the values of these properties can be represented as columns in a tabular data structure. The general idea is that the top-level JSON object, representing a document, collects a number of such tables, which contain the data relating to that document.
A document may contain any number of annotation layers, represented either by properties on token objects or by top-level properties with tables as values. All layers used in a given document should be declared in the document’s metadata (see Declaring annotations).
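One way to check that every layer in use has been declared is sketched below. It assumes that every top-level key other than `id`, `metadata`, and `token`, and every token property other than `id` and `form`, counts as an annotation layer; the function name `undeclared_layers` and the sample data are hypothetical.

```python
def undeclared_layers(doc):
    """List layer names used in a document but not declared in its metadata.

    Assumption: any key beyond the required ones is treated as a layer.
    """
    declared = set(doc["metadata"].get("annotations", {}))
    used = set(doc) - {"id", "metadata", "token"}
    for tok in doc["token"]:
        used |= set(tok) - {"id", "form"}
    return sorted(used - declared)


# Invented sample: "pos" is declared, "sentence" is not.
sample = {
    "id": "doc1",
    "metadata": {"annotations": {"pos": {"type": "property"}}},
    "token": [{"id": "t1", "form": "Hallo", "pos": "ITJ"}],
    "sentence": [{"id": "s1", "begin": 1, "end": 1}],
}
problems = undeclared_layers(sample)
```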
In the following, the names of required annotations are set in boldface text and optional annotations in italics. The names for all tables should be singular nouns.
`metadata`
A single JSON object, in which each property name represents a metadata key, and the corresponding value is the metadata value. Values should be strings, but other data types are permitted, if necessary.
There is one required property for the metadata object: it must contain an `annotations` property, which contains declarations for the types of annotations a particular document contains. The form of these declarations is explained below in Declaring annotations.
`token`
A table containing the main sequence of tokens for the document. Each object in this array must have the following properties:
name | example | description |
---|---|---|
id | t1_m2 | Token ID. Unique in document. |
form | eqūus | Token string as it appears in the document. |
All annotations refer to this sequence of tokens. References to the elements in this array must be integer indices beginning with 1.
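Because token references start at 1 while most programming languages index arrays from 0, code that reads these documents has to shift the index. A minimal sketch in Python, where the helper name `token_at` and the sample tokens are hypothetical:

```python
def token_at(tokens, ref):
    """Look up a token by its 1-based reference into the token table."""
    if not isinstance(ref, int) or not 1 <= ref <= len(tokens):
        raise IndexError(f"token reference out of range: {ref!r}")
    return tokens[ref - 1]  # shift to Python's 0-based indexing


# Invented sample data.
tokens = [
    {"id": "t1", "form": "der"},
    {"id": "t2", "form": "Hund"},
]
first = token_at(tokens, 1)
```

Rejecting 0 and negative values explicitly matters in Python, where `tokens[-1]` would otherwise silently return the last token.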
An annotation declaration may have one of two possible forms. In the first form, we declare the type of annotation for each key that can be found in a given document.
    "annotations": {
      "pos": {
        "type": "property",
        "description": "STTS tagset"
      }
    }
For this first form, only the `type` key is required.
name | description |
---|---|
type | One of the annotation types described below. |
description | Description for this annotation. |
In the second form, a key can be aliased to some other key. In this form, only the property `use` is allowed. In this example, the key `pos` is redirected to the key `pos_hits`:
    "annotations": {
      "pos": {"use": "pos_hits"}
    }
This form can be useful when annotations are to be stored under an informative custom key but one wishes to make them accessible via one of the standardized property keys (see also Standard key names below).
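A reader of such documents might resolve aliases along the following lines. This is only a sketch: it assumes that the alias target is itself declared in the same `annotations` object and that aliases may be chained, neither of which is stated in this specification. The function name `resolve_key` is hypothetical.

```python
def resolve_key(annotations, key):
    """Follow `use` aliases until a declaration without `use` is reached.

    Assumption: alias targets are declared in the same annotations object.
    """
    seen = set()
    while "use" in annotations[key]:
        if key in seen:
            raise ValueError(f"circular alias at {key!r}")
        seen.add(key)
        key = annotations[key]["use"]
    return key


# Invented declarations combining both forms.
annotations = {
    "pos": {"use": "pos_hits"},
    "pos_hits": {"type": "property", "description": "STTS tagset"},
}
target = resolve_key(annotations, "pos")
```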
Annotations in Tabular JSON documents can be of one of eight possible types: `property`, `token`, `object`, `relation`, `set`, `span`, `spanset`, or `hierset`. With these types, most kinds of linguistic annotations should be representable.
`property`
Annotations of the `property` type are included directly in the main `token` table as additional properties of tokens. This type of annotation is used for information that applies to single tokens, such as part-of-speech tags or alternative string representations.
`token`
A number of alternative tokenizations may be included for different purposes. Note that the presence of an alternative tokenization requires an additional “virtual token” ID on both the main `token` table and the table of alternative tokenizations. This is necessary in order to know how the various tokenizations are related. This table could in theory also have the same set of possible properties as the main `token` table.
The tables for this annotation type include the following properties:
name | example | description |
---|---|---|
id | 1 | Token reference. |
form | coref1 | ID for the token set. |
virttok | | |
`object`
Like `property` annotations, `object` annotations are included in the main `token` table directly. However, whereas `property` annotations may only be strings, `object` annotations can be any kind of JSON data. Since such data can be difficult to handle in most data analysis software, `object` annotations should be used sparingly, but they are included as a sort of escape hatch for those situations where users may need to store data that is not representable with the other data types. This data type could, for instance, be useful for storing sub-token-level annotations.
`relation`
Annotations of the `relation` type represent a directed edge between two entries in the main `token` table.
name | example | description |
---|---|---|
id | dep1 | Element ID. Unique in document. |
label | det | Label for the relation between the tokens. |
from | 2 | The source token. |
to | 1 | The target token. |
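To illustrate how a `relation` table is read against the token table, the sketch below resolves each edge to the forms of its source and target tokens. It treats `from` and `to` as 1-based token indices; the function name `relation_forms` and the sample document are hypothetical.

```python
def relation_forms(doc, layer):
    """Yield (label, source form, target form) for each relation entry."""
    tokens = doc["token"]
    for rel in doc[layer]:
        source = tokens[rel["from"] - 1]  # references are 1-based
        target = tokens[rel["to"] - 1]
        yield rel["label"], source["form"], target["form"]


# Invented sample: "dog" governs its determiner "the".
doc = {
    "token": [{"id": "t1", "form": "the"}, {"id": "t2", "form": "dog"}],
    "dependency": [{"id": "dep1", "label": "det", "from": 2, "to": 1}],
}
edges = list(relation_forms(doc, "dependency"))
```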
`span`
Spans use a pair of references to the `token` table, `begin` and `end`, to indicate the tokens at which they begin and end. Spans can be used for the representation of sentences, as well as elements relating to the layout of the text or the way a text is structured (headings, paragraphs, etc.).
name | example | description |
---|---|---|
id | line1 | Element ID. Unique in document. |
name | 01 | Name of element. |
begin | 1 | The token at which the span begins. |
end | 10 | The token at which the span ends. |
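Extracting the tokens covered by a span could look like the sketch below. It assumes that `end` is inclusive, which is how "the token at which the span ends" reads but is not stated outright. The function name `span_forms` and the sample data are hypothetical.

```python
def span_forms(tokens, span):
    """Return the token forms covered by a span.

    Assumption: `begin` and `end` are 1-based and `end` is inclusive.
    """
    return [tok["form"] for tok in tokens[span["begin"] - 1 : span["end"]]]


# Invented sample data.
tokens = [
    {"id": "t1", "form": "Das"},
    {"id": "t2", "form": "ist"},
    {"id": "t3", "form": "gut"},
    {"id": "t4", "form": "."},
]
sentence = {"id": "s1", "begin": 1, "end": 3}
covered = span_forms(tokens, sentence)
```

With 1-based inclusive bounds, the Python slice needs the shift only on `begin`, since slice ends are already exclusive.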
`set`
A token set annotation has a non-unique ID, which uses the name `set`: all of the annotations with a particular `set` value belong to the same annotation set.
These tables include the following properties:
name | example | description |
---|---|---|
token | 1 | Token reference. |
set | coref1 | ID for the token set. |
label | thing | Label for this set element. |
substring | Hochhaus | Annotated substring of the token. |
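Since set membership is expressed only through the shared `set` value, recovering the sets means grouping the rows by that value. A minimal sketch, with a hypothetical function name `group_by_set` and invented entries:

```python
from collections import defaultdict


def group_by_set(entries):
    """Map each set ID to the token references of its members."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["set"]].append(entry["token"])
    return dict(grouped)


# Invented sample table of the `set` type.
entries = [
    {"token": 1, "set": "coref1", "label": "thing"},
    {"token": 2, "set": "coref2", "label": "thing"},
    {"token": 4, "set": "coref1", "label": "thing"},
]
sets = group_by_set(entries)
```

The same grouping would apply to `spanset` tables, with a `(begin, end)` pair collected in place of the single token reference.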
`spanset`
Similarly to the `set` annotation type, a `spanset` annotation uses a non-unique ID, `set`, to identify all of the entries that belong to a given set. Instead of a single token reference, a `spanset` includes `begin` and `end` references for each entry.
name | example | description |
---|---|---|
set | met1 | ID for the token set. |
begin | 1 | The token at which the span begins. |
end | 2 | The token at which the span ends. |
label | MRW | Label for this set element. |
`hierset`
The `hierset` annotation type can be thought of as an extension of the `spanset` type: it includes many of the same properties. However, it differs in that each entry has a unique `id` property and a `parent` property, which is used to show which elements are above which others hierarchically. This could be useful for representing constituency trees or discourse structure.
name | example | description |
---|---|---|
id | c1 | ID for the token set. |
begin | 1 | The token at which the span begins. |
end | 2 | The token at which the span ends. |
label | NP | Label for this set element. |
parent | c3 | ID of the entry above this one. |
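The hierarchy can be recovered from the flat table by indexing entries under their `parent` value. The sketch below assumes that a root entry either omits `parent` or sets it to `null`; this specification does not say how roots are marked. The function name `children_by_parent` and the sample entries are hypothetical.

```python
def children_by_parent(entries):
    """Map each parent ID to the IDs of its direct children.

    Assumption: root entries have no `parent` (or a null one) and are
    collected under the key None.
    """
    children = {}
    for entry in entries:
        children.setdefault(entry.get("parent"), []).append(entry["id"])
    return children


# Invented sample: a sentence node "c3" dominating an NP and a VP.
entries = [
    {"id": "c1", "begin": 1, "end": 2, "label": "NP", "parent": "c3"},
    {"id": "c2", "begin": 3, "end": 3, "label": "VP", "parent": "c3"},
    {"id": "c3", "begin": 1, "end": 3, "label": "S"},
]
tree = children_by_parent(entries)
```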
In the interests of interoperability, the keys for some common kinds of linguistic annotations are standardized. These standardized annotation names and their types are as follows:
name | type | description |
---|---|---|
pos | property | Part of speech tag. |
lemma | property | Lemma of token. |
join | property | Used when a space was not present on either side of this token. |
sentence | span | Sentence boundaries. |
dependency | relation | Syntactic dependencies. |
constituency | hierset | Constituency parses. |
coreference | spanset | Coreference annotations. |
The JSON Schema for validating documents in this format is available at:
https://linguistics.rub.de/~roussel/tabular-json/schema