Tabular JSON Documents

Version 1.2.0

Adam Roussel

2024-05-21

Each document is represented by a single JSON object. The documents in a corpus or subcorpus may be represented as lines in a JSON Lines file or as a collection of JSON files in a directory. The JSON object representing a document must have these 3 properties:

name description
id A unique identifier for the document.
metadata Metadata pertaining to a document’s origin, contents, etc.
token The main sequence of tokens that is annotated in a document.
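
As an illustration of the JSON Lines representation, the following Python sketch reads one document per line and checks that the three required properties are present. The function name read_documents and the in-memory sample corpus are hypothetical and are not part of the format.

```python
import io
import json

# The three properties every document object must have.
REQUIRED_KEYS = ("id", "metadata", "token")


def read_documents(lines):
    """Yield one document (a dict) per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines (an assumption, not a rule of the format)
        doc = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in doc]
        if missing:
            raise ValueError(f"line {line_number}: missing properties {missing}")
        yield doc


# Hypothetical two-document corpus, held in memory instead of a file.
corpus = io.StringIO(
    '{"id": "doc1", "metadata": {"annotations": {}}, "token": []}\n'
    '{"id": "doc2", "metadata": {"annotations": {}}, "token": []}\n'
)
doc_ids = [doc["id"] for doc in read_documents(corpus)]
```

The same function should work on an open file handle, since it only iterates over lines.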

A “table” is an array of JSON objects where each of the objects may only contain certain property names, such that the values of these properties can be represented as columns in a tabular data structure. The general idea is that the top-level JSON object, representing a document, collects a number of such tables, which contain the data relating to that document.

A document may contain any number of annotation layers, represented either by properties on token objects or by top-level properties with tables as values. All layers used in a given document should be declared in the document’s metadata (see Declaring annotations).

In the following, the names of required annotations are set in boldface text and optional annotations in italics. The names for all tables should be singular nouns.1

Metadata (metadata)

A single JSON object, in which each property name represents a metadata key, and the corresponding value is the metadata value. Values should be strings, but other data types are permitted, if necessary.

There is one required property for the metadata object: It must contain an annotations property, which contains declarations for the types of annotations a particular document contains. The form of these declarations is explained below in Declaring annotations.

Tokens (token)

A table containing the main sequence of tokens for the document. Each object in this array must have the following properties:

name example description
id t1_m2 Token ID. Unique in document.
form eqūus Token string as it appears in the document.

All annotations refer to this sequence of tokens. References to the elements in this array must be integer indices beginning with 1.
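
Because token references start at 1 while most programming languages index arrays from 0, a small lookup helper can prevent off-by-one errors. The following Python sketch shows one way to do this; the helper name token_at and the sample tokens are hypothetical.

```python
# Hypothetical main token table with the two required properties.
tokens = [
    {"id": "t1", "form": "equus"},
    {"id": "t2", "form": "currit"},
]


def token_at(tokens, index):
    """Return the token for a 1-based reference into the main token table."""
    if not isinstance(index, int) or not 1 <= index <= len(tokens):
        raise IndexError(f"invalid token reference: {index!r}")
    return tokens[index - 1]  # convert the 1-based reference to a 0-based index


first_form = token_at(tokens, 1)["form"]
```

Rejecting 0 explicitly matters in Python, where a naive tokens[index - 1] would silently return the last token for a reference of 0.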

Annotations

Declaring annotations

An annotation declaration may have one of two possible forms. In the first form, we declare the type of annotation for each key that can be found in a given document.

"annotations": {
    "pos": {
        "type": "property",
        "description": "STTS tagset"
    }
}

For this first form, only the type key is required.

name description
type One of the annotation types described below.
description Description for this annotation.

In the second form, a key is aliased to another key. In this form, only the use property is allowed. In this example, the key pos is redirected to the key pos_hits:

"annotations": {
    "pos": {"use": "pos_hits"}
}

This form can be useful when annotations are to be stored under an informative custom key but one wishes to make them accessible via one of the standardized property keys (see also Standard key names below).
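
A consumer that wants to look up annotations by a standardized key has to follow such aliases. The Python sketch below shows one possible approach. It assumes that the alias target is itself declared in the same annotations object and that aliases may be chained; neither is stated by the format, and the function name resolve_key and the sample description are hypothetical.

```python
def resolve_key(annotations, key):
    """Follow "use" aliases and return (actual_key, declaration)."""
    seen = set()
    while True:
        if key in seen:
            raise ValueError(f"cyclic alias involving {key!r}")
        seen.add(key)
        declaration = annotations[key]  # KeyError if the key is not declared
        if "use" not in declaration:
            return key, declaration
        key = declaration["use"]  # second form: redirect to another key


# Hypothetical declarations: "pos" is an alias for the custom key "pos_hits".
annotations = {
    "pos": {"use": "pos_hits"},
    "pos_hits": {"type": "property", "description": "custom tagset"},
}
actual_key, declaration = resolve_key(annotations, "pos")
```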

Annotation types

Annotations in Tabular JSON documents can be of one of eight possible types: property, token, object, relation, set, span, spanset, or hierset. With these types, most kinds of linguistic annotations should be representable.

Properties of tokens (property)

Annotations of the property type are included directly in the main token table as additional properties of tokens. This type of annotation is used for information that applies to single tokens, such as part-of-speech tags or alternative string representations.
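
Since property annotations live on the token objects themselves, reading a layer amounts to reading one column of the token table. A minimal Python sketch, with hypothetical sample tokens and a hypothetical helper name column:

```python
# Hypothetical token table with a "pos" property annotation on some tokens.
tokens = [
    {"id": "t1", "form": "Das", "pos": "ART"},
    {"id": "t2", "form": "Pferd", "pos": "NN"},
    {"id": "t3", "form": "."},  # no "pos" value on this token
]


def column(tokens, key):
    """Return the values of one property annotation, None where it is absent."""
    return [token.get(key) for token in tokens]


pos_tags = column(tokens, "pos")
```

Whether a property may be absent on individual tokens is not specified here; the sketch simply tolerates it.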

Token-like objects (token)

A number of alternative tokenizations may be included for different purposes. Note that the presence of an alternative tokenization requires an additional “virtual token” ID on both the main token table and the table of alternative tokenizations. This is necessary in order to know how the various tokenizations are related. This table could in theory also have the same set of possible properties as the main token table.

The tables for this annotation type include the following properties:

name example description
id 1 Token reference.
form coref1 ID for the token set.
virttok “Virtual token” ID that relates this entry to the main token table.

Complex data (object)

Like property annotations, object annotations are included directly in the main token table. However, whereas property annotations may only be strings, object annotations can be any kind of JSON data. Since such data can be difficult to handle in most data analysis software, object annotations should be used sparingly; they are included as an escape hatch for situations where users need to store data that is not representable with the other annotation types. This type could, for instance, be useful for storing sub-token-level annotations.

Relations between tokens (relation)

Annotations of the relation type represent a directed edge between two entries in the main token table.

name example description
id dep1 Element ID. Unique in document.
label det Label for the relation between the tokens.
from 2 The source token.
to 1 The target token.
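
To make the direction of the edge concrete, the following Python sketch resolves the from and to references of a relation table against the main token table. The table name dependency follows the standard key names; the sample data and the helper name edges are hypothetical.

```python
# Hypothetical token table and relation table.
tokens = [
    {"id": "t1", "form": "the"},
    {"id": "t2", "form": "horse"},
]
dependency = [
    {"id": "dep1", "label": "det", "from": 2, "to": 1},
]


def edges(tokens, relations):
    """Return (source form, label, target form) for each directed edge."""
    return [
        (
            tokens[row["from"] - 1]["form"],  # references are 1-based
            row["label"],
            tokens[row["to"] - 1]["form"],
        )
        for row in relations
    ]


resolved = edges(tokens, dependency)
```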

Spans over tokens (span)

Spans use a pair of references to the token table, begin and end, to indicate the tokens at which they begin and end. Spans can be used for the representation of sentences, as well as elements relating to the layout of the text or the way a text is structured (headings, paragraphs, etc.).

name example description
id line1 Element ID. Unique in document.
name 01 Name of element.
begin 1 The token at which the span begins.
end 10 The token at which the span ends.
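
The Python sketch below extracts the tokens covered by a span. It assumes that end is inclusive, which the description "the token at which the span ends" suggests but does not state outright; the sample data and the helper name span_forms are hypothetical.

```python
# Hypothetical token table and span table.
tokens = [
    {"id": "t1", "form": "Veni"},
    {"id": "t2", "form": "."},
    {"id": "t3", "form": "Vidi"},
    {"id": "t4", "form": "."},
]
sentence = [
    {"id": "s1", "name": "01", "begin": 1, "end": 2},
    {"id": "s2", "name": "02", "begin": 3, "end": 4},
]


def span_forms(tokens, span):
    """Return the forms of all tokens from begin to end (assumed inclusive)."""
    # begin - 1 converts to a 0-based start; the slice end is exclusive,
    # so the 1-based inclusive end can be used unchanged.
    return [token["form"] for token in tokens[span["begin"] - 1 : span["end"]]]


sentences = [span_forms(tokens, span) for span in sentence]
```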

Token sets (set)

A token set annotation has a non-unique ID, stored in the set property: all annotations with the same set value belong to the same annotation set.

These tables include the following properties:

name example description
token 1 Token reference.
set coref1 ID for the token set.
label thing Label for this set element.
substring Hochhaus Annotated substring of the token.
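
Because membership in a set is expressed only through the shared set value, reconstructing the sets means grouping the rows by that value. A minimal Python sketch with hypothetical sample rows and a hypothetical helper name group_by_set:

```python
from collections import defaultdict

# Hypothetical set table: two sets, one with two members.
coreference_sets = [
    {"token": 1, "set": "coref1", "label": "thing"},
    {"token": 2, "set": "coref2", "label": "person"},
    {"token": 4, "set": "coref1", "label": "thing"},
]


def group_by_set(rows):
    """Map each set ID to the token references of its members, in table order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["set"]].append(row["token"])
    return dict(groups)


members = group_by_set(coreference_sets)
```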

Sets of spans over tokens (spanset)

Like the set annotation type, a spanset annotation uses a non-unique ID, set, to identify all of the entries that belong to a given set. Instead of a single token reference, a spanset includes begin and end references for each entry.

name example description
set met1 ID for the span set.
begin 1 The token at which the span begins.
end 2 The token at which the span ends.
label MRW Label for this set element.
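
The following Python sketch collects the text of every span belonging to each set. As in the span type, end is assumed to be inclusive; the sample data and the helper name spanset_texts are hypothetical.

```python
from collections import defaultdict

# Hypothetical token table and spanset table.
forms = ["time", "flies", "like", "an", "arrow"]
tokens = [{"id": f"t{i}", "form": form} for i, form in enumerate(forms, start=1)]
metaphor = [
    {"set": "met1", "begin": 2, "end": 2, "label": "MRW"},
    {"set": "met1", "begin": 4, "end": 5, "label": "MRW"},
]


def spanset_texts(tokens, rows):
    """Map each set ID to the space-joined text of its member spans."""
    texts = defaultdict(list)
    for row in rows:
        covered = tokens[row["begin"] - 1 : row["end"]]  # end assumed inclusive
        texts[row["set"]].append(" ".join(token["form"] for token in covered))
    return dict(texts)


span_texts = spanset_texts(tokens, metaphor)
```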

Hierarchical sets of spans over tokens (hierset)

The hierset annotation type can be thought of as an extension of the spanset type: it includes many of the same properties. However, it differs in that each entry has a unique id property and a parent property, which indicates the entry directly above it in the hierarchy. This could be useful for representing constituency trees or discourse structure.

name example description
id c1 Element ID. Unique in document.
begin 1 The token at which the span begins.
end 2 The token at which the span ends.
label NP Label for this set element.
parent c3 ID of the entry above this one.
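
The hierarchy can be recovered by following parent references upward. The Python sketch below computes the depth of every entry. It assumes that a root entry either lacks the parent property or has it set to null, which is not specified here; the sample rows and the helper name depths are hypothetical.

```python
# Hypothetical hierset table: c3 is the root, c4 is nested two levels down.
constituency = [
    {"id": "c1", "begin": 1, "end": 2, "label": "NP", "parent": "c3"},
    {"id": "c2", "begin": 3, "end": 5, "label": "VP", "parent": "c3"},
    {"id": "c3", "begin": 1, "end": 5, "label": "S"},
    {"id": "c4", "begin": 4, "end": 5, "label": "NP", "parent": "c2"},
]


def depths(rows):
    """Map each entry ID to its distance from the root (root = 0)."""
    parent = {row["id"]: row.get("parent") for row in rows}
    result = {}
    for node_id in parent:
        depth, current = 0, node_id
        while parent[current] is not None:  # KeyError if a parent ID is unknown
            current = parent[current]
            depth += 1
            if depth > len(parent):
                raise ValueError("cycle in parent references")
        result[node_id] = depth
    return result


entry_depths = depths(constituency)
```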

Standard key names

In the interests of interoperability, the keys for some common kinds of linguistic annotations are standardized. These standardized annotation names and their types are as follows:

name type description
pos property Part of speech tag.
lemma property Lemma of token.
join property Used when a space was not present on either side of this token.
sentence span Sentence boundaries.
dependency relation Syntactic dependencies.
constituency hierset Constituency parses.
coreference spanset Coreference annotations.
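
A tool could use the table above to flag declarations in which a standardized key is declared with a different type. The Python sketch below is one possible check. Whether such a mismatch should count as an error is not stated here, and the names STANDARD_TYPES and nonstandard_declarations are hypothetical.

```python
# Standardized keys and their annotation types, copied from the table above.
STANDARD_TYPES = {
    "pos": "property",
    "lemma": "property",
    "join": "property",
    "sentence": "span",
    "dependency": "relation",
    "constituency": "hierset",
    "coreference": "spanset",
}


def nonstandard_declarations(annotations):
    """Return (key, declared type, standard type) for each mismatch."""
    problems = []
    for key, declaration in annotations.items():
        expected = STANDARD_TYPES.get(key)
        if expected is None or "use" in declaration:
            continue  # custom keys and aliases are not checked in this sketch
        if declaration.get("type") != expected:
            problems.append((key, declaration.get("type"), expected))
    return problems


# Hypothetical declarations: "sentence" is declared with an unexpected type.
declared = {
    "pos": {"use": "pos_hits"},
    "pos_hits": {"type": "property"},
    "sentence": {"type": "spanset"},
    "dependency": {"type": "relation"},
}
mismatches = nonstandard_declarations(declared)
```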

JSON Schema

The JSON Schema for validating documents in this format is available at:

https://linguistics.rub.de/~roussel/tabular-json/schema


  1. Cf. https://stackoverflow.com/q/338156.