Tabular JSON Documents

Version 1.2.0

Adam Roussel

2024-05-21

Each document is represented by a single JSON object. The documents in a corpus or subcorpus may be represented as lines in a JSON Lines file or as a collection of JSON files in a directory. The JSON object representing a document must have these 3 properties:

name	description
id	A unique identifier for the document.
metadata	Metadata pertaining to a document’s origin, contents, etc.
token	The main sequence of tokens that is annotated in a document.
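
As a rough illustration of the layout described above, the following Python sketch reads documents from JSON Lines text and checks that each one has the three required properties. The function name `read_documents` and the sample data are assumptions made for this example; they are not part of the format.

```python
import json

# The three top-level properties every document must have.
REQUIRED = ("id", "metadata", "token")


def read_documents(lines):
    """Yield one document (a dict) per non-empty JSON Lines entry."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        doc = json.loads(line)
        missing = [key for key in REQUIRED if key not in doc]
        if missing:
            raise ValueError(f"line {number}: missing {missing}")
        yield doc


# Hypothetical one-document corpus, given as a list of lines.
sample = [
    '{"id": "doc1", "metadata": {"annotations": {}}, '
    '"token": [{"id": "t1", "form": "equus"}]}'
]
docs = list(read_documents(sample))
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines.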

A “table” is an array of JSON objects where each of the objects may only contain certain property names, such that the values of these properties can be represented as columns in a tabular data structure. The general idea is that the top-level JSON object, representing a document, collects a number of such tables, which contain the data relating to that document.
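
To make the notion of a table more concrete, here is a minimal Python sketch that turns an array of row objects into named columns. The helper `to_columns` and the sample rows are hypothetical, and filling absent properties with `None` is an assumption of this example rather than a rule of the format.

```python
def to_columns(table):
    """Turn a list of row objects into a dict of equal-length columns."""
    names = []
    for row in table:
        for name in row:
            if name not in names:
                names.append(name)
    # Rows that lack a property get None in that column (assumption).
    return {name: [row.get(name) for row in table] for name in names}


table = [
    {"id": "t1", "form": "Das", "pos": "ART"},
    {"id": "t2", "form": "Hochhaus"},
]
columns = to_columns(table)
```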

A document may contain any number of annotation layers, represented either by properties on token objects or by top-level properties with tables as values. All layers used in a given document should be declared in the document’s metadata (see Declaring annotations).

In the following, the names of required annotations are set in boldface text and optional annotations in italics. The names for all tables should be singular nouns.¹

Metadata (`metadata`)

A single JSON object, in which each property name represents a metadata key, and the corresponding value is the metadata value. Values should be strings, but other data types are permitted, if necessary.

The metadata object has one required property: annotations, which contains declarations for the types of annotations that the document contains. The form of these declarations is explained below in Declaring annotations.

Tokens (`token`)

A table containing the main sequence of tokens for the document. Each object in this array must have the following properties:

name	example	description
id	`t1_m2`	Token ID. Unique in document.
form	`eqūus`	Token string as it appears in the document.

All annotations refer to this sequence of tokens. References to the elements in this array must be integer indices beginning with 1.
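
Because token references start at 1 while Python lists start at 0, a lookup has to subtract one. The following sketch shows one way to do this; the helper name `get_token` and the sample document are assumptions made for illustration.

```python
def get_token(doc, ref):
    """Return the token for a 1-based reference into doc["token"]."""
    tokens = doc["token"]
    if not 1 <= ref <= len(tokens):
        raise IndexError(f"token reference {ref} out of range")
    return tokens[ref - 1]


# Hypothetical document with two tokens.
doc = {"token": [{"id": "t1", "form": "Das"}, {"id": "t2", "form": "Hochhaus"}]}
first = get_token(doc, 1)
```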

Annotations

Declaring annotations

An annotation declaration may have one of two possible forms. In the first form, we declare the type of annotation for each key that can be found in a given document.

"annotations": {
    "pos": {
        "type": "property",
        "description": "STTS tagset"
    }
}

For this first form, only the type key is required.

name	description
type	One of the annotation types described below.
description	Description for this annotation.

In the second form, a key can be aliased to some other key. In this form, only the property use is allowed. In this example, the key pos is redirected to the key pos_hits:

"annotations": {
    "pos": {"use": "pos_hits"}
}

This form can be useful when annotations are to be stored under an informative custom key but one wishes to make them accessible via one of the standardized property keys (see also Standard key names below).
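
A reader of the format might resolve such aliases roughly as in the Python sketch below. The function `resolve_key` is hypothetical. Two further assumptions go beyond the text: the target key is itself declared in annotations, and aliases may be chained, which is why the loop guards against cycles.

```python
def resolve_key(annotations, key):
    """Follow "use" aliases until a non-alias declaration is reached."""
    seen = set()
    while "use" in annotations[key]:
        if key in seen:
            raise ValueError(f"circular alias at {key!r}")
        seen.add(key)
        key = annotations[key]["use"]
    return key


# Hypothetical declarations: "pos" is an alias for the custom key "pos_hits".
annotations = {
    "pos": {"use": "pos_hits"},
    "pos_hits": {"type": "property", "description": "custom tagset"},
}
target = resolve_key(annotations, "pos")
```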

Annotation types

Annotations in Tabular JSON documents can be of one of eight possible types: property, token, object, relation, set, span, spanset, or hierset. With these types, most kinds of linguistic annotations should be representable.

Properties of tokens (`property`)

Annotations of the property type are included directly in the main token table as additional properties of tokens. This type of annotation is used for information that applies to single tokens, such as part-of-speech tags or alternative string representations.

Token-like objects (`token`)

A number of alternative tokenizations may be included for different purposes. Note that the presence of an alternative tokenization requires an additional “virtual token” ID on both the main token table and the table of alternative tokenizations. This is necessary in order to know how the various tokenizations are related. This table could in theory also have the same set of possible properties as the main token table.

The tables for this annotation type include the following properties:

name	example	description
id	1	Token reference.
form	`coref1`	Token string in this tokenization.
virttok		Virtual token ID relating this entry to the main token table.

Complex data (`object`)

Like property annotations, object annotations are included directly in the main token table. However, whereas property annotations may only be strings, object annotations can be any kind of JSON data. Since such data can be difficult to handle in most data analysis software, object annotations should be used sparingly. They are included as an escape hatch for situations where users need to store data that is not representable with the other annotation types. This type could, for instance, be useful for storing sub-token-level annotations.

Relations between tokens (`relation`)

Annotations of the relation type represent a directed edge between two entries in the main token table.

name	example	description
id	`"dep1"`	Element ID. Unique in document.
label	`det`	Label for the relation between the tokens.
from	2	The source token.
to	1	The target token.
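
The sketch below shows how the from and to references of a relation table could be resolved against the token table. The function `relation_edges`, the top-level key `dependency` used as the table name, and the sample tokens are assumptions for this example.

```python
def relation_edges(doc, key):
    """Return (source form, label, target form) for each relation entry."""
    tokens = doc["token"]
    edges = []
    for rel in doc[key]:
        source = tokens[rel["from"] - 1]  # references are 1-based
        target = tokens[rel["to"] - 1]
        edges.append((source["form"], rel["label"], target["form"]))
    return edges


# Hypothetical document with one relation entry.
doc = {
    "token": [{"id": "t1", "form": "the"}, {"id": "t2", "form": "horse"}],
    "dependency": [{"id": "dep1", "label": "det", "from": 2, "to": 1}],
}
edges = relation_edges(doc, "dependency")
```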

Spans over tokens (`span`)

Spans use a pair of references to the token table, begin and end, to indicate the tokens at which they begin and end. Spans can be used for the representation of sentences, as well as elements relating to the layout of the text or the way a text is structured (headings, paragraphs, etc.).

name	example	description
id	`line1`	Element ID. Unique in document.
name	`01`	Name of element.
begin	1	The token at which the span begins.
end	10	The token at which the span ends.
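
As a minimal sketch, the tokens covered by a span can be collected with a list slice. The helper `span_forms` and the sample data are hypothetical, and the code assumes that end is inclusive, as the description "the token at which the span ends" suggests.

```python
def span_forms(doc, span):
    """Return the forms of all tokens from begin to end, both inclusive."""
    # begin is 1-based, so subtract one; end is inclusive, so use it as-is.
    covered = doc["token"][span["begin"] - 1 : span["end"]]
    return [token["form"] for token in covered]


# Hypothetical document with four tokens and one span.
doc = {
    "token": [{"id": f"t{i}", "form": form}
              for i, form in enumerate(["A", "B", "C", "D"], start=1)],
    "sentence": [{"id": "s1", "name": "01", "begin": 2, "end": 3}],
}
forms = span_forms(doc, doc["sentence"][0])
```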

Token sets (`set`)

A token set annotation has a non-unique ID stored under the property name set: all entries with the same set value belong to the same annotation set.

These tables include the following properties:

name	example	description
token	1	Token reference.
set	`coref1`	ID for the token set.
label	`thing`	Label for this set element.
substring	`Hochhaus`	Annotated substring of the token.
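
Since the set value is shared by all members of a set, grouping the entries by that value recovers the sets. The following Python sketch illustrates this; the function `group_sets` and the sample table are assumptions made for this example.

```python
from collections import defaultdict


def group_sets(table):
    """Map each set ID to the token references of its members."""
    groups = defaultdict(list)
    for entry in table:
        groups[entry["set"]].append(entry["token"])
    return dict(groups)


# Hypothetical set table with two sets.
table = [
    {"token": 1, "set": "coref1", "label": "thing"},
    {"token": 4, "set": "coref1", "label": "thing"},
    {"token": 2, "set": "coref2", "label": "thing"},
]
groups = group_sets(table)
```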

Sets of spans over tokens (`spanset`)

Like the set annotation type, a spanset annotation uses a non-unique ID, set, to identify all of the entries that belong to a given set. Instead of a single token reference, a spanset includes begin and end references for each entry.

name	example	description
set	`met1`	ID for the span set.
begin	1	The token at which the span begins.
end	2	The token at which the span ends.
label	`MRW`	Label for this set element.

Hierarchical sets of spans over tokens (`hierset`)

The hierset annotation type can be thought of as an extension of the spanset type: It includes many of the same properties. However, it differs in that each entry has a unique ID property and a parent property, which indicates the entry directly above it in the hierarchy. This could be useful for representing constituency trees or discourse structure.

name	example	description
id	`c1`	Element ID. Unique in document.
begin	1	The token at which the span begins.
end	2	The token at which the span ends.
label	`NP`	Label for this set element.
parent	`c3`	ID of the entry above this one.
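
The parent references can be turned into a lookup of children per node, as in the sketch below. The function `children_by_parent` and the sample table are hypothetical. The code also assumes that a root entry simply has no parent property, which the text above does not state.

```python
from collections import defaultdict


def children_by_parent(table):
    """Map each parent ID to the IDs of its direct children."""
    children = defaultdict(list)
    for entry in table:
        # Entries without a parent are collected under None (assumption).
        children[entry.get("parent")].append(entry["id"])
    return dict(children)


# Hypothetical hierset table: c3 is the root of c1 and c2.
table = [
    {"id": "c3", "begin": 1, "end": 3, "label": "S"},
    {"id": "c1", "begin": 1, "end": 2, "label": "NP", "parent": "c3"},
    {"id": "c2", "begin": 3, "end": 3, "label": "VP", "parent": "c3"},
]
tree = children_by_parent(table)
```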

Standard key names

In the interests of interoperability, the keys for some common kinds of linguistic annotations are standardized. These standardized annotation names and their types are as follows:

name	type	description
pos	`property`	Part of speech tag.
lemma	`property`	Lemma of token.
join	`property`	Used when a space was not present on either side of this token.
sentence	`span`	Sentence boundaries.
dependency	`relation`	Syntactic dependencies.
constituency	`hierset`	Constituency parses.
coreference	`spanset`	Coreference annotations.

JSON Schema

The JSON Schema for validating documents in this format is available at:

https://linguistics.rub.de/~roussel/tabular-json/schema

¹ Cf. https://stackoverflow.com/q/338156.
