edsnlp.data.json

read_json

The JsonReader (or edsnlp.data.read_json) reads a directory of JSON files and yields documents. At the moment, only entities and attributes are loaded.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_json("path/to/json/dir", converter="omop")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_json returns a Stream. To iterate over the documents multiple times efficiently, or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_json("path/to/json/dir", converter="omop"))
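
The difference between a lazily evaluated source and a list can be illustrated without edsnlp. The sketch below uses a plain Python generator as a stand-in for the stream. The name fake_stream is hypothetical and not part of edsnlp, and a real Stream may behave differently (for instance, it may be re-iterable at the cost of reading the files again).

```python
from typing import Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a lazily evaluated document stream."""
    for i in range(3):
        yield f"doc-{i}"


lazy = fake_stream()
first_pass = list(lazy)   # consumes the generator
second_pass = list(lazy)  # nothing left: a plain generator is single-pass

docs = list(fake_stream())  # materialize the documents once
by_index = docs[1]          # lists support access by index
again = list(docs)          # and cheap repeated iteration
```

Materializing the documents trades memory for convenience, so it is best suited to collections that fit in memory.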

Parameters

PARAMETER DESCRIPTION
path

Path to the directory containing the JSON files (will recursively look for files in subdirectories).

TYPE: Union[str, Path]

keep_ipynb_checkpoints

Whether to keep files that have ".ipynb_checkpoints" in their path.

TYPE: bool DEFAULT: False

filesystem

The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[Union[FileSystem, str]] DEFAULT: None

shuffle

Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before starting iterating on it (at the start of every epoch if looping).

TYPE: Literal['dataset', False] DEFAULT: False

seed

The seed to use for shuffling.

TYPE: int DEFAULT: 42

loop

Whether to loop over the data indefinitely.

TYPE: bool DEFAULT: False

converter

Converters to use to convert the JSON objects to Doc objects. These are documented on the Converters page.

TYPE: Optional[AsList[Union[str, Callable]]] DEFAULT: None

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Converters page.

DEFAULT: {}

RETURNS DESCRIPTION
Stream
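
To make the path, keep_ipynb_checkpoints, shuffle and seed parameters more concrete, here is a minimal sketch of the file discovery they describe, written with the standard library only. The helper list_json_files is hypothetical and is not edsnlp's implementation; it only assumes that files end in .json and that "shuffling the dataset" means shuffling the list of files with a seeded random generator.

```python
import random
from pathlib import Path
from typing import List


def list_json_files(
    root: Path,
    keep_ipynb_checkpoints: bool = False,
    shuffle: bool = False,
    seed: int = 42,
) -> List[Path]:
    """Hypothetical helper, not edsnlp's implementation."""
    # Recursive lookup: rglob descends into every subdirectory.
    files = sorted(root.rglob("*.json"))
    if not keep_ipynb_checkpoints:
        # Drop files that have ".ipynb_checkpoints" anywhere in their path.
        files = [f for f in files if ".ipynb_checkpoints" not in str(f)]
    if shuffle:
        # A dedicated Random instance keeps the order reproducible per seed.
        random.Random(seed).shuffle(files)
    return files
```

Sorting before shuffling makes the result depend only on the seed and the set of files, not on the order in which the filesystem lists them.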

write_json

edsnlp.data.write_json writes a list of documents in JSON format, either to a single JSONL file or to a directory. If lines is false, each document will be stored in its own JSON file, named after the FILENAME field returned by the converter (commonly the note_id attribute of the documents), and subdirectories will be created if the name contains / characters.

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)

doc = nlp("My document with entities")

edsnlp.data.write_json([doc], "path/to/json/file", converter="omop", lines=True)
# or to write a directory of JSON files, ensure that each doc has a doc._.note_id
# attribute, since this will be used as a filename:
edsnlp.data.write_json([doc], "path/to/json/dir", converter="omop", lines=False)
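
The two output layouts can be sketched with the standard library alone. The helper write_records below is hypothetical and is not edsnlp's implementation. It assumes that the converter has already turned each document into a dictionary, that the filename comes from a "note_id" key, and that a .json extension is appended; all three are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Dict, Iterable


def write_records(records: Iterable[Dict], path: Path, lines: bool) -> None:
    """Hypothetical helper, not edsnlp's implementation."""
    if lines:
        # JSONL: a single file with one JSON object per line.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
    else:
        # One file per record, named after the assumed "note_id" key.
        for record in records:
            target = path / f"{record['note_id']}.json"
            # A "/" in the name ends up as a subdirectory.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(json.dumps(record), encoding="utf-8")
```

With this sketch, a record whose note_id is "sub/b" would be written to sub/b.json under the target directory, which mirrors the subdirectory behavior described above.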

Overwriting files

By default, write_json will raise an error if the target already exists and contains JSON files. This is to avoid overwriting existing annotations. To allow overwriting existing files, use overwrite=True.

Parameters

PARAMETER DESCRIPTION
data

The data to write (either a list of documents or a Stream).

TYPE: Union[Any, Stream]

path

Path to either:
- a file if lines is true: this will write the documents as a JSONL file
- a directory if lines is false: this will write one JSON file per document, using the FILENAME field returned by the converter (commonly the note_id attribute of the documents) as the filename.

TYPE: Union[str, Path]

lines

Whether to write the documents as a JSONL file or as a directory of JSON files. By default, this is inferred from the path: if the path is a file, lines is assumed to be true, otherwise it is assumed to be false.

TYPE: bool DEFAULT: None

overwrite

Whether to overwrite existing directories.

TYPE: bool DEFAULT: False

execute

Whether to execute the writing operation immediately or to return a stream.

TYPE: bool DEFAULT: True

converter

Converter to use to convert the documents to dictionary objects before writing them. These are documented on the Converters page.

TYPE: Optional[Union[str, Callable]] DEFAULT: None

filesystem

The filesystem to use to write the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[Union[FileSystem, str]] DEFAULT: None

kwargs

Additional keyword arguments to pass to the converter. These are documented on the Converters page.

DEFAULT: {}