
edsnlp.data.conll

parse_conll [source]

Load a .conll file and return a dictionary with the text, words, and entities. The file is expected to contain multiple sentences, one word per line, with sentences separated by an empty line.

If possible, looks for a #global.columns comment at the start of the file to extract the column names.

Examples:

...
11  jeune   jeune   ADJ     _       Number=Sing     12      amod    _       _
12  fille   fille   NOUN    _       Gender=Fem|Number=Sing  5       obj     _       _
13  qui     qui     PRON    _       PronType=Rel    14      nsubj   _       _
...
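To make the column layout concrete, here is a minimal sketch of how such a tab-separated CoNLL-U line can be split into a word dictionary. This is an illustration using the standard CoNLL-U column names, not the actual edsnlp implementation:

```python
# Illustrative sketch (not edsnlp's code): parse one CoNLL sentence block
# into a list of word dicts keyed by the standard CoNLL-U column names.

CONLLU_COLS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]

def parse_sentence(block: str, cols=CONLLU_COLS):
    """Split one tab-separated CoNLL sentence block into word dicts."""
    words = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip comments such as "# global.columns = ..."
        fields = line.split("\t")
        words.append(dict(zip(cols, fields)))
    return words

sentence = "13\tqui\tqui\tPRON\t_\tPronType=Rel\t14\tnsubj\t_\t_"
words = parse_sentence(sentence)
print(words[0]["FORM"])  # qui
```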

Parameters

PARAMETER DESCRIPTION
path

Path or glob path of the CoNLL file

TYPE: str

cols

List of column names to use. If None, the first line of the file will be used

TYPE: Optional[List[str]] DEFAULT: None

fs

Filesystem to use

TYPE: FileSystem DEFAULT: LOCAL_FS

RETURNS DESCRIPTION
Iterator[Dict]

An iterator yielding one dictionary per parsed document.

read_conll

The ConllReader (or edsnlp.data.read_conll) reads a file or directory of CoNLL files and yields documents.

The raw output (i.e., by setting converter=None) will be in the following form for a single doc:

{
    "words": [
        {"ID": "1", "FORM": ...},
        ...
    ],
}
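As a sketch of what can be done with this raw form, the surface text can be rebuilt from the word dicts by honoring the standard CoNLL-U `SpaceAfter=No` convention in the MISC column. The helper and the sample data below are hypothetical illustrations, not part of the edsnlp API:

```python
# Illustrative sketch (not edsnlp's code): rebuild the surface text from
# raw word dicts. "SpaceAfter=No" in MISC means no space follows the token.

def detokenize(words):
    text = ""
    for word in words:
        text += word["FORM"]
        if "SpaceAfter=No" not in word.get("MISC", "_"):
            text += " "
    return text.rstrip()

doc = {
    "words": [
        {"ID": "1", "FORM": "Bonjour", "MISC": "_"},
        {"ID": "2", "FORM": "tout", "MISC": "_"},
        {"ID": "3", "FORM": "le", "MISC": "_"},
        {"ID": "4", "FORM": "monde", "MISC": "SpaceAfter=No"},
        {"ID": "5", "FORM": "!", "MISC": "_"},
    ],
}
print(detokenize(doc["words"]))  # Bonjour tout le monde!
```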

Example

import edsnlp

nlp = edsnlp.blank("eds")
nlp.add_pipe(...)
doc_iterator = edsnlp.data.read_conll("path/to/conll/file/or/directory")
annotated_docs = nlp.pipe(doc_iterator)

Generator vs list

edsnlp.data.read_conll returns a Stream. To iterate over the documents multiple times efficiently, or to access them by index, you must convert it to a list:

docs = list(edsnlp.data.read_conll("path/to/conll/file/or/directory"))

Parameters

PARAMETER DESCRIPTION
path

Path to the directory containing the CoNLL files (will recursively look for files in subdirectories).

TYPE: Union[str, Path]

columns

List of column names to use. If None, the reader will look for a #global.columns comment at the start of the file to extract the column names.

TYPE: Optional[List[str]] DEFAULT: None

shuffle

Whether to shuffle the data. If "dataset", the whole dataset will be shuffled before starting iterating on it (at the start of every epoch if looping).

TYPE: Literal['dataset', False] DEFAULT: False

seed

The seed to use for shuffling.

TYPE: Optional[int] DEFAULT: None

loop

Whether to loop over the data indefinitely.

TYPE: bool DEFAULT: False

nlp

The pipeline object (optional and likely not needed; prefer to pass the tokenizer argument directly instead).

TYPE: Optional[PipelineProtocol]

tokenizer

The tokenizer instance used to tokenize the documents. Likely not needed since, by default, it uses the current context tokenizer:

  • the tokenizer of the next pipeline run by .map_pipeline in a Stream.
  • or the eds tokenizer by default.

TYPE: Optional[Tokenizer]

converter

Converter to use to convert the documents to dictionary objects.

TYPE: Optional[AsList[Union[str, Callable]]] DEFAULT: ['conll']

filesystem

The filesystem to use to read the files. If None, the filesystem will be inferred from the path (e.g. s3:// will use S3).

TYPE: Optional[Union[FileSystem, str]] DEFAULT: None