edsnlp.train

Pipeline

Bases: Validated

New pipeline to use as a drop-in replacement for spaCy's pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.

See the documentation for more details.

Parameters

PARAMETER DESCRIPTION
lang

Language code

TYPE: str

create_tokenizer

Function that creates a tokenizer for the pipeline

TYPE: Optional[Callable[[Self], Tokenizer]] DEFAULT: None

vocab

Whether to create a new vocab or use an existing one

TYPE: Union[bool, Vocab] DEFAULT: True

vocab_config

Configuration for the vocab

TYPE: Optional[Type[BaseDefaults]] DEFAULT: None

meta

Meta information about the pipeline

TYPE: Dict[str, Any] DEFAULT: None

disabled property

The names of the disabled components

cfg property

Returns the config of the pipeline, including the config of all components. Updated from spaCy to allow references between components.

get_pipe

Get a component by its name.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get.

TYPE: str

RETURNS DESCRIPTION
Pipe

has_pipe

Check if a component exists in the pipeline.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to check.

TYPE: str

RETURNS DESCRIPTION
bool

create_pipe

Create a component from a factory name.

Parameters

PARAMETER DESCRIPTION
factory

The name of the factory to use

TYPE: str

name

The name of the component

TYPE: str

config

The config to pass to the factory

TYPE: Dict[str, Any] DEFAULT: None

RETURNS DESCRIPTION
Pipe

add_pipe

Add a component to the pipeline.

Parameters

PARAMETER DESCRIPTION
factory

The name of the component to add or the component itself

TYPE: Union[str, Pipe]

name

The name of the component. If not provided, the name of the component will be used if it has one (.name), otherwise the factory name will be used.

TYPE: Optional[str] DEFAULT: None

first

Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with before and after.

TYPE: bool DEFAULT: False

before

The name of the component to add the new component before. This argument is mutually exclusive with after and first.

TYPE: Optional[str] DEFAULT: None

after

The name of the component to add the new component after. This argument is mutually exclusive with before and first.

TYPE: Optional[str] DEFAULT: None

config

The arguments to pass to the component factory.

Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
Pipe

The component that was added to the pipeline.

get_pipe_meta

Get the meta information for a component.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get the meta for.

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

make_doc

Create a Doc from text.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from.

TYPE: str

RETURNS DESCRIPTION
Doc

__call__

Apply each component successively on a document.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from, or a Doc.

TYPE: Union[str, Doc]

RETURNS DESCRIPTION
Doc

pipe

Process a stream of documents by applying each component successively on batches of documents.

Parameters

PARAMETER DESCRIPTION
inputs

The inputs to create the Docs from, or Docs directly.

TYPE: Union[Iterable, Stream]

n_process

Deprecated. Use the ".set_processing(num_cpu_workers=n_process)" method on the returned data stream instead. The number of parallel workers to use. If 0, the operations will be executed sequentially.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
Stream

cache

Enable caching for all (trainable) components in the pipeline

torch_components

Yields components that are PyTorch modules.

Parameters

PARAMETER DESCRIPTION
disable

The names of disabled components, which will be skipped.

TYPE: Container[str] DEFAULT: ()

RETURNS DESCRIPTION
Iterable[Tuple[str, TorchComponent]]

connected_pipes_names

Returns a list of lists of connected components in the pipeline, i.e. components that share at least one parameter.

RETURNS DESCRIPTION
List[List[str]]

post_init

Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.

Parameters

PARAMETER DESCRIPTION
data

The documents to use for initialization. Each component will not necessarily see all the data.

TYPE: Iterable[Doc]

exclude

Components to exclude from post initialization on data

TYPE: Optional[Set] DEFAULT: None

from_config classmethod

Create a pipeline from a config object

Parameters

PARAMETER DESCRIPTION
config

The config to use

TYPE: Union[Dict[str, Any], Pipeline] DEFAULT: {}

vocab

The spaCy vocab to use. If True, a new vocab will be created

TYPE: Union[Vocab, bool] DEFAULT: True

disable

Components to disable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

enable

Components to enable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

exclude

Components to exclude

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

meta

Metadata to add to the pipeline

TYPE: Dict[str, Any] DEFAULT: FrozenDict()

RETURNS DESCRIPTION
Pipeline

validate classmethod

Pydantic validator, used in the validate_arguments decorated functions

preprocess_many

Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.

Parameters

PARAMETER DESCRIPTION
docs

TYPE: Iterable[Doc]

compress

Whether to deduplicate identical preprocessing outputs of the results if multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables, which do not store referential equality information.

DEFAULT: True

supervision

Whether to include supervision information in the preprocessing

DEFAULT: True

RETURNS DESCRIPTION
Stream

collate

Collates a batch of preprocessed samples into a single (maybe nested) dictionary of tensors by calling the collate method of each component.

Parameters

PARAMETER DESCRIPTION
batch

The batch of preprocessed samples

TYPE: Union[Iterable[Dict[str, Any]], Dict[str, Any]]

device

Should we move the tensors to a device, if so, which one?

TYPE: Optional[Union[str, device]] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Any]

The collated batch

parameters

Returns an iterator over the PyTorch parameters of the components in the pipeline

named_parameters

Returns an iterator over the named PyTorch parameters of the components in the pipeline, as (name, parameter) pairs

to

Moves the pipeline to a given device

train

Enables training mode on PyTorch modules

Parameters

PARAMETER DESCRIPTION
mode

Whether to enable training or not

DEFAULT: True

to_disk

Save the pipeline to a directory.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to save the pipeline to. Every component will be saved to a separate subdirectory of this directory, except for tensors, which will be saved to shared files depending on the references between the components.

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the saving process. By default, the vocabulary is excluded since it may contain personal identifiers and can be rebuilt during inference.

TYPE: Optional[Set[str]] DEFAULT: None

from_disk

Load the pipeline from a directory. Components will be updated in-place.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to load the pipeline from

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the loading process.

TYPE: Optional[Union[str, Sequence[str]]] DEFAULT: None

device

Device to use when loading the tensors

TYPE: Optional[Union[str, device]] DEFAULT: 'cpu'

select_pipes

Temporarily disable and enable components in the pipeline.

Parameters

PARAMETER DESCRIPTION
disable

The name of the component to disable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

enable

The name of the component to enable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

Stream [source]

set_processing [source]

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

num_cpu_workers

Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components. If no GPU workers are used, the CPU workers also handle the forward call of the deep-learning components.

TYPE: Optional[int] DEFAULT: None

num_gpu_workers

Number of GPU workers. A GPU worker handles the forward call of the deep-learning components. Only used with "multiprocessing" backend.

TYPE: Optional[int] DEFAULT: None

disable_implicit_parallelism

Whether to disable OpenMP and Huggingface tokenizers implicit parallelism in multiprocessing mode. Defaults to True.

TYPE: bool DEFAULT: True

backend

The backend to use for parallel processing. If not set, the backend is automatically selected based on the input data and the number of workers.

  • "simple" is the default backend and is used when num_cpu_workers is 1 and num_gpu_workers is 0.
  • "multiprocessing" is used when num_cpu_workers is greater than 1 or num_gpu_workers is greater than 0.
  • "spark" is used when the input data is a Spark dataframe and the output writer is a Spark writer.

TYPE: Optional[Literal['simple', 'multiprocessing', 'mp', 'spark']] DEFAULT: None

autocast

Whether to use automatic mixed precision (AMP) for the forward pass of the deep-learning components. If True, AMP will be used with the default settings. If False, AMP will not be used. If a dtype is provided, it will be passed to the torch.autocast context manager.

TYPE: Union[bool, Any] DEFAULT: None

show_progress

Whether to show progress bars (only applicable with "simple" and "multiprocessing" backends).

TYPE: bool DEFAULT: False

gpu_pipe_names

List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TorchComponent. Only used with "multiprocessing" backend. Inferred from the pipeline if not set.

TYPE: Optional[List[str]] DEFAULT: None

process_start_method

Whether to use "fork" or "spawn" as the start method for the multiprocessing backend. The default is "fork" on Unix systems and "spawn" on Windows.

  • "fork" is the default start method on Unix systems and is the fastest start method, but it is not available on Windows, can cause issues with CUDA and is not safe when using multiple threads.
  • "spawn" is the default start method on Windows and is the safest start method, but it is slower than "fork".

TYPE: Optional[Literal['fork', 'spawn']] DEFAULT: None

gpu_worker_devices

List of GPU devices to use for the GPU workers. Defaults to all available devices, one worker per device. Only used with "multiprocessing" backend.

TYPE: Optional[List[str]] DEFAULT: None

cpu_worker_devices

List of GPU devices to use for the CPU workers. Used for debugging purposes.

TYPE: Optional[List[str]] DEFAULT: None

deterministic

Whether to try and preserve the order of the documents in "multiprocessing" mode. If set to False, workers will process documents whenever they are available, in a dynamic fashion, which may result in out-of-order but usually faster processing. If set to True, tasks will be distributed in a static, round-robin fashion to workers. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Stream

map [source]

Maps a callable to the documents. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection. If the callable is a generator function, each element will be yielded to the stream as is.

Parameters

PARAMETER DESCRIPTION
pipe

The callable to map to the documents.

kwargs

The keyword arguments to pass to the callable.

DEFAULT: {}

RETURNS DESCRIPTION
Stream

flatten [source]

Flattens the stream.

RETURNS DESCRIPTION
Stream

map_batches [source]

Maps a callable to a batch of documents. The callable should take a list of inputs. The output of the callable will be flattened if it is a list or a generator, or yielded to the stream as is if it is a single output (tuple or any other type).

Parameters

PARAMETER DESCRIPTION
pipe

The callable to map to the documents.

kwargs

The keyword arguments to pass to the callable.

DEFAULT: {}

batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

batchify [source]

Accumulates the documents into batches and yields each batch to the stream.

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

map_gpu [source]

Maps a deep learning operation to a batch of documents, on a GPU worker.

Parameters

PARAMETER DESCRIPTION
prepare_batch

A callable that takes a list of documents and a device and returns a batch of tensors (or anything that can be passed to the forward callable). This will be called on a CPU-bound worker, and may be parallelized.

TYPE: Callable[[List, Union[str, device]], Any]

forward

A callable that takes the output of prepare_batch and returns the output of the deep learning operation. This will be called on a GPU-bound worker.

TYPE: Callable[[Any], Any]

postprocess

An optional callable that takes the list of documents and the output of the deep learning operation, and returns the final output. This will be called on the same CPU-bound worker that called the prepare_batch function.

TYPE: Optional[Callable[[List, Any], Any]] DEFAULT: None

batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

map_pipeline [source]

Maps a pipeline to the documents, i.e. adds each component of the pipeline to the stream operations. This function is called under the hood by nlp.pipe().

Parameters

PARAMETER DESCRIPTION
model

The pipeline to map to the documents.

TYPE: Pipeline

batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

shuffle [source]

Shuffles the stream by accumulating the documents into batches and shuffling the batches. We try to optimize and avoid the accumulation by shuffling items directly in the reader, but if some upstream operations are not elementwise or if the reader is not compatible with the batching mode, we have to accumulate the documents into batches and shuffle the batches.

For instance, imagine reading from a list of 2 very large documents and applying an operation that splits the documents into sentences. Shuffling only in the reader and then applying the split operation would not shuffle the sentences across documents, which may lead to a lack of randomness when training a model. Think of this as having lumps after mixing your data. In our case, we detect that the split op is not elementwise and trigger the accumulation of sentences into batches after their generation, before shuffling the batches.

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: Optional[str, BatchFn] DEFAULT: None

seed

The seed to use for shuffling.

TYPE: Optional[int] DEFAULT: None

shuffle_reader

Whether to shuffle the reader. Defaults to True if the reader is compatible with the batch_by mode, False otherwise.

TYPE: Optional[Union[bool, str]] DEFAULT: None

RETURNS DESCRIPTION
Stream

loop [source]

Loops over the stream indefinitely.

Note that we cycle over items produced by the reader, not the items produced by the stream operations. This means that the stream operations will be applied to the same items multiple times, and may produce different results if they are non-deterministic. This also means that calling this function will have the same effect regardless of the operations applied to the stream before calling it, i.e.:

stream.loop().map(...)
# is equivalent to
stream.map(...).loop()
RETURNS DESCRIPTION
Stream

torch_components [source]

Yields components that are PyTorch modules.

RETURNS DESCRIPTION
Iterable['edsnlp.core.torch_component.TorchComponent']

train [source]

Enables training mode on PyTorch modules

Parameters

PARAMETER DESCRIPTION
mode

Whether to enable training or not

DEFAULT: True

eval [source]

Enables evaluation mode on PyTorch modules

SpanAttributeMetric

The eds.span_attribute metric evaluates span‐level attribute classification by comparing predicted and gold attribute values on the same set of spans. For each attribute you specify, it computes Precision, Recall, F1, number of true positives (tp), number of gold instances (support), number of predicted instances (positives), and the Average Precision (ap). A micro‐average over all attributes is also provided under micro_key.

from edsnlp.metrics.span_attribute import SpanAttributeMetric

metric = SpanAttributeMetric(
    span_getter=conv.span_setter,
    # Evaluated attributes
    attributes={
        "neg": True,  # 'neg' on every entity
        "carrier": ["DIS"],  # 'carrier' only on 'DIS' entities
    },
    # Ignore these default values when counting matches
    default_values={
        "neg": False,
    },
    micro_key="micro",
)

Let's enumerate (span -> attr = value) items in our documents. Only items with matching span boundaries, attribute name, and value are counted as true positives. For instance, with the predicted and reference spans of the example above:

pred                              ref
fièvreux → neg = True             fièvreux → neg = True
du diabète → neg = False          du diabète → neg = False
du diabète → carrier = PATIENT    du diabète → carrier = FATHER
cancer → neg = True               cancer → neg = False
cancer → carrier = PATIENT        cancer → carrier = PATIENT

Default values

Note that we don't count "neg=False" items. In EDS-NLP, this is done by setting default_values={"neg": False} when creating the metric. This is quite common in classification tasks, where one of the values is both the most common and the "default" (hence the name of the parameter). Counting these values would likely skew the micro-average metrics towards the default value.

Precision, Recall and F1 (micro-average and per‐label) are computed as follows:

  • Precision: p = |matched items of pred| / |pred|
  • Recall: r = |matched items of ref| / |ref|
  • F1: f = 2 / (1/p + 1/r)

This yields the following metrics:

metric([ref], [pred])
# Out: {
#   'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4, 'ap': 0.17},
#   'neg': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2, 'ap': 0.0},
#   'carrier': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2, 'ap': 0.25},
# }
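These micro-averaged numbers can be re-derived from the raw counts with a small, library-free helper (illustrative only, not the metric's actual implementation):

```python
def prf(tp: int, support: int, positives: int) -> dict:
    """Precision, recall and F1 computed from raw counts."""
    p = tp / positives if positives else 0.0
    r = tp / support if support else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}


# Micro counts from the example above: 2 true positives,
# 3 gold (non-default) items and 4 predicted (non-default) items
micro = prf(tp=2, support=3, positives=4)
print(micro)
# p = 0.5, r ≈ 0.67, f ≈ 0.57
```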

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to extract spans from each Doc.

TYPE: SpanGetterArg

attributes

Map each attribute name to True (evaluate on all spans) or a sequence of labels restricting which spans to test.

TYPE: Mapping[str, Union[bool, Sequence[str]]] DEFAULT: None

default_values

Attribute values to omit from micro‐average counts (e.g., common negative or default labels).

TYPE: Dict[str, Any] DEFAULT: {}

include_falsy

If False, ignore falsy values (e.g., False, None, '') in predictions or gold when computing metrics; if True, count them.

TYPE: bool DEFAULT: False

micro_key

Key under which to store the micro‐averaged results across all attributes.

TYPE: str DEFAULT: 'micro'

filter_expr

A Python expression (using doc) to filter which examples are scored.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Dict[str, float]]

A dictionary mapping each attribute name (and the micro_key) to its metrics:

  • label or micro_key :

    • p : precision
    • r : recall
    • f : F1 score
    • tp : true positive count
    • support : number of gold instances
    • positives : number of predicted instances
    • ap : average precision

__call__

Compute the span attribute metrics for the given examples.

Parameters

PARAMETER DESCRIPTION
examples

The examples to score, either a tuple of (golds, preds) or a list of spacy.training.Example objects

TYPE: Examples DEFAULT: ()

RETURNS DESCRIPTION
Dict[str, Dict[str, float]]

The scores for the attributes

BatchSizeArg

Bases: Validated

Batch size argument validator / caster for confit/pydantic

Examples

def fn(batch_size: BatchSizeArg):
    return batch_size


print(fn("10 samples"))
# Out: (10, "samples")

print(fn("10 words"))
# Out: (10, "words")

print(fn(10))
# Out: (10, "samples")

TorchComponent [source]

Bases: BaseComponent, Module, Generic[BatchOutput, BatchInput]

A TorchComponent is a Component that can be trained and inherits torch.nn.Module. You can use it either as a torch module inside a more complex neural network, or as a standalone component in a Pipeline.

In addition to the methods of a torch module, a TorchComponent adds a few methods to handle preprocessing and collating features, as well as caching intermediate results for components that share a common subcomponent.

post_init [source]

This method completes the attributes of the component, by looking at some documents. It is especially useful to build vocabularies or detect the labels of a classification task.

Parameters

PARAMETER DESCRIPTION
gold_data

The documents to use for initialization.

TYPE: Iterable[Doc]

exclude

The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components

TYPE: Set[str]

preprocess [source]

Preprocess the document to extract features that will be used by the neural network and its subcomponents to perform its predictions.

Parameters

PARAMETER DESCRIPTION
doc

Document to preprocess

TYPE: Doc

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary (optionally nested) containing the features extracted from the document.

collate [source]

Collate the batch of features into a single batch of tensors that can be used by the forward method of the component.

Parameters

PARAMETER DESCRIPTION
batch

Batch of features

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
BatchInput

Dictionary (optionally nested) containing the collated tensors

batch_to_device [source]

Move the batch of tensors to the specified device.

Parameters

PARAMETER DESCRIPTION
batch

Batch of tensors

TYPE: BatchInput

device

Device to move the tensors to

TYPE: Optional[Union[str, device]]

RETURNS DESCRIPTION
BatchInput

forward [source]

Perform the forward pass of the neural network.

Parameters

PARAMETER DESCRIPTION
batch

Batch of tensors (nested dictionary) computed by the collate method

TYPE: BatchInput

RETURNS DESCRIPTION
BatchOutput

compute_training_metrics [source]

Compute post-gather metrics on the batch output. This is a no-op by default. This is useful to compute averages when doing multi-gpu training or mini-batch accumulation since full denominators are not known during the forward pass.

module_forward [source]

This is a wrapper around torch.nn.Module.__call__ to avoid conflicts with the component's __call__ method.

preprocess_batch [source]

Convenience method to preprocess a batch of documents. Features corresponding to the same path are grouped together in a list, under the same key.

Parameters

PARAMETER DESCRIPTION
docs

Batch of documents

TYPE: Sequence[Doc]

supervision

Whether to extract supervision features or not

DEFAULT: False

RETURNS DESCRIPTION
Dict[str, Sequence[Any]]

The batch of features

prepare_batch [source]

Convenience method to preprocess a batch of documents and collate them. Features corresponding to the same path are grouped together in a list, under the same key.

Parameters

PARAMETER DESCRIPTION
docs

Batch of documents

TYPE: Sequence[Doc]

supervision

Whether to extract supervision features or not

TYPE: bool DEFAULT: False

device

Device to move the tensors to

TYPE: Optional[Union[str, device]] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Sequence[Any]]

batch_process [source]

Process a batch of documents using the neural network. This differs from the pipe method in that it does not return an iterator, but executes the component on the whole batch at once.

Parameters

PARAMETER DESCRIPTION
docs

Batch of documents

TYPE: Sequence[Doc]

RETURNS DESCRIPTION
Sequence[Doc]

Batch of updated documents

postprocess [source]

Update the documents with the predictions of the neural network. By default, this is a no-op.

Parameters

PARAMETER DESCRIPTION
docs

List of documents to update

TYPE: Sequence[Doc]

results

Batch of predictions, as returned by the forward method

TYPE: BatchOutput

inputs

List of preprocessed features, as returned by the preprocess method

TYPE: List[Dict[str, Any]]

RETURNS DESCRIPTION
Sequence[Doc]

preprocess_supervised [source]

Preprocess the document to extract features that will be used by the neural network to perform its training. By default, this returns the same features as the preprocess method.

Parameters

PARAMETER DESCRIPTION
doc

Document to preprocess

TYPE: Doc

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary (optionally nested) containing the features extracted from the document.

pipe [source]

Applies the component on a collection of documents. It is recommended to use the Pipeline.pipe method instead of this one to apply a pipeline on a collection of documents, to benefit from the caching of intermediate results.

Parameters

PARAMETER DESCRIPTION
docs

Input docs

TYPE: Iterable[Doc]

batch_size

Batch size to use when grouping documents into batches to be processed at once

DEFAULT: 1

LinearSchedule

Bases: Schedule

Linear schedule for a parameter group. The schedule will linearly increase the value from start_value to max_value in the first warmup_rate of the total_steps and then linearly decrease it to end_value.

Parameters

PARAMETER DESCRIPTION
total_steps

The total number of steps, usually used to calculate ratios.

TYPE: Optional[int] DEFAULT: None

max_value

The maximum value to reach.

TYPE: Optional[Any] DEFAULT: None

start_value

The initial value.

TYPE: float DEFAULT: 0.0

path

The path to the attribute to set.

TYPE: Optional[Union[str, int, List[Union[str, int]]]] DEFAULT: None

warmup_rate

The rate of the warmup.

TYPE: float DEFAULT: 0.0

end_value

The final value to reach after the decay phase. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

ScheduledOptimizer

Bases: Optimizer

Wrapper optimizer that supports schedules for the parameters and easy parameter selection using the key of the groups dictionary as regex patterns to match the parameter names.

Schedules are defined directly in the groups, in place of the scheduled value.

Examples

optim = ScheduledOptimizer(
    cls="adamw",
    module=model,
    groups=[
        # Exclude all parameters matching 'bias' from optimization.
        {
            "selector": "bias",
            "exclude": True,
        },
        # Parameters of the NER module's embedding receive this learning rate
        # schedule. If a parameter matches both 'transformer' and 'ner',
        # the first group settings take precedence due to the order.
        {
            "selector": "^ner[.]embedding",
            "lr": {
                "@schedules": "linear",
                "start_value": 0.0,
                "max_value": 5e-4,
                "warmup_rate": 0.2,
            },
        },
        # Parameters starting with 'ner' receive this learning rate schedule,
        # unless a 'lr' value has already been set by an earlier selector.
        {
            "selector": "^ner",
            "lr": {
                "@schedules": "linear",
                "start_value": 0.0,
                "max_value": 1e-4,
                "warmup_rate": 0.2,
            },
        },
        # Apply a weight_decay of 0.01 to all parameters not excluded.
        # This setting doesn't conflict with others and applies to all.
        {
            "selector": "",
            "weight_decay": 0.01,
        },
    ],
    total_steps=1000,
)

Parameters

PARAMETER DESCRIPTION
optim

The optimizer to use. If a string (like "adamw") or a type to instantiate, the module and groups must be provided.

TYPE: Union[str, Type[Optimizer], Optimizer]

module

The module to optimize. Usually the nlp pipeline object.

TYPE: Optional[Union[PipelineProtocol, Module]] DEFAULT: None

total_steps

The total number of steps, used for schedules.

TYPE: Optional[int] DEFAULT: None

groups

The groups to optimize. Each group is a dictionary containing:

  • a regex selector key to match the parameter of that group by their names (as listed by nlp.named_parameters())
  • and several other keys that define the optimizer parameters for that group, such as lr, weight_decay etc. The value for these keys can be a Schedule instance or a simple value
  • an exclude key that can be set to True to exclude parameters

The matching is performed by running regex.search(selector, name) so you do not have to match the full name. Note that the order of the groups matters. If a parameter name matches multiple selectors, the configurations of these selectors are combined in reverse order (from the last matched selector to the first), allowing later selectors to complete options from earlier ones. If a selector contains exclude=True, any parameter matching it is excluded from optimization.

TYPE: Optional[List[Group]] DEFAULT: None
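The matching and merging rules above can be sketched in plain Python. This is an illustrative sketch, not the actual edsnlp implementation, and resolve_param_config is a hypothetical helper name: earlier matching groups take precedence, later groups only fill in missing keys, and any match on an exclude=True group removes the parameter from optimization.

```python
import re


def resolve_param_config(name, groups):
    # Sketch of the documented semantics (hypothetical helper, not edsnlp code):
    # iterate groups in order; a group whose selector matches the parameter
    # name contributes its options, but only for keys not already set by an
    # earlier matching group. exclude=True drops the parameter entirely.
    config = {}
    for group in groups:
        if re.search(group["selector"], name):
            if group.get("exclude"):
                return None  # parameter excluded from optimization
            for key, value in group.items():
                if key not in ("selector", "exclude") and key not in config:
                    config[key] = value
    return config


groups = [
    {"selector": "bias$", "exclude": True},
    {"selector": "^transformer", "lr": 5e-5},
    {"selector": "^ner", "lr": 1e-4},
    {"selector": "", "weight_decay": 0.01},  # catch-all: applies everywhere
]

resolve_param_config("ner.linear.weight", groups)
# the '^ner' group sets lr, the catch-all completes weight_decay
```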

GenericScorer

A scorer to evaluate the model performance on various tasks.

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size to use for scoring. Can be an int (number of documents) or a string (batching expression like "2000 words").

TYPE: Union[int, str] DEFAULT: 1

speed

Whether to compute the model speed (words/documents per second)

TYPE: bool DEFAULT: True

autocast

Whether to use autocasting for mixed precision during the evaluation. If None (the default), autocasting is enabled.

TYPE: Union[bool, Any] DEFAULT: None

metrics

A mapping, passed as keyword arguments, of metric names to metric objects. See the metrics documentation for more info.

DEFAULT: {}

TrainingData

A training data object.

Parameters

PARAMETER DESCRIPTION
data

The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components.

TYPE: Stream

batch_size

The batch size. Can be a batching expression like "2000 words", an int (number of documents), or a tuple (batch_size, batch_by). The batch_by argument should be a statistic produced by the pipes that will be trained. For instance, the eds.span_pooler component produces a "spans" statistic, that can be used to produce batches of no more than 16 spans by setting batch_size to "16 spans".

TYPE: BatchSizeArg

shuffle

The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file-based datasets), "fragment" to shuffle within fragments of fragment-based datasets (such as parquet files), or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words.

TYPE: Union[str, Literal[False]]

sub_batch_size

How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over. To split a batch of 8000 tokens into smaller batches of 1000 tokens each, just set this to "1000 tokens".

You can also request a number of splits, like "4 splits", to split the batch into N parts each close to (but less than) batch_size / N.

TYPE: Optional[BatchSizeArg] DEFAULT: None

pipe_names

The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes.

TYPE: Optional[AsList[str]] DEFAULT: None

post_init

Whether to call the pipeline's post_init method with the data before training.

TYPE: bool DEFAULT: True

stat_batchify [source]

Create a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the preprocess method of a Pipeline object.

It expects the items to be a dictionary in which some keys contain the string "/stats/" and the key pattern. For instance:

from edsnlp.utils.batching import stat_batchify

items = [
    {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
    {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
    [
        {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
        {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    ],
    [
        {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
    ],
]

Parameters

PARAMETER DESCRIPTION
key

The key pattern to use to determine the actual key to look up in the items.

RETURNS DESCRIPTION
Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable]

decompress_dict [source]

Decompress a dictionary of lists into a sequence of dictionaries. This function assumes that the dictionary structure was obtained using the batch_compress_dict class. Keys that were merged into a single string using the "|" character as a separator will be split into a nested dictionary structure.

Parameters

PARAMETER DESCRIPTION
seq

The dictionary to decompress or a sequence of dictionaries to decompress

TYPE: Union[Iterable[Dict[str, Any]], Dict[str, Any]]

ld_to_dl [source]

Convert a list of dictionaries to a dictionary of lists

Parameters

PARAMETER DESCRIPTION
ld

The list of dictionaries

TYPE: Iterable[Mapping[str, T]]

RETURNS DESCRIPTION
Dict[str, List[T]]

The dictionary of lists
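The conversion is straightforward; a minimal sketch of the described behavior (not the edsnlp implementation):

```python
def ld_to_dl(ld):
    # Sketch (not the edsnlp implementation): gather each key's values
    # across the list of dictionaries into a single list per key.
    out = {}
    for d in ld:
        for key, value in d.items():
            out.setdefault(key, []).append(value)
    return out


ld_to_dl([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
# each key now maps to the list of its values across the input dicts
```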