edsnlp.train

Pipeline

Bases: Validated

New pipeline to use as a drop-in replacement for spaCy's pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.

See the documentation for more details.

Parameters

PARAMETER DESCRIPTION
lang

Language code

TYPE: str

create_tokenizer

Function that creates a tokenizer for the pipeline

TYPE: Optional[Callable[[Self], Tokenizer]] DEFAULT: None

vocab

Whether to create a new vocab or use an existing one

TYPE: Union[bool, Vocab] DEFAULT: True

vocab_config

Configuration for the vocab

TYPE: Optional[Type[BaseDefaults]] DEFAULT: None

meta

Meta information about the pipeline

TYPE: Dict[str, Any] DEFAULT: None

disabled property

The names of the disabled components

cfg property

Returns the config of the pipeline, including the config of all components. Updated from spaCy to allow references between components.

get_pipe

Get a component by its name.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get.

TYPE: str

RETURNS DESCRIPTION
Pipe

has_pipe

Check if a component exists in the pipeline.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to check.

TYPE: str

RETURNS DESCRIPTION
bool

create_pipe

Create a component from a factory name.

Parameters

PARAMETER DESCRIPTION
factory

The name of the factory to use

TYPE: str

name

The name of the component

TYPE: str

config

The config to pass to the factory

TYPE: Dict[str, Any] DEFAULT: None

RETURNS DESCRIPTION
Pipe

add_pipe

Add a component to the pipeline.

Parameters

PARAMETER DESCRIPTION
factory

The name of the component to add or the component itself

TYPE: Union[str, Pipe]

name

The name of the component. If not provided, the name of the component will be used if it has one (.name), otherwise the factory name will be used.

TYPE: Optional[str] DEFAULT: None

first

Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with before and after.

TYPE: bool DEFAULT: False

before

The name of the component to add the new component before. This argument is mutually exclusive with after and first.

TYPE: Optional[str] DEFAULT: None

after

The name of the component to add the new component after. This argument is mutually exclusive with before and first.

TYPE: Optional[str] DEFAULT: None

config

The arguments to pass to the component factory.

Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
Pipe

The component that was added to the pipeline.

get_pipe_meta

Get the meta information for a component.

Parameters

PARAMETER DESCRIPTION
name

The name of the component to get the meta for.

TYPE: str

RETURNS DESCRIPTION
Dict[str, Any]

make_doc

Create a Doc from text.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from.

TYPE: str

RETURNS DESCRIPTION
Doc

__call__

Apply each component successively on a document.

Parameters

PARAMETER DESCRIPTION
text

The text to create the Doc from, or a Doc.

TYPE: Union[str, Doc]

RETURNS DESCRIPTION
Doc

pipe

Process a stream of documents by applying each component successively on batches of documents.

Parameters

PARAMETER DESCRIPTION
inputs

The inputs to create the Docs from, or Docs directly.

TYPE: Union[Iterable, Stream]

n_process

Deprecated. Use the ".set_processing(num_cpu_workers=n_process)" method on the returned data stream instead. The number of parallel workers to use. If 0, the operations will be executed sequentially.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
Stream

cache

Enable caching for all (trainable) components in the pipeline

torch_components

Yields components that are PyTorch modules.

Parameters

PARAMETER DESCRIPTION
disable

The names of disabled components, which will be skipped.

TYPE: Container[str] DEFAULT: ()

RETURNS DESCRIPTION
Iterable[Tuple[str, TorchComponent]]

connected_pipes_names

Returns a list of lists of connected components in the pipeline, i.e. components that share at least one parameter.

RETURNS DESCRIPTION
List[List[str]]

post_init

Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.

Parameters

PARAMETER DESCRIPTION
data

The documents to use for initialization. Each component will not necessarily see all the data.

TYPE: Iterable[Doc]

exclude

Components to exclude from post initialization on data

TYPE: Optional[Set] DEFAULT: None

from_config classmethod

Create a pipeline from a config object

Parameters

PARAMETER DESCRIPTION
config

The config to use

TYPE: Union[Dict[str, Any], Pipeline] DEFAULT: {}

vocab

The spaCy vocab to use. If True, a new vocab will be created

TYPE: Union[Vocab, bool] DEFAULT: True

disable

Components to disable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

enable

Components to enable

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

exclude

Components to exclude

TYPE: Union[str, Iterable[str]] DEFAULT: EMPTY_LIST

meta

Metadata to add to the pipeline

TYPE: Dict[str, Any] DEFAULT: FrozenDict()

RETURNS DESCRIPTION
Pipeline

validate classmethod

Pydantic validator, used in the validate_arguments decorated functions

preprocess_many

Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.

Parameters

PARAMETER DESCRIPTION
docs

TYPE: Iterable[Doc]

compress

Whether to deduplicate identical preprocessing outputs of the results if multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables, which do not store referential equality information.

DEFAULT: True

supervision

Whether to include supervision information in the preprocessing

DEFAULT: True

RETURNS DESCRIPTION
Stream

collate

Collates a batch of preprocessed samples into a single (maybe nested) dictionary of tensors by calling the collate method of each component.

Parameters

PARAMETER DESCRIPTION
batch

The batch of preprocessed samples

TYPE: Union[Iterable[Dict[str, Any]], Dict[str, Any]]

device

Should we move the tensors to a device, if so, which one?

TYPE: Optional[Union[str, device]] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Any]

The collated batch

parameters

Returns an iterator over the PyTorch parameters of the components in the pipeline

named_parameters

Returns an iterator over the named PyTorch parameters of the components in the pipeline, as (name, parameter) pairs

to

Moves the pipeline to a given device

train

Enables training mode on PyTorch modules

Parameters

PARAMETER DESCRIPTION
mode

Whether to enable training or not

DEFAULT: True

to_disk

Save the pipeline to a directory.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to save the pipeline to. Every component will be saved to a separate subdirectory of this directory, except for tensors, which will be saved to shared files depending on the references between the components.

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the saving process. By default, the vocabulary is excluded since it may contain personal identifiers and can be rebuilt during inference.

TYPE: Optional[Set[str]] DEFAULT: None

from_disk

Load the pipeline from a directory. Components will be updated in-place.

Parameters

PARAMETER DESCRIPTION
path

The path to the directory to load the pipeline from

TYPE: Union[str, Path]

exclude

The names of the components, or attributes to exclude from the loading process.

TYPE: Optional[Union[str, Sequence[str]]] DEFAULT: None

device

Device to use when loading the tensors

TYPE: Optional[Union[str, device]] DEFAULT: 'cpu'

select_pipes

Temporarily disable and enable components in the pipeline.

Parameters

PARAMETER DESCRIPTION
disable

The name of the component to disable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

enable

The name of the component to enable, or a list of names.

TYPE: Optional[Union[str, Iterable[str]]] DEFAULT: None

Stream [source]

set_processing [source]

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

num_cpu_workers

Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components. If no GPU workers are used, the CPU workers also handle the forward call of the deep-learning components.

TYPE: Optional[int] DEFAULT: None

num_gpu_workers

Number of GPU workers. A GPU worker handles the forward call of the deep-learning components. Only used with "multiprocessing" backend.

TYPE: Optional[int] DEFAULT: None

disable_implicit_parallelism

Whether to disable OpenMP and Huggingface tokenizers implicit parallelism in multiprocessing mode. Defaults to True.

TYPE: bool DEFAULT: True

backend

The backend to use for parallel processing. If not set, the backend is automatically selected based on the input data and the number of workers.

  • "simple" is the default backend and is used when num_cpu_workers is 1 and num_gpu_workers is 0.
  • "multiprocessing" is used when num_cpu_workers is greater than 1 or num_gpu_workers is greater than 0.
  • "spark" is used when the input data is a Spark dataframe and the output writer is a Spark writer.

TYPE: Optional[Literal['simple', 'multiprocessing', 'mp', 'spark']] DEFAULT: None

autocast

Whether to use automatic mixed precision (AMP) for the forward pass of the deep-learning components. If True, AMP will be used with the default settings. If False, AMP will not be used. If a dtype is provided, it will be passed to the torch.autocast context manager.

TYPE: Union[bool, Any] DEFAULT: None

show_progress

Whether to show progress bars (only applicable with "simple" and "multiprocessing" backends).

TYPE: bool DEFAULT: False

gpu_pipe_names

List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TorchComponent. Only used with "multiprocessing" backend. Inferred from the pipeline if not set.

TYPE: Optional[List[str]] DEFAULT: None

process_start_method

Whether to use "fork" or "spawn" as the start method for the multiprocessing backend. The default is "fork" on Unix systems and "spawn" on Windows.

  • "fork" is the default start method on Unix systems and is the fastest start method, but it is not available on Windows, can cause issues with CUDA and is not safe when using multiple threads.
  • "spawn" is the default start method on Windows and is the safest start method, but it is slower than "fork".

TYPE: Optional[Literal['fork', 'spawn']] DEFAULT: None

gpu_worker_devices

List of GPU devices to use for the GPU workers. Defaults to all available devices, one worker per device. Only used with "multiprocessing" backend.

TYPE: Optional[List[str]] DEFAULT: None

cpu_worker_devices

List of GPU devices to use for the CPU workers. Used for debugging purposes.

TYPE: Optional[List[str]] DEFAULT: None

deterministic

Whether to try and preserve the order of the documents in "multiprocessing" mode. If set to False, workers will process documents whenever they are available, in a dynamic fashion, which may result in out-of-order but usually faster processing. If set to True, tasks will be distributed in a static, round-robin fashion to workers. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Stream

map [source]

Maps a callable to the documents. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection. If the callable is a generator function, each element will be yielded to the stream as is.

Parameters

PARAMETER DESCRIPTION
pipe

The callable to map to the documents.

kwargs

The keyword arguments to pass to the callable.

DEFAULT: {}

RETURNS DESCRIPTION
Stream

flatten [source]

Flattens the stream.

RETURNS DESCRIPTION
Stream

map_batches [source]

Maps a callable to a batch of documents. The callable should take a list of inputs. The output of the callable will be flattened if it is a list or a generator, or yielded to the stream as is if it is a single output (tuple or any other type).

Parameters

PARAMETER DESCRIPTION
pipe

The callable to map to the documents.

kwargs

The keyword arguments to pass to the callable.

DEFAULT: {}

batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

batchify [source]

Accumulates the documents into batches and yields each batch to the stream.

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

map_gpu [source]

Maps a deep learning operation to a batch of documents, on a GPU worker.

Parameters

PARAMETER DESCRIPTION
prepare_batch

A callable that takes a list of documents and a device and returns a batch of tensors (or anything that can be passed to the forward callable). This will be called on a CPU-bound worker, and may be parallelized.

TYPE: Callable[[List, Union[str, device]], Any]

forward

A callable that takes the output of prepare_batch and returns the output of the deep learning operation. This will be called on a GPU-bound worker.

TYPE: Callable[[Any], Any]

postprocess

An optional callable that takes the list of documents and the output of the deep learning operation, and returns the final output. This will be called on the same CPU-bound worker that called the prepare_batch function.

TYPE: Optional[Callable[[List, Any], Any]] DEFAULT: None

batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

map_pipeline [source]

Maps a pipeline to the documents, i.e. adds each component of the pipeline to the stream operations. This function is called under the hood by nlp.pipe().

Parameters

PARAMETER DESCRIPTION
model

The pipeline to map to the documents.

TYPE: Pipeline

batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: BatchBy DEFAULT: None

RETURNS DESCRIPTION
Stream

shuffle [source]

Shuffles the stream by accumulating the documents into batches and shuffling the batches. We try to optimize and avoid the accumulation by shuffling items directly in the reader, but if some upstream operations are not elementwise or if the reader is not compatible with the batching mode, we have to accumulate the documents into batches and shuffle the batches.

For instance, imagine reading from a list of 2 very large documents and applying an operation that splits the documents into sentences. Shuffling only in the reader and then applying the split operation would not shuffle the sentences across documents, which may lead to a lack of randomness when training a model. Think of this as having lumps after mixing your data. In our case, we detect that the split op is not elementwise and trigger the accumulation of sentences into batches after their generation, before shuffling the batches.

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc.

TYPE: Optional[Union[int, float, str]] DEFAULT: None

batch_by

Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs".

TYPE: Optional[str, BatchFn] DEFAULT: None

seed

The seed to use for shuffling.

TYPE: Optional[int] DEFAULT: None

shuffle_reader

Whether to shuffle the reader. Defaults to True if the reader is compatible with the batch_by mode, False otherwise.

TYPE: Optional[Union[bool, str]] DEFAULT: None

RETURNS DESCRIPTION
Stream

loop [source]

Loops over the stream indefinitely.

Note that we cycle over items produced by the reader, not the items produced by the stream operations. This means that the stream operations will be applied to the same items multiple times, and may produce different results if they are non-deterministic. This also means that calling this function will have the same effect regardless of the operations applied to the stream before calling it, i.e.:

stream.loop().map(...)
# is equivalent to
stream.map(...).loop()
RETURNS DESCRIPTION
Stream

torch_components [source]

Yields components that are PyTorch modules.

RETURNS DESCRIPTION
Iterable['edsnlp.core.torch_component.TorchComponent']

train [source]

Enables training mode on PyTorch modules

Parameters

PARAMETER DESCRIPTION
mode

Whether to enable training or not

DEFAULT: True

eval [source]

Enables evaluation mode on PyTorch modules

SpanAttributeMetric

The eds.span_attribute metric evaluates span‐level attribute classification by comparing predicted and gold attribute values on the same set of spans. For each attribute you specify, it computes Precision, Recall, F1, number of true positives (tp), number of gold instances (support), number of predicted instances (positives), and the Average Precision (ap). A micro‐average over all attributes is also provided under micro_key.

from edsnlp.metrics.span_attribute import SpanAttributeMetric

metric = SpanAttributeMetric(
    span_getter=conv.span_setter,
    # Evaluated attributes
    attributes={
        "neg": True,  # 'neg' on every entity
        "carrier": ["DIS"],  # 'carrier' only on 'DIS' entities
    },
    # Ignore these default values when counting matches
    default_values={
        "neg": False,
    },
    micro_key="micro",
)

Let's enumerate (span -> attr = value) items in our documents. Only items with matching span boundaries, attribute name, and value are counted as true positives. For instance, with the predicted and reference spans of the example above:

pred                              ref
fièvreux → neg = True             fièvreux → neg = True
du diabète → neg = False          du diabète → neg = False
du diabète → carrier = PATIENT    du diabète → carrier = FATHER
cancer → neg = True               cancer → neg = False
cancer → carrier = PATIENT        cancer → carrier = PATIENT

Default values

Note that we don't count "neg=False" items. In EDS-NLP, this is done by setting default_values={"neg": False} when creating the metric. This is quite common in classification tasks, where one of the values is both the most common and the "default" (hence the name of the parameter). Counting these values would likely skew the micro-average metrics towards the default value.

Precision, Recall and F1 (micro-average and per‐label) are computed as follows:

  • Precision: p = |matched items of pred| / |pred|
  • Recall: r = |matched items of ref| / |ref|
  • F1: f = 2 / (1/p + 1/r)

This yields the following metrics:

metric([ref], [pred])
# Out: {
#   'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4, 'ap': 0.17},
#   'neg': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2, 'ap': 0.0},
#   'carrier': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2, 'ap': 0.25},
# }
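These micro-averaged numbers can be re-derived from the raw counts with a small, library-free helper (illustrative only, not the metric's actual implementation):

```python
def prf(tp: int, support: int, positives: int) -> dict:
    """Precision, recall and F1 computed from raw counts."""
    p = tp / positives if positives else 0.0
    r = tp / support if support else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"p": p, "r": r, "f": f}


# Micro counts from the example above: 2 true positives,
# 3 gold (non-default) items and 4 predicted (non-default) items
micro = prf(tp=2, support=3, positives=4)
print(micro)
# p = 0.5, r ≈ 0.67, f ≈ 0.57
```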

Parameters

PARAMETER DESCRIPTION
span_getter

The span getter to extract spans from each Doc.

TYPE: SpanGetterArg

attributes

Map each attribute name to True (evaluate on all spans) or a sequence of labels restricting which spans to test.

TYPE: Mapping[str, Union[bool, Sequence[str]]] DEFAULT: None

default_values

Attribute values to omit from micro‐average counts (e.g., common negative or default labels).

TYPE: Dict[str, Any] DEFAULT: {}

include_falsy

If False, ignore falsy values (e.g., False, None, '') in predictions or gold when computing metrics; if True, count them.

TYPE: bool DEFAULT: False

micro_key

Key under which to store the micro‐averaged results across all attributes.

TYPE: str DEFAULT: 'micro'

filter_expr

A Python expression (using doc) to filter which examples are scored.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Dict[str, float]]

A dictionary mapping each attribute name (and the micro_key) to its metrics:

  • label or micro_key :

    • p : precision
    • r : recall
    • f : F1 score
    • tp : true positive count
    • support : number of gold instances
    • positives : number of predicted instances
    • ap : average precision

__call__

Compute the span attribute metrics for the given examples.

Parameters

PARAMETER DESCRIPTION
examples

The examples to score, either a tuple of (golds, preds) or a list of spacy.training.Example objects

TYPE: Examples DEFAULT: ()

RETURNS DESCRIPTION
Dict[str, Dict[str, float]]

The scores for the attributes

BatchSizeArg

Bases: Validated

Batch size argument validator / caster for confit/pydantic

Examples

def fn(batch_size: BatchSizeArg):
    return batch_size


print(fn("10 samples"))
# Out: (10, "samples")

print(fn("10 words"))
# Out: (10, "words")

print(fn(10))
# Out: (10, "samples")

TorchComponent [source]

Bases: BaseComponent, Module, Generic[BatchOutput, BatchInput]

A TorchComponent is a Component that can be trained and inherits torch.nn.Module. You can use it either as a torch module inside a more complex neural network, or as a standalone component in a Pipeline.

In addition to the methods of a torch module, a TorchComponent adds a few methods to handle preprocessing and collating features, as well as caching intermediate results for components that share a common subcomponent.

post_init [source]

This method completes the attributes of the component, by looking at some documents. It is especially useful to build vocabularies or detect the labels of a classification task.

Parameters

PARAMETER DESCRIPTION
gold_data

The documents to use for initialization.

TYPE: Iterable[Doc]

exclude

The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components

TYPE: Set[str]

preprocess [source]

Preprocess the document to extract features that will be used by the neural network and its subcomponents to perform its predictions.

Parameters

PARAMETER DESCRIPTION
doc

Document to preprocess

TYPE: Doc

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary (optionally nested) containing the features extracted from the document.

collate [source]

Collate the batch of features into a single batch of tensors that can be used by the forward method of the component.

Parameters

PARAMETER DESCRIPTION
batch

Batch of features

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
BatchInput

Dictionary (optionally nested) containing the collated tensors

batch_to_device [source]

Move the batch of tensors to the specified device.

Parameters

PARAMETER DESCRIPTION
batch

Batch of tensors

TYPE: BatchInput

device

Device to move the tensors to

TYPE: Optional[Union[str, device]]

RETURNS DESCRIPTION
BatchInput

forward [source]

Perform the forward pass of the neural network.

Parameters

PARAMETER DESCRIPTION
batch

Batch of tensors (nested dictionary) computed by the collate method

TYPE: BatchInput

RETURNS DESCRIPTION
BatchOutput

compute_training_metrics [source]

Compute post-gather metrics on the batch output. This is a no-op by default. This is useful to compute averages when doing multi-gpu training or mini-batch accumulation since full denominators are not known during the forward pass.

module_forward [source]

This is a wrapper around torch.nn.Module.__call__ to avoid conflicts with the component's __call__ method.

preprocess_batch [source]

Convenience method to preprocess a batch of documents. Features corresponding to the same path are grouped together in a list, under the same key.

Parameters

PARAMETER DESCRIPTION
docs

Batch of documents

TYPE: Sequence[Doc]

supervision

Whether to extract supervision features or not

DEFAULT: False

RETURNS DESCRIPTION
Dict[str, Sequence[Any]]

The batch of features

prepare_batch [source]

Convenience method to preprocess a batch of documents and collate them. Features corresponding to the same path are grouped together in a list, under the same key.

Parameters

PARAMETER DESCRIPTION
docs

Batch of documents

TYPE: Sequence[Doc]

supervision

Whether to extract supervision features or not

TYPE: bool DEFAULT: False

device

Device to move the tensors to

TYPE: Optional[Union[str, device]] DEFAULT: None

RETURNS DESCRIPTION
Dict[str, Sequence[Any]]

batch_process [source]

Process a batch of documents using the neural network. This differs from the pipe method in that it does not return an iterator, but executes the component on the whole batch at once.

Parameters

PARAMETER DESCRIPTION
docs

Batch of documents

TYPE: Sequence[Doc]

RETURNS DESCRIPTION
Sequence[Doc]

Batch of updated documents

postprocess [source]

Update the documents with the predictions of the neural network. By default, this is a no-op.

Parameters

PARAMETER DESCRIPTION
docs

List of documents to update

TYPE: Sequence[Doc]

results

Batch of predictions, as returned by the forward method

TYPE: BatchOutput

inputs

List of preprocessed features, as returned by the preprocess method

TYPE: List[Dict[str, Any]]

RETURNS DESCRIPTION
Sequence[Doc]

preprocess_supervised [source]

Preprocess the document to extract features that will be used by the neural network to perform its training. By default, this returns the same features as the preprocess method.

Parameters

PARAMETER DESCRIPTION
doc

Document to preprocess

TYPE: Doc

RETURNS DESCRIPTION
Dict[str, Any]

Dictionary (optionally nested) containing the features extracted from the document.

pipe [source]

Applies the component on a collection of documents. It is recommended to use the Pipeline.pipe method instead of this one to apply a pipeline on a collection of documents, to benefit from the caching of intermediate results.

Parameters

PARAMETER DESCRIPTION
docs

Input docs

TYPE: Iterable[Doc]

batch_size

Batch size to use when grouping documents into batches to be processed at once

DEFAULT: 1

LinearSchedule

Bases: Schedule

Linear schedule for a parameter group. The schedule will linearly increase the value from start_value to max_value in the first warmup_rate of the total_steps and then linearly decrease it to end_value.

Parameters

PARAMETER DESCRIPTION
total_steps

The total number of steps, usually used to calculate ratios.

TYPE: Optional[int] DEFAULT: None

max_value

The maximum value to reach.

TYPE: Optional[Any] DEFAULT: None

start_value

The initial value.

TYPE: float DEFAULT: 0.0

path

The path to the attribute to set.

TYPE: Optional[Union[str, int, List[Union[str, int]]]] DEFAULT: None

warmup_rate

The rate of the warmup.

TYPE: float DEFAULT: 0.0

end_value

The final value to reach after the decay phase. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

ScheduledOptimizer

Bases: Optimizer

Wrapper optimizer that supports schedules for the parameters and easy parameter selection using the key of the groups dictionary as regex patterns to match the parameter names.

Schedules are defined directly in the groups, in place of the scheduled value.

Examples

optim = ScheduledOptimizer(
    cls="adamw",
    module=model,
    groups=[
        # Exclude all parameters matching 'bias' from optimization.
        {
            "selector": "bias",
            "exclude": True,
        },
        # Parameters of the NER module's embedding receive this learning rate
        # schedule. If a parameter matches both 'transformer' and 'ner',
        # the first group settings take precedence due to the order.
        {
            "selector": "^ner[.]embedding",
            "lr": {
                "@schedules": "linear",
                "start_value": 0.0,
                "max_value": 5e-4,
                "warmup_rate": 0.2,
            },
        },
        # Parameters starting with 'ner' receive this learning rate schedule,
        # unless a 'lr' value has already been set by an earlier selector.
        {
            "selector": "^ner",
            "lr": {
                "@schedules": "linear",
                "start_value": 0.0,
                "max_value": 1e-4,
                "warmup_rate": 0.2,
            },
        },
        # Apply a weight_decay of 0.01 to all parameters not excluded.
        # This setting doesn't conflict with others and applies to all.
        {
            "selector": "",
            "weight_decay": 0.01,
        },
    ],
    total_steps=1000,
)

Parameters

PARAMETER DESCRIPTION
optim

The optimizer to use. If a string (like "adamw") or a type to instantiate, the module and groups must be provided.

TYPE: Union[str, Type[Optimizer], Optimizer]

module

The module to optimize. Usually the nlp pipeline object.

TYPE: Optional[Union[PipelineProtocol, Module]] DEFAULT: None

total_steps

The total number of steps, used for schedules.

TYPE: Optional[int] DEFAULT: None

groups

The groups to optimize. Each group is a dictionary containing:

  • a regex selector key to match the parameter of that group by their names (as listed by nlp.named_parameters())
  • and several other keys that define the optimizer parameters for that group, such as lr, weight_decay etc. The value for these keys can be a Schedule instance or a simple value
  • an exclude key that can be set to True to exclude parameters

The matching is performed by running regex.search(selector, name) so you do not have to match the full name. Note that the order of the groups matters. If a parameter name matches multiple selectors, the configurations of these selectors are combined in reverse order (from the last matched selector to the first), allowing later selectors to complete options from earlier ones. If a selector contains exclude=True, any parameter matching it is excluded from optimization.

TYPE: Optional[List[Group]] DEFAULT: None
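The matching and merging rules above can be sketched in plain Python. This is an illustrative sketch, not the actual edsnlp implementation, and resolve_param_config is a hypothetical helper name: earlier matching groups take precedence, later groups only fill in missing keys, and any match on an exclude=True group removes the parameter from optimization.

```python
import re


def resolve_param_config(name, groups):
    # Sketch of the documented semantics (hypothetical helper, not edsnlp code):
    # iterate groups in order; a group whose selector matches the parameter
    # name contributes its options, but only for keys not already set by an
    # earlier matching group. exclude=True drops the parameter entirely.
    config = {}
    for group in groups:
        if re.search(group["selector"], name):
            if group.get("exclude"):
                return None  # parameter excluded from optimization
            for key, value in group.items():
                if key not in ("selector", "exclude") and key not in config:
                    config[key] = value
    return config


groups = [
    {"selector": "bias$", "exclude": True},
    {"selector": "^transformer", "lr": 5e-5},
    {"selector": "^ner", "lr": 1e-4},
    {"selector": "", "weight_decay": 0.01},  # catch-all: applies everywhere
]

resolve_param_config("ner.linear.weight", groups)
# the '^ner' group sets lr, the catch-all completes weight_decay
```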

GenericScorer

A scorer to evaluate the model performance on various tasks.

Parameters

PARAMETER DESCRIPTION
batch_size

The batch size to use for scoring. Can be an int (number of documents) or a string (batching expression like "2000 words").

TYPE: Union[int, str] DEFAULT: 1

speed

Whether to compute the model speed (words/documents per second)

TYPE: bool DEFAULT: True

autocast

Whether to use autocasting for mixed precision during the evaluation. If None (the default), autocasting is enabled.

TYPE: Union[bool, Any] DEFAULT: None

metrics

A mapping, passed as keyword arguments, of metric names to metric objects. See the metrics documentation for more info.

DEFAULT: {}

TrainingData

A training data object.

Parameters

PARAMETER DESCRIPTION
data

The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components.

TYPE: Stream

batch_size

The batch size. Can be a batching expression like "2000 words", an int (number of documents), or a tuple (batch_size, batch_by). The batch_by argument should be a statistic produced by the pipes that will be trained. For instance, the eds.span_pooler component produces a "spans" statistic, that can be used to produce batches of no more than 16 spans by setting batch_size to "16 spans".

TYPE: BatchSizeArg

shuffle

The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file-based datasets), "fragment" to shuffle within fragments of fragment-based datasets (such as parquet files), or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words.

TYPE: Union[str, Literal[False]]

sub_batch_size

How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over. To split a batch of 8000 tokens into smaller batches of 1000 tokens each, just set this to "1000 tokens".

You can also request a number of splits, like "4 splits", to split the batch into N parts each close to (but less than) batch_size / N.

TYPE: Optional[BatchSizeArg] DEFAULT: None

pipe_names

The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes.

TYPE: Optional[AsList[str]] DEFAULT: None

post_init

Whether to call the pipeline's post_init method with the data before training.

TYPE: bool DEFAULT: True

stat_batchify [source]

Create a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the preprocess method of a Pipeline object.

It expects the items to be a dictionary in which some keys contain the string "/stats/" and the key pattern. For instance:

from edsnlp.utils.batching import stat_batchify

items = [
    {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
    {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
    [
        {"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
        {"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
    ],
    [
        {"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
    ],
]

Parameters

PARAMETER DESCRIPTION
key

The key pattern to use to determine the actual key to look up in the items.

RETURNS DESCRIPTION
Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable]

decompress_dict [source]

Decompress a dictionary of lists into a sequence of dictionaries. This function assumes that the dictionary structure was obtained using the batch_compress_dict class. Keys that were merged into a single string using the "|" character as a separator will be split into a nested dictionary structure.

Parameters

PARAMETER DESCRIPTION
seq

The dictionary to decompress or a sequence of dictionaries to decompress

TYPE: Union[Iterable[Dict[str, Any]], Dict[str, Any]]

ld_to_dl [source]

Convert a list of dictionaries to a dictionary of lists

Parameters

PARAMETER DESCRIPTION
ld

The list of dictionaries

TYPE: Iterable[Mapping[str, T]]

RETURNS DESCRIPTION
Dict[str, List[T]]

The dictionary of lists
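The conversion is straightforward; a minimal sketch of the described behavior (not the edsnlp implementation):

```python
def ld_to_dl(ld):
    # Sketch (not the edsnlp implementation): gather each key's values
    # across the list of dictionaries into a single list per key.
    out = {}
    for d in ld:
        for key, value in d.items():
            out.setdefault(key, []).append(value)
    return out


ld_to_dl([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
# each key now maps to the list of its values across the input dicts
```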