edsnlp.train
Pipeline
Bases: Validated
New pipeline to use as a drop-in replacement for spaCy's pipeline. It uses PyTorch as the deep-learning backend and allows components to share subcomponents.
See the documentation for more details.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
lang | Language code TYPE: |
create_tokenizer | Function that creates a tokenizer for the pipeline TYPE: |
vocab | Whether to create a new vocab or use an existing one TYPE: |
vocab_config | Configuration for the vocab TYPE: |
meta | Meta information about the pipeline TYPE: |
disabled property
The names of the disabled components
cfg property
Returns the config of the pipeline, including the config of all components. Updated from spacy to allow references between components.
get_pipe
Get a component by its name.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
name | The name of the component to get. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Pipe | |
has_pipe
Check if a component exists in the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
name | The name of the component to check. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
bool | |
create_pipe
Create a component from a factory name.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
factory | The name of the factory to use TYPE: |
name | The name of the component TYPE: |
config | The config to pass to the factory TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Pipe | |
add_pipe
Add a component to the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
factory | The name of the component to add or the component itself TYPE: |
name | The name of the component. If not provided, the name of the component will be used if it has one (.name), otherwise the factory name will be used. TYPE: |
first | Whether to add the component to the beginning of the pipeline. This argument is mutually exclusive with TYPE: |
before | The name of the component to add the new component before. This argument is mutually exclusive with TYPE: |
after | The name of the component to add the new component after. This argument is mutually exclusive with TYPE: |
config | The arguments to pass to the component factory. Note that instead of replacing arguments with the same keys, the config will be merged with the default config of the component. This means that you can override specific nested arguments without having to specify the entire config. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Pipe | The component that was added to the pipeline. |
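The config-merging behavior described above can be illustrated with a small pure-Python sketch (`merge_configs` is a hypothetical helper, not part of the edsnlp API):

```python
def merge_configs(default, override):
    # Merge `override` into `default`, recursing into nested dicts instead
    # of replacing them wholesale, so specific nested keys can be overridden
    # without re-specifying the entire config.
    merged = dict(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged


default = {"embedding": {"size": 128, "dropout": 0.1}, "labels": None}
override = {"embedding": {"dropout": 0.2}}
print(merge_configs(default, override))
# → {'embedding': {'size': 128, 'dropout': 0.2}, 'labels': None}
```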
get_pipe_meta
Get the meta information for a component.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
name | The name of the component to get the meta for. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any] | |
make_doc
Create a Doc from text.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
text | The text to create the Doc from. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Doc | |
__call__
Apply each component successively on a document.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
text | The text to create the Doc from, or a Doc. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Doc | |
pipe
Process a stream of documents by applying each component successively on batches of documents.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
inputs | The inputs to create the Docs from, or Docs directly. TYPE: |
n_process | Deprecated. Use the ".set_processing(num_cpu_workers=n_process)" method on the returned data stream instead. The number of parallel workers to use. If 0, the operations will be executed sequentially. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
cache
Enable caching for all (trainable) components in the pipeline
torch_components
Yields components that are PyTorch modules.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
disable | The names of disabled components, which will be skipped. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Iterable[Tuple[str, TorchComponent]] | |
connected_pipes_names
Returns a list of lists of connected components in the pipeline, i.e. components that share at least one parameter.
| RETURNS | DESCRIPTION |
|---|---|
List[List[str]] | |
post_init
Completes the initialization of the pipeline by calling the post_init method of all components that have one. This is useful for components that need to see some data to build their vocabulary, for instance.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
data | The documents to use for initialization. Each component will not necessarily see all the data. TYPE: |
exclude | Components to exclude from post initialization on data TYPE: |
from_config classmethod
Create a pipeline from a config object
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
config | The config to use TYPE: |
vocab | The spaCy vocab to use. If True, a new vocab will be created TYPE: |
disable | Components to disable TYPE: |
enable | Components to enable TYPE: |
exclude | Components to exclude TYPE: |
meta | Metadata to add to the pipeline TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Pipeline | |
validate classmethod
Pydantic validator, used in the validate_arguments decorated functions
preprocess_many
Runs the preprocessing methods of each component in the pipeline on a collection of documents and returns an iterable of dictionaries containing the results, with the component names as keys.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
docs | TYPE: |
compress | Whether to deduplicate identical preprocessing outputs of the results if multiple documents share identical subcomponents. This step is required to enable the cache mechanism when training or running the pipeline over tabular datasets such as pyarrow tables that do not store referential equality information. DEFAULT: |
supervision | Whether to include supervision information in the preprocessing DEFAULT: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
collate
Collates a batch of preprocessed samples into a single (maybe nested) dictionary of tensors by calling the collate method of each component.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch | The batch of preprocessed samples TYPE: |
device | Should we move the tensors to a device, if so, which one? TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any] | The collated batch |
parameters
Returns an iterator over the Pytorch parameters of the components in the pipeline
named_parameters
Returns an iterator over the Pytorch parameters of the components in the pipeline
to
Moves the pipeline to a given device
train
Enables training mode on pytorch modules
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
mode | Whether to enable training or not DEFAULT: |
to_disk
Save the pipeline to a directory.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
path | The path to the directory to save the pipeline to. Every component will be saved to separated subdirectories of this directory, except for tensors that will be saved to a shared files depending on the references between the components. TYPE: |
exclude | The names of the components, or attributes to exclude from the saving process. By default, the vocabulary is excluded since it may contain personal identifiers and can be rebuilt during inference. TYPE: |
from_disk
Load the pipeline from a directory. Components will be updated in-place.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
path | The path to the directory to load the pipeline from TYPE: |
exclude | The names of the components, or attributes to exclude from the loading process. TYPE: |
device | Device to use when loading the tensors TYPE: |
select_pipes
Temporarily disable and enable components in the pipeline.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
disable | The name of the component to disable, or a list of names. TYPE: |
enable | The name of the component to enable, or a list of names. TYPE: |
Stream [source]
set_processing [source]
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch_size | The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc. TYPE: |
batch_by | Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs". TYPE: |
num_cpu_workers | Number of CPU workers. A CPU worker handles the non deep-learning components and the preprocessing, collating and postprocessing of deep-learning components. If no GPU workers are used, the CPU workers also handle the forward call of the deep-learning components. TYPE: |
num_gpu_workers | Number of GPU workers. A GPU worker handles the forward call of the deep-learning components. Only used with "multiprocessing" backend. TYPE: |
disable_implicit_parallelism | Whether to disable OpenMP and Huggingface tokenizers implicit parallelism in multiprocessing mode. Defaults to True. TYPE: |
backend | The backend to use for parallel processing. If not set, the backend is automatically selected based on the input data and the number of workers. TYPE: |
autocast | Whether to use automatic mixed precision (AMP) for the forward pass of the deep-learning components. If True (by default), AMP will be used with the default settings. If False, AMP will not be used. If a dtype is provided, it will be passed to the TYPE: |
show_progress | Whether to show progress bars (only applicable with "simple" and "multiprocessing" backends). TYPE: |
gpu_pipe_names | List of pipe names to accelerate on a GPUWorker, defaults to all pipes that inherit from TorchComponent. Only used with "multiprocessing" backend. Inferred from the pipeline if not set. TYPE: |
process_start_method | The start method for the multiprocessing backend, either "fork" or "spawn". The default is "fork" on Unix systems and "spawn" on Windows. TYPE: |
gpu_worker_devices | List of GPU devices to use for the GPU workers. Defaults to all available devices, one worker per device. Only used with "multiprocessing" backend. TYPE: |
cpu_worker_devices | List of GPU devices to use for the CPU workers. Used for debugging purposes. TYPE: |
deterministic | Whether to try and preserve the order of the documents in "multiprocessing" mode. If set to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
map [source]
Maps a callable to the documents. It takes a callable as input and an optional dictionary of keyword arguments. The function will be applied to each element of the collection. If the callable is a generator function, each element will be yielded to the stream as is.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
pipe | The callable to map to the documents. |
kwargs | The keyword arguments to pass to the callable. DEFAULT: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
map_batches [source]
Maps a callable to a batch of documents. The callable should take a list of inputs. The output of the callable will be flattened if it is a list or a generator, or yielded to the stream as is if it is a single output (tuple or any other type).
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
pipe | The callable to map to the documents. |
kwargs | The keyword arguments to pass to the callable. DEFAULT: |
batch_size | The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc. TYPE: |
batch_by | Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs". TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
batchify [source]
Accumulates the documents into batches and yield each batch to the stream.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch_size | The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc. TYPE: |
batch_by | Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs". TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
map_gpu [source]
Maps a deep learning operation to a batch of documents, on a GPU worker.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
prepare_batch | A callable that takes a list of documents and a device and returns a batch of tensors (or anything that can be passed to the TYPE: |
forward | A callable that takes the output of TYPE: |
postprocess | An optional callable that takes the list of documents and the output of the deep learning operation, and returns the final output. This will be called on the same CPU-bound worker that called the TYPE: |
batch_size | The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc. TYPE: |
batch_by | Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs". TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
map_pipeline [source]
Maps a pipeline to the documents, i.e. adds each component of the pipeline to the stream operations. This function is called under the hood by nlp.pipe()
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
model | The pipeline to map to the documents. TYPE: |
batch_size | The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc. TYPE: |
batch_by | Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs". TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
shuffle [source]
Shuffles the stream by accumulating the documents into batches and shuffling the batches. We try to optimize and avoid the accumulation by shuffling items directly in the reader, but if some upstream operations are not elementwise or if the reader is not compatible with the batching mode, we have to accumulate the documents into batches and shuffle the batches.
For instance, imagine reading from a list of two very large documents and applying an operation that splits the documents into sentences. Shuffling only in the reader and then applying the split operation would not shuffle the sentences across documents, which may lead to a lack of randomness when training a model. Think of this as having lumps after mixing your data. In our case, we detect that the split op is not elementwise and trigger the accumulation of sentences into batches after their generation, before shuffling the batches.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch_size | The batch size. Can also be a batching expression like "32 docs", "1024 words", "dataset", "fragment", etc. TYPE: |
batch_by | Function to compute the batches. If set, it should take an iterable of documents and return an iterable of batches. You can also set it to "docs", "words" or "padded_words" to use predefined batching functions. Defaults to "docs". TYPE: |
seed | The seed to use for shuffling. TYPE: |
shuffle_reader | Whether to shuffle the reader. Defaults to True if the reader is compatible with the batch_by mode, False otherwise. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
loop [source]
Loops over the stream indefinitely.
Note that we cycle over items produced by the reader, not the items produced by the stream operations. This means that the stream operations will be applied to the same items multiple times, and may produce different results if they are non-deterministic. This also means that calling this function has the same effect regardless of the operations applied to the stream before calling it, i.e.:
stream.loop().map(...)
# is equivalent to
stream.map(...).loop()
| RETURNS | DESCRIPTION |
|---|---|
Stream | |
torch_components [source]
Yields components that are PyTorch modules.
| RETURNS | DESCRIPTION |
|---|---|
Iterable['edsnlp.core.torch_component.TorchComponent'] | |
train [source]
Enables training mode on pytorch modules
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
mode | Whether to enable training or not DEFAULT: |
eval [source]
Enables evaluation mode on pytorch modules
SpanAttributeMetric
The eds.span_attribute metric evaluates span‐level attribute classification by comparing predicted and gold attribute values on the same set of spans. For each attribute you specify, it computes Precision, Recall, F1, number of true positives (tp), number of gold instances (support), number of predicted instances (positives), and the Average Precision (ap). A micro‐average over all attributes is also provided under micro_key.
from edsnlp.metrics.span_attribute import SpanAttributeMetric
metric = SpanAttributeMetric(
span_getter=conv.span_setter,
# Evaluated attributes
attributes={
"neg": True, # 'neg' on every entity
"carrier": ["DIS"], # 'carrier' only on 'DIS' entities
},
# Ignore these default values when counting matches
default_values={
"neg": False,
},
micro_key="micro",
)
Let's enumerate the (span -> attr = value) items in our documents. Only items with matching span boundaries, attribute name, and value are counted as true positives. For instance, with the predicted and reference spans of the example above:
| pred | ref |
|---|---|
| fièvreux → neg = True | fièvreux → neg = True |
Default values
Note that we don't count "neg=False" items. In EDS-NLP, this is done by setting default_values={"neg": False} when creating the metric. This is quite common in classification tasks, where one of the values is both the most common and the "default" (hence the name of the parameter). Counting these values would likely skew the micro-average metrics towards the default value.
Precision, Recall and F1 (micro-average and per-label) are computed as follows:
- Precision: p = |matched items of pred| / |pred|
- Recall: r = |matched items of ref| / |ref|
- F1: f = 2 / (1/p + 1/r)
This yields the following metrics:
metric([ref], [pred])
# Out: {
# 'micro': {'f': 0.57, 'p': 0.5, 'r': 0.67, 'tp': 2, 'support': 3, 'positives': 4, 'ap': 0.17},
# 'neg': {'f': 0.67, 'p': 0.5, 'r': 1, 'tp': 1, 'support': 1, 'positives': 2, 'ap': 0.0},
# 'carrier': {'f': 0.5, 'p': 0.5, 'r': 0.5, 'tp': 1, 'support': 2, 'positives': 2, 'ap': 0.25},
# }
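The micro row above can be sanity-checked by hand from its counts (this helper is illustrative, not part of the metric's API):

```python
def prf(tp, positives, support):
    # positives = number of predicted items, support = number of gold items
    p = tp / positives if positives else 0.0
    r = tp / support if support else 0.0
    f = 2 / (1 / p + 1 / r) if p and r else 0.0
    return p, r, f


p, r, f = prf(tp=2, positives=4, support=3)
print(round(p, 2), round(r, 2), round(f, 2))
# → 0.5 0.67 0.57
```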
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
span_getter | The span getter to extract spans from each TYPE: |
attributes | Map each attribute name to TYPE: |
default_values | Attribute values to omit from micro‐average counts (e.g., common negative or default labels). TYPE: |
include_falsy | If TYPE: |
micro_key | Key under which to store the micro‐averaged results across all attributes. TYPE: |
filter_expr | A Python expression (using TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Dict[str, float]] | A dictionary mapping each attribute name (and the
|
__call__
Compute the span attribute metrics for the given examples.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
examples | The examples to score, either a tuple of (golds, preds) or a list of spacy.training.Example objects TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Dict[str, float]] | The scores for the attributes |
BatchSizeArg
Bases: Validated
Batch size argument validator / caster for confit/pydantic
Examples
def fn(batch_size: BatchSizeArg):
return batch_size
print(fn("10 samples"))
# Out: (10, "samples")
print(fn("10 words"))
# Out: (10, "words")
print(fn(10))
# Out: (10, "samples")
TorchComponent [source]
Bases: BaseComponent, Module, Generic[BatchOutput, BatchInput]
A TorchComponent is a Component that can be trained and inherits torch.nn.Module. You can use it either as a torch module inside a more complex neural network, or as a standalone component in a Pipeline.
In addition to the methods of a torch module, a TorchComponent adds a few methods to handle preprocessing and collating features, as well as caching intermediate results for components that share a common subcomponent.
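The preprocess → collate → forward → postprocess cycle can be sketched with a toy, torch-free component (everything below is illustrative; a real TorchComponent works on Doc objects and tensors):

```python
class ToyLengthComponent:
    # Illustrative stand-in: "docs" are plain strings, "tensors" are lists.
    def preprocess(self, doc):
        return {"length": len(doc)}

    def preprocess_batch(self, docs):
        # Features sharing the same path are grouped into a list under one key
        return {"length": [self.preprocess(d)["length"] for d in docs]}

    def collate(self, batch):
        # Would build tensors in a real component
        return {"lengths": batch["length"]}

    def forward(self, batch):
        # The "neural network": double each length
        return {"doubled": [x * 2 for x in batch["lengths"]]}

    def postprocess(self, docs, results, inputs):
        # Write predictions back onto the documents
        return [f"{doc}:{value}" for doc, value in zip(docs, results["doubled"])]

    def batch_process(self, docs):
        inputs = self.preprocess_batch(docs)
        results = self.forward(self.collate(inputs))
        return self.postprocess(docs, results, inputs)


print(ToyLengthComponent().batch_process(["ab", "abc"]))
# → ['ab:4', 'abc:6']
```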
post_init [source]
This method completes the attributes of the component, by looking at some documents. It is especially useful to build vocabularies or detect the labels of a classification task.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
gold_data | The documents to use for initialization. TYPE: |
exclude | The names of components to exclude from initialization. This argument will be gradually updated with the names of initialized components TYPE: |
preprocess [source]
Preprocess the document to extract features that will be used by the neural network and its subcomponents to perform predictions.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
doc | Document to preprocess TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any] | Dictionary (optionally nested) containing the features extracted from the document. |
collate [source]
Collate the batch of features into a single batch of tensors that can be used by the forward method of the component.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch | Batch of features TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
BatchInput | Dictionary (optionally nested) containing the collated tensors |
batch_to_device [source]
Move the batch of tensors to the specified device.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch | Batch of tensors TYPE: |
device | Device to move the tensors to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
BatchInput | |
forward [source]
Perform the forward pass of the neural network.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch | Batch of tensors (nested dictionary) computed by the collate method TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
BatchOutput | |
compute_training_metrics [source]
Compute post-gather metrics on the batch output. This is a no-op by default. This is useful to compute averages when doing multi-gpu training or mini-batch accumulation since full denominators are not known during the forward pass.
module_forward [source]
This is a wrapper around torch.nn.Module.__call__ to avoid conflict with the component's __call__ method.
preprocess_batch [source]
Convenience method to preprocess a batch of documents. Features corresponding to the same path are grouped together in a list, under the same key.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
docs | Batch of documents TYPE: |
supervision | Whether to extract supervision features or not DEFAULT: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Sequence[Any]] | The batch of features |
prepare_batch [source]
Convenience method to preprocess a batch of documents and collate them. Features corresponding to the same path are grouped together in a list, under the same key.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
docs | Batch of documents TYPE: |
supervision | Whether to extract supervision features or not TYPE: |
device | Device to move the tensors to TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Sequence[Any]] | |
batch_process [source]
Process a batch of documents using the neural network. This differs from the pipe method in that it does not return an iterator, but executes the component on the whole batch at once.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
docs | Batch of documents TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Sequence[Doc] | Batch of updated documents |
postprocess [source]
Update the documents with the predictions of the neural network. By default, this is a no-op.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
docs | List of documents to update TYPE: |
results | Batch of predictions, as returned by the forward method TYPE: |
inputs | List of preprocessed features, as returned by the preprocess method TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Sequence[Doc] | |
preprocess_supervised [source]
Preprocess the document to extract features that will be used by the neural network to perform its training. By default, this returns the same features as the preprocess method.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
doc | Document to preprocess TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, Any] | Dictionary (optionally nested) containing the features extracted from the document. |
pipe [source]
Applies the component on a collection of documents. It is recommended to use the Pipeline.pipe method instead of this one to apply a pipeline on a collection of documents, to benefit from the caching of intermediate results.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
docs | Input docs TYPE: |
batch_size | The batch size to use when grouping documents to be processed at once DEFAULT: |
LinearSchedule
Bases: Schedule
Linear schedule for a parameter group. The schedule will linearly increase the value from start_value to max_value in the first warmup_rate of the total_steps and then linearly decrease it to end_value.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
total_steps | The total number of steps, usually used to calculate ratios. TYPE: |
max_value | The maximum value to reach. TYPE: |
start_value | The initial value. TYPE: |
path | The path to the attribute to set. TYPE: |
warmup_rate | The rate of the warmup. TYPE: |
end_value | The final value to reach after the decay phase. Defaults to 0.0. TYPE: |
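The warmup-then-decay shape can be sketched as a pure function of the step (a hypothetical re-implementation of the documented behavior, not the actual class):

```python
def linear_schedule(step, total_steps, max_value, start_value=0.0, end_value=0.0, warmup_rate=0.1):
    # Linear warmup from start_value to max_value over the first
    # warmup_rate * total_steps steps, then linear decay to end_value.
    warmup_steps = warmup_rate * total_steps
    if step < warmup_steps:
        return start_value + (max_value - start_value) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_value + (end_value - max_value) * progress


sched = [linear_schedule(s, total_steps=100, max_value=1.0, warmup_rate=0.2) for s in (0, 10, 20, 60, 100)]
print(sched)
# → [0.0, 0.5, 1.0, 0.5, 0.0]
```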
ScheduledOptimizer
Bases: Optimizer
Wrapper optimizer that supports schedules for the parameters and easy parameter selection using the key of the groups dictionary as regex patterns to match the parameter names.
Schedules are defined directly in the groups, in place of the scheduled value.
Examples
optim = ScheduledOptimizer(
cls="adamw",
module=model,
groups=[
# Exclude all parameters matching 'bias' from optimization.
{
"selector": "bias",
"exclude": True,
},
# Parameters of the NER module's embedding receive this learning rate
# schedule. If a parameter matches both 'transformer' and 'ner',
# the first group settings take precedence due to the order.
{
"selector": "^ner[.]embedding"
"lr": {
"@schedules": "linear",
"start_value": 0.0,
"max_value": 5e-4,
"warmup_rate": 0.2,
},
},
# Parameters starting with 'ner' receive this learning rate schedule,
# unless a 'lr' value has already been set by an earlier selector.
{
"selector": "^ner"
"lr": {
"@schedules": "linear",
"start_value": 0.0,
"max_value": 1e-4,
"warmup_rate": 0.2,
},
},
# Apply a weight_decay of 0.01 to all parameters not excluded.
# This setting doesn't conflict with others and applies to all.
{
"selector": "",
"weight_decay": 0.01,
},
],
total_steps=1000,
)
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
optim | The optimizer to use. If a string (like "adamw") or a type to instantiate, the TYPE: |
module | The module to optimize. Usually the TYPE: |
total_steps | The total number of steps, used for schedules. TYPE: |
groups | The groups to optimize. Each group is a dictionary containing:
The matching is performed by running TYPE: |
GenericScorer
A scorer to evaluate the model performance on various tasks.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
batch_size | The batch size to use for scoring. Can be an int (number of documents) or a string (batching expression like "2000 words"). TYPE: |
speed | Whether to compute the model speed (words/documents per second) TYPE: |
autocast | Whether to use autocasting for mixed precision during the evaluation, defaults to True. TYPE: |
metrics | A keyword arguments mapping of metric names to metrics objects. See the metrics documentation for more info. DEFAULT: |
TrainingData
A training data object.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
data | The stream of documents to train on. The documents will be preprocessed and collated according to the pipeline's components. TYPE: |
batch_size | The batch size. Can be a batching expression like "2000 words", an int (number of documents), or a tuple (batch_size, batch_by). The batch_by argument should be a statistic produced by the pipes that will be trained. For instance, the TYPE: |
shuffle | The shuffle strategy. Can be "dataset" to shuffle the entire dataset (this can be memory-intensive for large file based datasets), "fragment" to shuffle the fragment-based datasets like parquet files, or a batching expression like "2000 words" to shuffle the dataset in chunks of 2000 words. TYPE: |
sub_batch_size | How to split each batch into sub-batches that will be fed to the model independently to accumulate gradients over. To split a batch of 8000 tokens into smaller batches of 1000 tokens each, just set this to "1000 tokens". You can also request a number of splits, like "4 splits", to split the batch into N parts each close to (but less than) batch_size / N. TYPE: |
pipe_names | The names of the pipes that should be trained on this data. If None, defaults to all trainable pipes. TYPE: |
post_init | Whether to call the pipeline's post_init method with the data before training. TYPE: |
stat_batchify [source]
Create a batching function that uses the value of a specific key in the items to determine the batch size. This function is primarily meant to be used on the flattened outputs of the preprocess method of a Pipeline object.
It expects the items to be dictionaries in which some keys contain the string "/stats/" followed by the key pattern. For instance:
from edsnlp.utils.batching import stat_batchify
items = [
{"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
{"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
{"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
]
batcher = stat_batchify("words")
assert list(batcher(items, 4)) == [
[
{"text": "first sample", "obj/stats/words": 2, "obj/stats/chars": 12},
{"text": "dos", "obj/stats/words": 1, "obj/stats/chars": 3},
],
[
{"text": "third one !", "obj/stats/words": 3, "obj/stats/chars": 11},
],
]
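A greedy batcher consistent with the behavior above could look like this (a sketch assuming a batch is closed once the running total would exceed the limit; this is not the actual implementation):

```python
def stat_batchify(key):
    def batcher(items, batch_size):
        batch, total = [], 0
        for item in items:
            # Look up the first "/stats/" key ending with the requested pattern
            stat_key = next(k for k in item if "/stats/" in k and k.endswith(key))
            value = item[stat_key]
            if batch and total + value > batch_size:
                yield batch
                batch, total = [], 0
            batch.append(item)
            total += value
        if batch:
            yield batch

    return batcher


batcher = stat_batchify("words")
items = [{"obj/stats/words": n} for n in (2, 1, 3)]
print([len(b) for b in batcher(items, 4)])
# → [2, 1]
```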
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
key | The key pattern to use to determine the actual key to look up in the items. |
| RETURNS | DESCRIPTION |
|---|---|
Callable[[Iterable, int, bool, Literal["drop", "split"]], Iterable | |
decompress_dict [source]
Decompress a dictionary of lists into a sequence of dictionaries. This function assumes that the dictionary structure was obtained using the batch_compress_dict class. Keys that were merged into a single string using the "|" character as a separator will be split into a nested dictionary structure.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
seq | The dictionary to decompress or a sequence of dictionaries to decompress TYPE: |
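The key-splitting described above can be sketched as follows ("|"-separated keys become nested dictionaries; this mirrors the documented behavior but is not the actual implementation):

```python
def split_compressed_keys(flat):
    # Turn {"a|b": v} into {"a": {"b": v}}
    out = {}
    for key, value in flat.items():
        node = out
        *parents, last = key.split("|")
        for part in parents:
            node = node.setdefault(part, {})
        node[last] = value
    return out


print(split_compressed_keys({"ner|embedding": [1, 2], "ner|labels": ["DIS"]}))
# → {'ner': {'embedding': [1, 2], 'labels': ['DIS']}}
```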
ld_to_dl [source]
Convert a list of dictionaries to a dictionary of lists
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
ld | The list of dictionaries TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
Dict[str, List[T]] | The dictionary of lists |
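Its behavior can be sketched in a couple of lines (assuming, as in the library's use, that all dictionaries share the same keys):

```python
def ld_to_dl(ld):
    # List of dicts -> dict of lists, assuming all dicts share the same keys
    ld = list(ld)
    return {key: [d[key] for d in ld] for key in (ld[0] if ld else {})}


print(ld_to_dl([{"a": 1, "b": 2}, {"a": 3, "b": 4}]))
# → {'a': [1, 3], 'b': [2, 4]}
```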