# Transformer
The `eds.transformer` component is a wrapper around HuggingFace's transformers library. If you are not familiar with transformers, a good way to start is the Illustrated Transformer tutorial.
Compared to using the raw HuggingFace model, we offer a simple mechanism to split long documents into strided windows before feeding them to the model.
## Windowing
EDS-NLP's Transformer component splits long documents into smaller windows before feeding them to the model. This avoids exceeding the maximum number of tokens that the model can process on a single device. The window size and stride can be configured with the window and stride parameters. The default values are 512 and 256 respectively, meaning that the model processes windows of 512 tokens whose starts are 256 tokens apart, so consecutive windows overlap by 256 tokens. Whenever a token appears in multiple windows, the embedding of its "most contextualized" occurrence is used, i.e. the occurrence that is closest to the center of its window.
Here is an overview of how this works to produce embeddings (shown in red) for each word of the document:
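To make the overlap arithmetic concrete, here is a minimal plain-Python sketch of the windowing logic described above. This is an illustration, not EDS-NLP's actual implementation: windows start every `stride` tokens, and each token's embedding is taken from the occurrence closest to its window's center.

```python
# Minimal sketch of the windowing logic (illustration only, not EDS-NLP code).

def split_windows(n_tokens: int, window: int, stride: int) -> list[int]:
    """Start offsets of the windows covering n_tokens tokens."""
    starts = [0]
    while starts[-1] + window < n_tokens:
        starts.append(starts[-1] + stride)
    return starts

def best_window(token: int, starts: list[int], window: int) -> int:
    """Index of the covering window whose center is closest to `token`."""
    covering = [i for i, s in enumerate(starts) if s <= token < s + window]
    return min(covering, key=lambda i: abs(token - (starts[i] + (window - 1) / 2)))

# 10 tokens, windows of 6 with stride 4 -> windows start at offsets 0 and 4;
# tokens 0-4 are embedded by the first window, tokens 5-9 by the second.
starts = split_windows(10, window=6, stride=4)
print(starts)                                          # [0, 4]
print([best_window(t, starts, 6) for t in range(10)])  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```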
## Examples
Here is an example of how to define a pipeline with a Transformer component:
```python
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.transformer(
        model="prajjwal1/bert-tiny",
        window=128,
        stride=96,
    ),
)

doc1 = nlp.make_doc("My name is Michael.")
doc2 = nlp.make_doc("And I am the best boss in the world.")
prep = nlp.pipes.transformer.preprocess_batch([doc1, doc2])
batch = nlp.pipes.transformer.collate(prep)
res = nlp.pipes.transformer(batch)

# Embeddings are flattened by default
print(res["embeddings"].shape)
# Out: torch.Size([15, 128])

# But they can be refolded to materialize the sample dimension
print(res["embeddings"].refold("sample", "word").shape)
# Out: torch.Size([2, 10, 128])
```
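The refold call only changes the view of the data: the 15 word embeddings are regrouped into 2 samples (5 and 10 words), and the shorter sample is padded to the longest length. Here is a plain-Python sketch of that shape logic (a hypothetical helper, not FoldedTensor's actual implementation, which operates on real tensors):

```python
# Hypothetical sketch of the "refold" shape logic: regroup a flat list of
# per-word items into per-sample rows, padding each row to the longest sample.

def refold(flat: list, lengths: list[int], pad=None) -> list[list]:
    max_len = max(lengths)
    rows, i = [], 0
    for n in lengths:
        rows.append(flat[i:i + n] + [pad] * (max_len - n))
        i += n
    return rows

# 15 "embeddings" split into samples of 5 and 10 words -> shape (2, 10)
rows = refold(list(range(15)), [5, 10])
print(len(rows), len(rows[0]))  # 2 10
```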
You can compose this embedding with a task-specific component such as `eds.ner_crf`.
## Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| `nlp` | The pipeline instance |
| `name` | The component name |
| `model` | The HuggingFace model name or path |
| `window` | The window size to use when splitting long documents into smaller windows before feeding them to the Transformer model (default: 510 = 512 - 2) |
| `stride` | The stride, i.e. the distance between the starts of two consecutive windows, to use when splitting long documents (default: 96) |
| `training_stride` | If False, the stride will be set to the window size during training, meaning that there will be no overlap between windows. If True, the same stride as during inference will be used |
| `max_tokens_per_device` | The maximum number of tokens that can be processed by the model on a single device. This does not affect the results but can be used to reduce the memory usage of the model, at the cost of a longer processing time. If "auto", the component will try to estimate the maximum number of tokens that can be processed by the model on the current device at a given time |
| `new_tokens` | A list of (pattern, replacement) tuples to add to the tokenizer. The pattern should be a valid regular expression and the replacement a string. This can be used to add tokens that are not present in the original vocabulary, for instance a dedicated token for new lines |
| `quantization` | Quantization configuration to use for the model. If None, no quantization will be applied |
| `word_pooling_mode` | If "mean", the embeddings of the wordpieces corresponding to each word will be averaged to produce a single embedding per word. If False, the wordpiece embeddings will be returned as a FoldedTensor with an additional "token" dimension (default: "mean") |
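To illustrate what a token budget like max_tokens_per_device means in practice, here is a hedged sketch (not EDS-NLP's actual scheduler) that greedily packs windows into micro-batches whose total token count stays under the budget:

```python
# Hypothetical sketch: greedily pack windows into micro-batches holding at
# most `max_tokens` tokens each. EDS-NLP's actual scheduling may differ;
# this only illustrates the idea of a per-device token budget.

def pack_windows(window_lengths: list[int], max_tokens: int) -> list[list[int]]:
    batches: list[list[int]] = []
    current: list[int] = []
    current_tokens = 0
    for i, n in enumerate(window_lengths):
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

# Four windows of 512, 512, 512 and 128 tokens under a 1024-token budget:
print(pack_windows([512, 512, 512, 128], max_tokens=1024))  # [[0, 1], [2, 3]]
```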