edsnlp.pipes.trainable.doc_classifier.factory
create_component = registry.factory.register('eds.doc_classifier', assigns=['doc._.predicted_class'], deprecated=[])(TrainableDocClassifier)
The eds.doc_classifier component is a trainable document-level classifier. In this context, document classification means predicting one or more categorical labels for the document as a whole (e.g. a diagnosis code, discharge status, or any metadata derived from the entire document).
Unlike span classification, where predictions are attached to spans, the document classifier attaches predictions to the Doc object itself.
Architecture
The model performs multi-head document classification by:

- Calling a word/document embedding component such as eds.doc_pooler to compute a pooled embedding for the document.
- Feeding the pooled embedding into one or more classification heads. Each head is a linear layer, optionally preceded by a head-specific hidden layer with activation, dropout, and layer normalization.
- Computing independent logits for each head.
- Training with a per-head loss (cross-entropy or focal), optionally using class weights to handle imbalance.
- Aggregating the head losses into a single training loss (simple average).
- During inference, assigning the predicted label for each head to doc._.labels[head_name].
Each classification head is independent, so different tasks (e.g. predicting ICD-10 category vs. mortality flag) can be trained jointly on the same pooled embeddings.
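The steps above can be illustrated with a minimal NumPy sketch. This is not the actual edsnlp implementation; the head sizes, gold labels, and weight initialization below are made up for illustration, and a single ReLU hidden layer stands in for the configurable activation/dropout/layer-norm stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical head configuration: head name -> number of classes
heads = {"icd10": 5, "mortality": 2}
dim, hidden = 8, 16

# One pooled document embedding (batch of 1), as a pooler would produce
pooled = rng.normal(size=(1, dim))

# Each head has its own hidden layer and linear classifier
params = {
    name: {
        "W_h": rng.normal(size=(dim, hidden)) * 0.1,
        "W_o": rng.normal(size=(hidden, n_classes)) * 0.1,
    }
    for name, n_classes in heads.items()
}

gold = {"icd10": 3, "mortality": 1}  # made-up gold labels
losses, preds = {}, {}
for name, p in params.items():
    h = np.maximum(pooled @ p["W_h"], 0.0)       # head-specific hidden layer (ReLU)
    logits = h @ p["W_o"]                        # independent logits for this head
    probs = softmax(logits)
    losses[name] = -np.log(probs[0, gold[name]]) # per-head cross-entropy
    preds[name] = int(logits.argmax(axis=-1)[0])

# Head losses are aggregated by simple averaging into one training loss
total_loss = sum(losses.values()) / len(losses)
```

Because every head only shares the pooled embedding, adding a new task amounts to adding one more entry to the head configuration.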
Examples
To create a document classifier component:
```python
import edsnlp, edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(
    eds.doc_classifier(
        label_attr=["icd10", "mortality"],
        labels={
            "icd10": "data/path_to_label_list_icd10.pkl",
            "mortality": "data/path_to_label_list_mortality.pkl",
        },
        num_classes={
            "icd10": 1000,
            "mortality": 2,
        },
        class_weights={
            "icd10": "data/path_to_class_weights_icd10.pkl",
            "mortality": "data/path_to_class_weights_mortality.pkl",
        },
        embedding=eds.doc_pooler(
            pooling_mode="attention",
            embedding=eds.transformer(
                model="almanach/camembertav2-base",
                window=256,
                stride=128,
            ),
        ),
        hidden_size=1024,
        activation_mode="relu",
        dropout_rate={
            "icd10": 0.05,
            "mortality": 0.2,
        },
        layer_norm=True,
        loss="ce",
    ),
    name="doc_classifier",
)
```
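The labels and class_weights entries above point to pickle files. The exact on-disk format expected by the component is not shown on this page; the sketch below assumes one plausible layout (a plain list of label strings, and a dict of class frequencies) and shows one common way frequencies are turned into inverse-frequency class weights.

```python
import os
import pickle
import tempfile

# Assumed contents: a label list per head, and a class frequency dict
labels_mortality = ["alive", "deceased"]
freq_mortality = {"alive": 900, "deceased": 100}

tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "label_list_mortality.pkl"), "wb") as f:
    pickle.dump(labels_mortality, f)
with open(os.path.join(tmp, "class_weights_mortality.pkl"), "wb") as f:
    pickle.dump(freq_mortality, f)

# Inverse-frequency weighting: rare classes get larger weights
total = sum(freq_mortality.values())
weights = {label: total / count for label, count in freq_mortality.items()}
```

With such weighting, the rare "deceased" class contributes more to the loss than the frequent "alive" class, which is the point of passing class_weights for imbalanced heads.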
After training, predictions are stored in the Doc object:
```python
doc = nlp("Patient was admitted with pneumonia and discharged alive.")
print(doc._.icd10, doc._.mortality)
# J18 alive
```
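If you construct the label mappings yourself instead of loading them from pickle files, label2id and id2label are plain per-head dicts that are inverses of each other. A minimal sketch, with a hypothetical mapping for the mortality head:

```python
# Hypothetical per-head mapping from label string to integer ID
label2id = {
    "mortality": {"alive": 0, "deceased": 1},
}

# id2label is just the inverse mapping, head by head
id2label = {
    head: {i: label for label, i in mapping.items()}
    for head, mapping in label2id.items()
}

pred_id = 0  # e.g. the argmax over the mortality head's logits
print(id2label["mortality"][pred_id])  # -> alive
```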
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| `nlp` | The spaCy/edsnlp pipeline the component belongs to. |
| `name` | Component name. |
| `embedding` | Embedding component (e.g. transformer + pooling). |
| `label_attr` | List of head names. Each head corresponds to a document-level attribute (e.g. `icd10`, `mortality`). |
| `num_classes` | Number of classes for each head. If not provided, inferred from `labels`. |
| `label2id` | Per-head mapping from label string to integer ID. |
| `id2label` | Reverse mapping (integer ID to label string). |
| `loss` | Loss type, either shared or per-head. |
| `labels` | Paths to pickle files containing label sets for each head. |
| `class_weights` | Paths to pickle files containing class frequency dicts (converted into class weights). |
| `hidden_size` | Hidden layer size (before the classifier), shared or per-head. If `None`, no hidden layer is used. |
| `activation_mode` | Activation function for hidden layers, shared or per-head. |
| `dropout_rate` | Dropout rate after activation, shared or per-head. |
| `layer_norm` | Whether to apply layer normalization in hidden layers, shared or per-head. |