edsnlp.utils.fuzzy_alignment
DeltaCollection
Bases: object
Copied from https://github.com/percevalw/nlstruct/blob/master/nlstruct/data_utils.py#L13
align
Align entities from two similar, but not identical, documents. The entities of the old document are matched to spans of the new document based on string matching and context similarity, then annotated on the new document.
Unlike approaches relying on diffing algorithms, this method can handle insertions, deletions, and swaps of text blocks.
This method was developed during our work on edspdf to transfer annotations from datasets annotated on earlier versions of documents to newer versions of the same documents. There is still significant room for improvement, especially in the similarity scoring scheme.
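To illustrate why string matching copes with swapped text blocks, here is a minimal sketch of the candidate search step. It is not the library's implementation: `find_candidates` and the example texts are hypothetical names introduced only for illustration, and the sketch assumes exact substring matching.

```python
def find_candidates(entity_text: str, new_text: str) -> list[int]:
    """Return the start offset of every occurrence of entity_text in new_text.

    Hypothetical helper, not part of edsnlp. Overlapping occurrences are kept.
    """
    starts = []
    pos = new_text.find(entity_text)
    while pos != -1:
        starts.append(pos)
        pos = new_text.find(entity_text, pos + 1)
    return starts


# The two sentences are swapped between the old and the new version.
old_text = "Patient has diabetes. No fever."
new_text = "No fever. Patient has diabetes."

# The entity is still found, whatever the order of the blocks.
candidates = find_candidates("diabetes", new_text)
```

When an entity's text occurs several times in the new document, the search returns several candidates, which is where context similarity comes in to pick the best one.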
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| old | The old document with entities to align. TYPE: |
| new | The new document to align entities to. TYPE: |
| sim_scheme | A list of (context size, weight) tuples to use for similarity scoring. Each tuple defines how many characters of left and right context to consider and the weight of that context in the overall similarity score. TYPE: |
| threshold | The similarity score threshold above which a match is considered good. The score is computed as a weighted sum of left and right similarity scores from different context sizes. TYPE: |
| do_debug | Whether to print debug information. TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
| Result | |
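The scoring described for `sim_scheme` and `threshold` can be sketched as follows. This is an illustration under stated assumptions, not the function's actual code: `context_score` is a hypothetical helper, and `difflib.SequenceMatcher` is used as a stand-in similarity measure because the source does not say which one the library uses.

```python
from difflib import SequenceMatcher


def context_score(old_text, old_start, old_end,
                  new_text, new_start, new_end, sim_scheme):
    """Weighted sum of left and right context similarities.

    Hypothetical helper. sim_scheme is a list of (context size, weight) tuples.
    """
    total = 0.0
    for size, weight in sim_scheme:
        # Characters immediately before the span in each document.
        old_left = old_text[max(0, old_start - size):old_start]
        new_left = new_text[max(0, new_start - size):new_start]
        # Characters immediately after the span in each document.
        old_right = old_text[old_end:old_end + size]
        new_right = new_text[new_end:new_end + size]
        # Assumption: SequenceMatcher.ratio() as the similarity in [0, 1].
        left = SequenceMatcher(None, old_left, new_left).ratio()
        right = SequenceMatcher(None, old_right, new_right).ratio()
        total += weight * (left + right)
    return total


old_text = "Patient has diabetes. No fever."
unrelated = "xxxxxxxxxxxxdiabetesyyyyyyyyyy"
scheme = [(10, 1.0), (5, 0.5)]  # illustrative values only

same = context_score(old_text, 12, 20, old_text, 12, 20, scheme)
different = context_score(old_text, 12, 20, unrelated, 12, 20, scheme)
```

In this sketch, a candidate would be accepted when its score exceeds `threshold`. Mixing a wide context with a narrow one lets distant agreement and immediate agreement both contribute to the decision.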