edsnlp.utils.fuzzy_alignment

DeltaCollection [source]

Bases: object

Copied from https://github.com/percevalw/nlstruct/blob/master/nlstruct/data_utils.py#L13

align [source]

Align entities between two similar but not identical documents. Entities of the old document are matched to positions in the new document based on string matching and context similarity, and the new document is annotated with the matched entities.

Unlike approaches relying on diffing algorithms, this method can handle insertions, deletions, and swaps of text blocks.

This method was developed during our work on edspdf, to transfer annotations from datasets annotated on previous versions of documents to newer versions of the same documents. There is still significant room for improvement, especially in the similarity scoring scheme.

Parameters

PARAMETER DESCRIPTION
old

The old document with entities to align.

TYPE: AnnotatedText

new

The new document to align entities to.

TYPE: AnnotatedText

sim_scheme

A list of (context size, weight) tuples to use for similarity scoring. Each tuple defines how many characters of left and right context to consider and the weight of that context in the overall similarity score.

TYPE: List[Tuple[int, float]] DEFAULT: [(20, 0.7), (50, 0.2), (100, 0.15), (400, 0.1), (1000, 0.05)]

threshold

The similarity score threshold above which a match is considered good. The similarity score itself is computed as a weighted sum of left and right similarity scores from different context sizes.

TYPE: float DEFAULT: 1.0

do_debug

Whether to print debug information.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Result
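
To make the roles of sim_scheme and threshold more concrete, here is a minimal, self-contained sketch of how a weighted context similarity could be computed. It is not the edsnlp implementation: the function names (ratio, context_score, best_match) are hypothetical, difflib.SequenceMatcher is used as a stand-in for whatever string similarity the library uses, and the sketch assumes that left and right similarities are simply added at each context size before weighting.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Span = Tuple[int, int]

# Default scheme copied from the sim_scheme parameter documented above.
SIM_SCHEME: List[Tuple[int, float]] = [
    (20, 0.7), (50, 0.2), (100, 0.15), (400, 0.1), (1000, 0.05)
]


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] (stand-in metric, an assumption of this sketch)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def context_score(
    old_text: str,
    old_span: Span,
    new_text: str,
    new_span: Span,
    sim_scheme: List[Tuple[int, float]] = SIM_SCHEME,
) -> float:
    """Weighted sum of left and right context similarities."""
    ob, oe = old_span
    nb, ne = new_span
    score = 0.0
    for size, weight in sim_scheme:
        # Up to `size` characters immediately before each span.
        left = ratio(old_text[max(0, ob - size):ob], new_text[max(0, nb - size):nb])
        # Up to `size` characters immediately after each span.
        right = ratio(old_text[oe:oe + size], new_text[ne:ne + size])
        score += weight * (left + right)
    return score


def best_match(
    old_text: str,
    old_span: Span,
    new_text: str,
    sim_scheme: List[Tuple[int, float]] = SIM_SCHEME,
    threshold: float = 1.0,
) -> Optional[Tuple[Span, float]]:
    """Return the best-scoring exact occurrence of the entity text, if any
    occurrence reaches `threshold`; otherwise return None."""
    needle = old_text[old_span[0]:old_span[1]]
    if not needle:
        return None
    best: Optional[Tuple[Span, float]] = None
    start = new_text.find(needle)
    while start != -1:
        span = (start, start + len(needle))
        score = context_score(old_text, old_span, new_text, span, sim_scheme)
        if score >= threshold and (best is None or score > best[1]):
            best = (span, score)
        start = new_text.find(needle, start + 1)
    return best
```

Under the assumptions of this sketch, a candidate whose left and right contexts are identical at every context size scores twice the sum of the weights, so the threshold should be read relative to that maximum. Because each candidate occurrence is scored independently of its position, a text block that moved elsewhere in the new document can still be matched.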