`edsnlp.matchers.simstring`

`SimstringWriter`

A context manager class for writing a simstring database.

PARAMETER	DESCRIPTION
`path`	Path to database TYPE: `Union[str, Path]`

PhraseMatcher that can skip excluded tokens. Heavily inspired by QuickUMLS: https://github.com/Georgetown-IR-Lab/QuickUMLS

PARAMETER	DESCRIPTION
`vocab`	spaCy vocabulary to match on. TYPE: `Vocab`
`path`	Path where we will store the precomputed patterns TYPE: `Optional[Union[Path, str]]` DEFAULT: `None`
`measure`	Name of the similarity measure. One of [jaccard, dice, overlap, cosine] TYPE: `SimilarityMeasure` DEFAULT: `dice`
`windows`	Maximum number of words in a candidate span TYPE: `int` DEFAULT: `5`
`threshold`	Minimum similarity value to match a concept's synonym TYPE: `float` DEFAULT: `0.75`
`ignore_excluded`	Whether to exclude tokens that have an EXCLUDED tag, by default False TYPE: `Optional[bool]` DEFAULT: `False`
`ignore_space_tokens`	Whether to exclude tokens that have a "SPACE" tag, by default False TYPE: `Optional[bool]` DEFAULT: `False`
`attr`	Default attribute to match on. Can be overridden in the `add` method. To match on a custom attribute, prepend the attribute name with `_`. TYPE: `str` DEFAULT: `'NORM'`
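
The interplay of `measure`, `windows` and `threshold` can be illustrated with a minimal, library-independent sketch. It is not the edsnlp implementation: the helper names (`char_ngrams`, `similarity`, `candidate_spans`, `fuzzy_match`), the use of space-padded character trigrams, and the brute-force comparison against every term are all assumptions made for illustration. A simstring database exists to avoid that exhaustive scan.

```python
from math import sqrt
from typing import Iterator, List, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of `text`, padded with spaces (an assumption of this sketch)."""
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str, measure: str = "dice") -> float:
    """Set-based similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(a), char_ngrams(b)
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    if measure == "cosine":
        return common / sqrt(len(x) * len(y))
    raise ValueError(f"Unknown measure: {measure}")


def candidate_spans(words: List[str], windows: int = 5) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, text) for every span of at most `windows` words."""
    for start in range(len(words)):
        for end in range(start + 1, min(start + windows, len(words)) + 1):
            yield start, end, " ".join(words[start:end])


def fuzzy_match(
    words: List[str],
    terms: List[str],
    measure: str = "dice",
    windows: int = 5,
    threshold: float = 0.75,
) -> List[Tuple[int, int, str, float]]:
    """Keep the (span, term) pairs whose similarity reaches `threshold`."""
    matches = []
    for start, end, text in candidate_spans(words, windows):
        for term in terms:
            score = similarity(text, term, measure)
            if score >= threshold:
                matches.append((start, end, term, score))
    return matches


words = ["le", "patient", "a", "une", "pneumopatie"]  # misspelled on purpose
matches = fuzzy_match(words, ["pneumopathie"], measure="dice", threshold=0.75)
```

In this sketch the misspelled word still reaches the default `dice` threshold of 0.75, while the stricter `jaccard` measure would reject it at the same threshold. This is why the choice of measure and threshold are best tuned together.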

Builds patterns and adds them for matching.

PARAMETER	DESCRIPTION
`nlp`	The instance of the spaCy language class. TYPE: `PipelineProtocol`
`terms`	Dictionary of label/terms, or label/dictionary of terms/attribute. TYPE: `Patterns`
`progress`	Whether to track progress when preprocessing terms TYPE: `bool` DEFAULT: `False`

Align different representations of a Doc or Span object.

PARAMETER	DESCRIPTION
`doclike`	spaCy `Doc` or `Span` object TYPE: `Doc`
`attr`	Attribute to use, by default `"TEXT"` TYPE: `str` DEFAULT: `'TEXT'`
`ignore_excluded`	Whether to remove excluded tokens, by default True TYPE: `bool` DEFAULT: `True`
`ignore_space_tokens`	Whether to remove space tokens TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`Tuple[str, List[Tuple[int, int, int, int]]]`	The new clean text and, for each word, an offset tuple giving the begin char index of the word in the new text, the end char index of its preceding word, and the begin / end indices of the word in the original document
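
The shape of the returned value can be sketched on a plain string. This is a hypothetical stand-in, not the edsnlp function: `clean_text_and_offsets` is an invented name, "cleaning" here only collapses whitespace runs into single spaces, and the last two entries of each tuple are character positions in the original string, whereas the real function works on spaCy `Doc` or `Span` objects and may index the original differently.

```python
import re
from typing import List, Tuple


def clean_text_and_offsets(text: str) -> Tuple[str, List[Tuple[int, int, int, int]]]:
    """Collapse whitespace and record, for each word, how it maps back to `text`.

    Each tuple holds: begin of the word in the clean text, end of the
    preceding word in the clean text, begin / end of the word in `text`.
    """
    pieces: List[str] = []
    offsets: List[Tuple[int, int, int, int]] = []
    cursor = 0    # current length of the clean text
    prev_end = 0  # end of the preceding word in the clean text
    for match in re.finditer(r"\S+", text):
        if pieces:
            cursor += 1  # account for the single separating space
        offsets.append((cursor, prev_end, match.start(), match.end()))
        pieces.append(match.group())
        cursor += len(match.group())
        prev_end = cursor
    return " ".join(pieces), offsets


clean, offsets = clean_text_and_offsets("le  patient\n tousse")
```

With offsets of this kind, a match found at character positions in the clean text can be mapped back to the original document by looking up the tuples whose clean-text begin index falls inside the match.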