
edsnlp.matchers.simstring

SimstringWriter [source]

A context manager class for writing a simstring database.

Parameters

PARAMETER DESCRIPTION
path

Path to database

TYPE: Union[str, Path]

SimstringMatcher [source]

PhraseMatcher that can skip excluded tokens. Heavily inspired by https://github.com/Georgetown-IR-Lab/QuickUMLS

Parameters

PARAMETER DESCRIPTION
vocab

spaCy vocabulary to match on.

TYPE: Vocab

path

Path where we will store the precomputed patterns

TYPE: Optional[Union[Path, str]] DEFAULT: None

measure

Name of the similarity measure. One of [jaccard, dice, overlap, cosine]

TYPE: SimilarityMeasure DEFAULT: dice

windows

Maximum number of words in a candidate span

TYPE: int DEFAULT: 5

threshold

Minimum similarity value to match a concept's synonym

TYPE: float DEFAULT: 0.75

ignore_excluded

Whether to exclude tokens that have an EXCLUDED tag, by default False

TYPE: Optional[bool] DEFAULT: False

ignore_space_tokens

Whether to exclude tokens that have a "SPACE" tag, by default False

TYPE: Optional[bool] DEFAULT: False

attr

Default attribute to match on, by default "NORM". Can be overridden in the add method. To match on a custom attribute, prepend the attribute name with _.

TYPE: str DEFAULT: 'NORM'
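
To make the measure and threshold parameters more concrete, here is a minimal sketch of the four similarity measures computed over sets of character trigrams. It is an illustration only, not the code used by SimstringMatcher: the helper names char_ngrams and similarity, the trigram size, and the space padding are all assumptions made for this example, and the underlying simstring library may extract features differently.

```python
import math


def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of `text`.

    Assumption for this sketch: the string is padded with n - 1 spaces
    on each side so that short strings still produce features.
    """
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str, measure: str = "dice", n: int = 3) -> float:
    """Compute a set-based similarity between the n-grams of two strings."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    common = len(x & y)
    if measure == "jaccard":
        return common / len(x | y)
    if measure == "dice":
        return 2 * common / (len(x) + len(y))
    if measure == "overlap":
        return common / min(len(x), len(y))
    if measure == "cosine":
        return common / math.sqrt(len(x) * len(y))
    raise ValueError(f"Unknown measure: {measure}")


def is_match(candidate: str, synonym: str, measure: str = "dice",
             threshold: float = 0.75) -> bool:
    """A candidate span matches when its similarity reaches the threshold."""
    return similarity(candidate, synonym, measure) >= threshold
```

For any pair of strings these set-based scores are ordered jaccard ≤ dice ≤ cosine ≤ overlap, so the same threshold is stricter under jaccard than under overlap.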

build_patterns [source]

Builds patterns and adds them for matching.

Parameters

PARAMETER DESCRIPTION
nlp

The instance of the spaCy language class.

TYPE: PipelineProtocol

terms

Dictionary of label/terms, or label/dictionary of terms/attribute.

TYPE: Patterns

progress

Whether to track progress when preprocessing terms

TYPE: bool DEFAULT: False

get_text_and_offsets cached

Align different representations of a Doc or Span object.

Parameters

PARAMETER DESCRIPTION
doclike

spaCy Doc or Span object

TYPE: Doc

attr

Attribute to use, by default "TEXT"

TYPE: str DEFAULT: 'TEXT'

ignore_excluded

Whether to remove excluded tokens, by default True

TYPE: bool DEFAULT: True

ignore_space_tokens

Whether to remove space tokens, by default True

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Tuple[str, List[Tuple[int, int, int, int]]]

The new clean text and, for each word, an offset tuple giving the begin char index of the word in the new text, the end char index of its preceding word, and the begin / end indices of the word in the original document.
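
The shape of the returned value can be illustrated with a small, self-contained sketch. This is not the library's implementation: the function clean_text_and_offsets below is hypothetical, it works on plain lists rather than spaCy objects, it assumes that kept words are joined by a single space, and it assumes that the last two values of each tuple are token indices in the original sequence.

```python
from typing import List, Tuple

Offset = Tuple[int, int, int, int]


def clean_text_and_offsets(
    words: List[str], skip: List[bool]
) -> Tuple[str, List[Offset]]:
    """Join the kept words and record how they align with the original.

    Each offset tuple holds:
      - the begin char index of the word in the clean text,
      - the end char index of the preceding kept word in the clean text,
      - the begin index of the word in the original sequence (assumed
        to be a token index in this sketch),
      - the end index of the word in the original sequence.
    """
    parts: List[str] = []
    offsets: List[Offset] = []
    previous_end = 0  # end char index of the last kept word
    for i, (word, skipped) in enumerate(zip(words, skip)):
        if skipped:
            # Excluded or space tokens do not appear in the clean text.
            continue
        # Assumption: a single space separates consecutive kept words.
        begin = previous_end + 1 if parts else 0
        offsets.append((begin, previous_end, i, i + 1))
        parts.append(word)
        previous_end = begin + len(word)
    return " ".join(parts), offsets


# Usage: the second token stands for an excluded or space token.
text, offsets = clean_text_and_offsets(
    ["chronic", "\n", "renal", "failure"],
    [False, True, False, False],
)
```

With such offsets, a match found at character positions in the clean text can be mapped back to a span of the original document, even when tokens were skipped in between.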