edsnlp.matchers.simstring
SimstringWriter
A context class to write a simstring database
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to database TYPE: |
SimstringMatcher
A PhraseMatcher that can skip excluded tokens. Heavily inspired by QuickUMLS (https://github.com/Georgetown-IR-Lab/QuickUMLS).
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| vocab | spaCy vocabulary to match on. TYPE: |
| path | Path where we will store the precomputed patterns TYPE: |
| measure | Name of the similarity measure. One of `jaccard`, `dice`, `overlap`, `cosine` TYPE: |
| windows | Maximum number of words in a candidate span TYPE: |
| threshold | Minimum similarity value to match a concept's synonym TYPE: |
| ignore_excluded | Whether to exclude tokens that have an "EXCLUDED" tag, by default False TYPE: |
| ignore_space_tokens | Whether to exclude tokens that have a "SPACE" tag, by default False TYPE: |
| attr | Default attribute to match on, by default "TEXT". Can be overridden in the TYPE: |
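To make the `measure` and `threshold` parameters more concrete, here is a minimal, library-independent sketch of the four similarity measures computed over character n-grams. It does not call edsnlp or simstring: the helper names `char_ngrams` and `similarity`, the trigram size and the padding scheme are all assumptions made for illustration, and the real matcher may count features differently.

```python
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of `text`, padded with spaces.

    Hypothetical helper: padding and n-gram size are illustrative choices.
    """
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def similarity(a: str, b: str, measure: str = "jaccard", n: int = 3) -> float:
    """Compare two strings with one of the four set-based measures."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        return 0.0
    inter = len(x & y)
    if measure == "jaccard":
        return inter / len(x | y)
    if measure == "dice":
        return 2 * inter / (len(x) + len(y))
    if measure == "overlap":
        return inter / min(len(x), len(y))
    if measure == "cosine":
        return inter / sqrt(len(x) * len(y))
    raise ValueError(f"Unknown measure: {measure}")


# A candidate span is kept when its score reaches the threshold.
threshold = 0.6
is_match = similarity("asthma", "asthme", measure="dice") >= threshold
```

Under these assumptions, the same pair of strings can pass or fail a given `threshold` depending on the measure, since `jaccard` is generally the strictest of the four.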
build_patterns
Builds patterns and adds them for matching.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| nlp | The instance of the spaCy language class. TYPE: |
| terms | Dictionary of label/terms, or label/dictionary of terms/attribute. TYPE: |
| progress | Whether to track progress when preprocessing terms TYPE: |
get_text_and_offsets cached
Align different representations of a Doc or Span object.
Parameters
| PARAMETER | DESCRIPTION |
|---|---|
| doclike | spaCy TYPE: |
| attr | Attribute to use, by default TYPE: |
| ignore_excluded | Whether to remove excluded tokens, by default True TYPE: |
| ignore_space_tokens | Whether to remove space tokens, by default False TYPE: |
| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, List[Tuple[int, int, int, int]]] | The cleaned text, plus one offset tuple per word giving: the begin character index of the word in the new text, the end character index of its preceding word, and the begin and end indices of the word in the original document |
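The shape of the returned offsets can be illustrated with a small sketch on plain Python data. This is not the library's implementation: the function `text_and_offsets` and the `(text, trailing_whitespace, is_excluded)` token triples are hypothetical stand-ins for spaCy tokens, and the original-document indices are assumed here to be token positions.

```python
from typing import List, Tuple

# Hypothetical stand-in for a spaCy token: (text, trailing whitespace, is_excluded)
Token = Tuple[str, str, bool]


def text_and_offsets(
    tokens: List[Token], ignore_excluded: bool = True
) -> Tuple[str, List[Tuple[int, int, int, int]]]:
    parts = []
    offsets = []
    cursor = 0    # length of the cleaned text built so far
    prev_end = 0  # end char index of the previous kept word in the cleaned text
    for i, (text, whitespace, excluded) in enumerate(tokens):
        if ignore_excluded and excluded:
            continue  # skipped tokens leave no trace in the cleaned text
        begin = cursor
        # (begin in new text, end of preceding word, begin / end in original)
        offsets.append((begin, prev_end, i, i + 1))
        parts.append(text)
        prev_end = begin + len(text)
        parts.append(whitespace)
        cursor = prev_end + len(whitespace)
    return "".join(parts), offsets


tokens = [("Patient", " ", False), ("\n", "", True), ("has", " ", False), ("asthma", "", False)]
clean_text, offsets = text_and_offsets(tokens)
```

With such tuples, a match found at character positions in the cleaned text can be mapped back to the original document, even when excluded tokens sat between two matched words.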