stringcompare package

class stringcompare.CharacterDifference

Bases: stringcompare.distance._distance.StringComparator

Character difference between two strings.

This is the number of characters differing between two strings. The distance may be normalized or returned as a similarity score instead.

Parameters

normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.

compare(self: stringcompare.distance._distance.CharacterDifference, arg0: str, arg1: str) → float

class stringcompare.Comparator

Bases: pybind11_builtins.pybind11_object

Abstract base class for pybind11 comparator objects.

Provides the compare() function for comparison of two elements, the elementwise() function for elementwise comparison between two lists, and the pairwise() function for pairwise comparison between elements of two lists.

Options for the comparison functions (e.g. whether to return a distance or a similarity, whether to normalize, weights, etc.) should be passed to the constructor.

The current class structure, implemented in C++, is as follows:

Comparator
─┬────────
 ├─► compare()
 │
 ├─► elementwise()
 │
 ├─► pairwise()
 │
 │StringComparator
 └─┬──────────────
   │
   │ Levenshtein
   ├────────────
   │
   │ DamerauLevenshtein
   ├───────────────────
   │
   │ LCSDistance
   ├────────────
   │
   │ Jaro
   ├─────
   │
   │ JaroWinkler
   ├────────────
   │
   │ CharacterDifference
   ├────────────────────
   │
   │ Hamming
   └────────

See also

StringComparator, NumericComparator

compare(self: stringcompare.distance._distance.Comparator, arg0: object, arg1: object) → float

Comparison between two elements.

Parameters

arg0 – Object to compare from.
arg1 – Object to compare to.

Returns

Numeric value of the comparison.

elementwise(self: stringcompare.distance._distance.Comparator, arg0: List[object], arg1: List[object]) → numpy.ndarray[numpy.float64]

Elementwise comparison between two lists.

Parameters

arg0 – List of objects to compare from.
arg1 – List of objects to compare to.

Returns

Numpy array containing comparison values.

Note

The two lists arg0 and arg1 should be of the same length.
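The elementwise semantics can be sketched in plain Python. This is an illustration only, not the C++ implementation: `elementwise_sketch` and `toy_compare` are hypothetical names, a plain list stands in for the numpy array, and raising on unequal lengths is an assumption of the sketch (the note only says the lengths should match).

```python
from typing import Callable, List, Sequence


def elementwise_sketch(compare: Callable[[object, object], float],
                       left: Sequence[object],
                       right: Sequence[object]) -> List[float]:
    """Compare left[i] with right[i] for every index i."""
    if len(left) != len(right):
        # Assumption: the sketch rejects mismatched lengths outright.
        raise ValueError("both lists must have the same length")
    return [compare(a, b) for a, b in zip(left, right)]


def toy_compare(a: str, b: str) -> float:
    """Hypothetical comparator: 0.0 for equal strings, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


result = elementwise_sketch(toy_compare, ["a", "ab"], ["a", "ba"])
print(result)  # [0.0, 1.0]
```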

pairwise(self: stringcompare.distance._distance.Comparator, arg0: List[object], arg1: List[object]) → numpy.ndarray[numpy.float64]

Pairwise comparison between two lists.

Parameters

arg0 – List of objects to compare from.
arg1 – List of objects to compare to.

Returns

2D numpy array of shape (len(arg0), len(arg1)) containing comparison values, where each row corresponds to an element of arg0 and each column corresponds to an element of arg1.
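A plain-Python sketch of the pairwise layout, with one row per element of arg0 and one column per element of arg1. The names `pairwise_sketch` and `length_gap` are hypothetical, and nested lists stand in for the numpy array; this is not the library's implementation.

```python
from typing import Callable, List, Sequence


def pairwise_sketch(compare: Callable[[object, object], float],
                    left: Sequence[object],
                    right: Sequence[object]) -> List[List[float]]:
    """Compare every element of left with every element of right."""
    return [[compare(a, b) for b in right] for a in left]


def length_gap(a: str, b: str) -> float:
    """Hypothetical comparator: absolute difference in string length."""
    return float(abs(len(a) - len(b)))


matrix = pairwise_sketch(length_gap, ["a", "abc"], ["b", "bc", "bcde"])
print(matrix)  # [[0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
```

With two elements on the left and three on the right, the result has two rows and three columns.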

class stringcompare.DamerauLevenshtein

Bases: stringcompare.distance._distance.StringComparator

Damerau-Levenshtein distance

This is the minimum number of insertions, deletions, substitutions or transpositions required to change one word into the other. The distance may be normalized or returned as a similarity score instead.

Parameters

normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
dmat_size – Length of the internal string buffer. Should be set higher than the maximum string length if check_bounds is set to False. Defaults to 100.
check_bounds – Whether or not to check if string lengths exceed internal buffer size and resize accordingly. Set to False for more efficiency, as long as dmat_size is higher than the maximal string length on which this Comparator object will be called. Defaults to True for safety.
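For intuition, here is a minimal pure-Python sketch of an edit distance that allows adjacent transpositions. It implements the optimal string alignment variant, a restricted form of Damerau-Levenshtein; which variant the library uses is not stated here, so treat the choice as an assumption. The function name `osa_distance` is hypothetical, and the sketch returns the raw count without normalization.

```python
def osa_distance(s: str, t: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(s) + 1, len(t) + 1
    # d[i][j] holds the distance between s[:i] and t[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, e.g. "li" <-> "il".
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("Olivier", "Oilvier"))  # 1
```

The swapped pair counts as a single edit here, whereas plain Levenshtein distance counts it as two substitutions.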

compare(self: stringcompare.distance._distance.DamerauLevenshtein, arg0: str, arg1: str) → float

class stringcompare.DeepparseAddressTagger(deepparse_handle)[source]

Bases: stringcompare.preprocessing.tagger.Tagger

LABELS = ['StreetNumber', 'StreetName', 'Unit', 'Municipality', 'Province', 'PostalCode', 'Orientation', 'GeneralDelivery']

batch_tag(objs: List) → List[Dict][source]

tag(obj) → Dict[source]

class stringcompare.DelimTokenizer(delim=' ')[source]

Bases: stringcompare.preprocessing.tokenizer.Tokenizer

tokenize(sentence)[source]

class stringcompare.Hamming

Bases: stringcompare.distance._distance.StringComparator

Hamming distance between two strings.

This is the number of differences between corresponding characters in the strings.

Parameters

normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
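A minimal sketch of the unnormalized Hamming distance in plain Python. The name `hamming_sketch` is hypothetical, and raising on unequal lengths is an assumption of the sketch: how the library treats strings of different lengths is not stated here.

```python
def hamming_sketch(s: str, t: str) -> int:
    """Count positions at which the corresponding characters differ."""
    if len(s) != len(t):
        # Assumption: the classic definition only covers equal lengths.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))


print(hamming_sketch("karolin", "kathrin"))  # 3
```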

compare(self: stringcompare.distance._distance.Hamming, arg0: str, arg1: str) → float

class stringcompare.Jaro

Bases: stringcompare.distance._distance.StringComparator

Jaro distance

Parameters

similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
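As background, the textbook Jaro similarity can be sketched as follows; the corresponding distance is one minus this value. This is a pure-Python illustration under the usual definition (match window of half the longer length minus one), not the library's C++ code, and `jaro_similarity` is a hypothetical name.

```python
def jaro_similarity(s: str, t: str) -> float:
    """Textbook Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s)
            + matches / len(t)
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # 0.9444
```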

compare(self: stringcompare.distance._distance.Jaro, arg0: str, arg1: str) → float

class stringcompare.JaroWinkler

Bases: stringcompare.distance._distance.StringComparator

Jaro-Winkler distance

Parameters

similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
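The Winkler adjustment rewards a shared prefix on top of a Jaro similarity. The sketch below takes an already computed Jaro similarity as input; `winkler_boost` is a hypothetical helper, and the scaling factor of 0.1 and prefix cap of 4 are the conventional textbook values, assumed here rather than taken from the library.

```python
def winkler_boost(jaro_sim: float, s: str, t: str,
                  scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Raise a Jaro similarity according to the length of the common prefix."""
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro_sim + prefix * scaling * (1.0 - jaro_sim)


# 17/18 is the textbook Jaro similarity of "MARTHA" and "MARHTA".
boosted = winkler_boost(17 / 18, "MARTHA", "MARHTA")
print(round(boosted, 4))  # 0.9611
```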

compare(self: stringcompare.distance._distance.JaroWinkler, arg0: str, arg1: str) → float

class stringcompare.LCSDistance

Bases: stringcompare.distance._distance.StringComparator

Longest common subsequence (LCS) distance

This is the minimum number of insertions or deletions required to change one word into the other. The distance may be normalized or returned as a similarity score instead.

Parameters

normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
dmat_size – Length of the internal string buffer. Should be set higher than the maximum string length if check_bounds is set to False. Defaults to 100.
check_bounds – Whether or not to check if string lengths exceed internal buffer size and resize accordingly. Set to False for more efficiency, as long as dmat_size is higher than the maximal string length on which this Comparator object will be called. Defaults to True for safety.
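The relationship between the LCS length and the insert/delete distance can be sketched in pure Python: every character outside a longest common subsequence must be deleted from one string or inserted from the other. The names `lcs_length` and `lcs_distance` are hypothetical, and the sketch returns the raw count without normalization.

```python
def lcs_length(s: str, t: str) -> int:
    """Length of a longest common subsequence, using one rolling row."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(s: str, t: str) -> int:
    """Minimum number of insertions and deletions turning s into t."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


print(lcs_distance("Olivier", "Oilvier"))  # 2
```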

compare(self: stringcompare.distance._distance.LCSDistance, arg0: str, arg1: str) → float

class stringcompare.Levenshtein

Bases: stringcompare.distance._distance.StringComparator

Levenshtein distance

This is defined as the “minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other” (see Wikipedia page). The distance may be normalized or returned as a similarity score instead.

Parameters

normalize – Whether or not to normalize the result to be between 0 and 1. Defaults to True.
similarity – Whether or not to return a similarity score (higher for more similar strings) or a distance score (closer to zero for more similar strings). Defaults to False.
dmat_size – Length of the internal string buffer. Should be set higher than the maximum string length if check_bounds is set to False. Defaults to 100.
check_bounds – Whether or not to check if string lengths exceed internal buffer size and resize accordingly. Set to False for more efficiency, as long as dmat_size is higher than the maximal string length on which this Comparator object will be called. Defaults to True for safety.

Examples

>>> from stringcompare import Levenshtein
>>> lev = Levenshtein()
>>> lev("Olivier", "Oilvier") # Same as lev.compare("Olivier", "Oilvier")
0.25

>>> lev = Levenshtein(normalize=False)
>>> lev("Olivier", "Oilvier")
2.0

>>> lev = Levenshtein(normalize=False, similarity=True)
>>> lev("Olivier", "Oilvier")
6.0

>>> lev.elementwise(["a", "ab"], ["b", "ba"])
array([0.5, 1.])

>>> lev.pairwise(["a", "ab"], ["b", "ba"])
array([[0.5, 1. ],
       [1. , 1. ]])
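The numbers in these examples can be reproduced with a short pure-Python sketch. The edit distance itself is the standard dynamic program; the two scoring formulas are assumptions, chosen only because they agree with the outputs shown above, and the library's actual definitions may differ. All function names are hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Standard Levenshtein distance, using one rolling row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(s: str, t: str) -> float:
    """Assumed normalization: 2d / (len(s) + len(t) + d)."""
    d = levenshtein(s, t)
    total = len(s) + len(t) + d
    return 2 * d / total if total else 0.0


def raw_similarity(s: str, t: str) -> float:
    """Assumed unnormalized similarity: (len(s) + len(t) - d) / 2."""
    return (len(s) + len(t) - levenshtein(s, t)) / 2


print(levenshtein("Olivier", "Oilvier"))          # 2
print(normalized_distance("Olivier", "Oilvier"))  # 0.25
print(raw_similarity("Olivier", "Oilvier"))       # 6.0
```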

compare(self: stringcompare.distance._distance.Levenshtein, arg0: str, arg1: str) → float

class stringcompare.NGramTokenizer(n)[source]

Bases: stringcompare.preprocessing.tokenizer.Tokenizer

tokenize(sentence)[source]

class stringcompare.StringComparator

Bases: pybind11_builtins.pybind11_object

compare(self: stringcompare.distance._distance.StringComparator, arg0: str, arg1: str) → float

elementwise(self: stringcompare.distance._distance.StringComparator, arg0: List[str], arg1: List[str]) → numpy.ndarray[numpy.float64]

pairwise(self: stringcompare.distance._distance.StringComparator, arg0: List[str], arg1: List[str]) → numpy.ndarray[numpy.float64]

class stringcompare.Tagger[source]

Bases: abc.ABC

abstract property LABELS

batch_tag(objs: List) → List[Dict][source]

abstract tag(obj) → Dict[source]

class stringcompare.Tokenizer[source]

Bases: abc.ABC

String tokenization interface.

batch_tokenize(sentences)[source]

abstract tokenize(sentence)[source]
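A custom tokenizer would implement tokenize(). The sketch below uses a stand-in base class that mirrors the documented interface so that it runs without the package; `TokenizerSketch` and `CommaTokenizer` are hypothetical names, and the assumption that batch_tokenize() simply applies tokenize() to each sentence is not confirmed by this reference.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class TokenizerSketch(ABC):
    """Stand-in mirroring the documented interface; not the library class."""

    @abstractmethod
    def tokenize(self, sentence: str) -> List[str]:
        """Split one sentence into tokens."""

    def batch_tokenize(self, sentences: Sequence[str]) -> List[List[str]]:
        # Assumption: batch processing applies tokenize() to each sentence.
        return [self.tokenize(sentence) for sentence in sentences]


class CommaTokenizer(TokenizerSketch):
    """Hypothetical subclass that splits on commas and trims whitespace."""

    def tokenize(self, sentence: str) -> List[str]:
        return [part.strip() for part in sentence.split(",")]


tokens = CommaTokenizer().batch_tokenize(["a, b", "c"])
print(tokens)  # [['a', 'b'], ['c']]
```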

class stringcompare.WhitespaceTokenizer[source]

Bases: stringcompare.preprocessing.tokenizer.DelimTokenizer

Subpackages

stringcompare.preprocessing package