pyspellchecker API
Here you can find the full developer API for the pyspellchecker project. pyspellchecker provides a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.
SpellChecker
class spellchecker.SpellChecker(language: Union[str, Iterable[str]] = 'en', local_dictionary: Union[pathlib.Path, str, None] = None, distance: int = 2, tokenizer: Optional[Callable[[str], Iterable[str]]] = None, case_sensitive: bool = False)
The SpellChecker class encapsulates the basics needed to accomplish a simple spell checking algorithm. It is based on the work by Peter Norvig (https://norvig.com/spell-correct.html).
Parameters:
- language (str or iterable of str) – The language of the dictionary to load, or None for no dictionary. Supported languages are en, es, de, fr, pt, ru, lv, and eu. Defaults to en. A list of languages may be provided, in which case all of them are loaded.
- local_dictionary (str or pathlib.Path) – The path to a locally stored word frequency dictionary; if provided, no language dictionary is loaded.
- distance (int) – The edit distance to use. Defaults to 2.
- tokenizer (function) – The function to use to tokenize a string.
- case_sensitive (bool) – Whether to use a case-sensitive dictionary; only available when not using a language dictionary.
Note
Using a case-sensitive dictionary can make correcting words slow.
candidates(word: Union[str, bytes]) → Optional[Set[str]]
Generate possible spelling corrections for the provided word, up to an edit distance of two. Larger edit distances are computed only when needed.
Parameters: word (str) – The word for which to calculate candidate spellings
Returns: The set of words that are possible candidates, or None if there are no candidates
Return type: set
correction(word: Union[str, bytes]) → Optional[str]
The most probable correct spelling for the word
Parameters: word (str) – The word to correct
Returns: The most likely candidate, or None if no correction is present
Return type: str
distance
The maximum edit distance to calculate
Note
Valid values are 1 or 2; if an invalid value is passed, defaults to 2
Type: int
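The fallback described in the note can be pictured as a property setter that accepts only 1 or 2. The class below is a hypothetical stand-in written for illustration, not the library's code; only the rule itself (an invalid value becomes 2) comes from the text above.

```python
class DistanceHolder:
    """Hypothetical stand-in illustrating the documented fallback rule."""

    def __init__(self, distance: int = 2) -> None:
        self.distance = distance  # routed through the setter below

    @property
    def distance(self) -> int:
        return self._distance

    @distance.setter
    def distance(self, value: int) -> None:
        # Assumption: anything other than 1 or 2 silently becomes 2.
        self._distance = value if value in (1, 2) else 2


holder = DistanceHolder(distance=1)
holder.distance = 5  # invalid, so the documented default of 2 applies
print(holder.distance)
```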
edit_distance_1(word: Union[str, bytes]) → Set[str]
Compute all strings that are one edit away from word using only the letters in the corpus
Parameters: word (str) – The word for which to calculate the edit distance
Returns: The set of strings that are edit distance one from the provided word
Return type: set
edit_distance_2(word: Union[str, bytes]) → List[str]
Compute all strings that are two edits away from word using only the letters in the corpus
Parameters: word (str) – The word for which to calculate the edit distance
Returns: The list of strings that are edit distance two from the provided word
Return type: list
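How candidates, correction, edit_distance_1 and edit_distance_2 fit together can be sketched along the lines of the Norvig essay the class cites. This is a minimal stand-alone illustration, not the library's implementation: the corpus, the counts, and the names WORDS, LETTERS, edits1 and edits2 are all made up for the example, and the real class may differ in details such as case handling and how it honors distance.

```python
from collections import Counter

# Tiny hand-made corpus; the counts are illustrative only.
WORDS = Counter({"spelling": 10, "spewing": 2, "selling": 4, "the": 50})
LETTERS = set("".join(WORDS))  # "only the letters in the corpus"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Strings two edits away; a list, so duplicates are kept."""
    return [e2 for e1 in edits1(word) for e2 in edits1(e1)]


def known(words):
    """Subset of the given words that appear in the corpus."""
    return {w for w in words if w in WORDS}


def candidates(word):
    """First non-empty pool wins; edits2 runs only if edits1 finds nothing."""
    return known([word]) or known(edits1(word)) or known(edits2(word)) or None


def correction(word):
    """Most frequent candidate, or None when there are no candidates."""
    pool = candidates(word)
    return max(pool, key=WORDS.__getitem__) if pool else None


print(correction("speling"))   # spelling
print(candidates("sellng"))    # {'selling'}
print(correction("zzzzzzzz"))  # None
```

Because the pools are chained with `or`, the expensive distance-two expansion is evaluated only when the word is unknown and no known word lies one edit away.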
export(filepath: Union[pathlib.Path, str], encoding: str = 'utf-8', gzipped: bool = True) → None
Export the word frequency list for import in the future
Parameters:
- filepath (str) – The filepath to the exported dictionary
- encoding (str) – The encoding of the resulting output
- gzipped (bool) – Whether to gzip the dictionary or not
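As a rough picture of what an export and later re-import involve, the sketch below writes a word-to-count mapping as JSON, gzip-compressed on request, and reads it back. The JSON layout is an assumption inferred from load_dictionary, which is documented as reading a "json (optionally gzipped) file"; the helper names export_frequencies and read_frequencies are hypothetical and are not part of the library.

```python
import gzip
import json
import tempfile
from pathlib import Path

frequencies = {"hello": 3, "world": 1}  # illustrative data


def export_frequencies(data, filepath, encoding="utf-8", gzipped=True):
    """Write a word -> count mapping as JSON, gzip-compressed on request."""
    payload = json.dumps(data, sort_keys=True).encode(encoding)
    opener = gzip.open if gzipped else open
    with opener(filepath, "wb") as handle:
        handle.write(payload)


def read_frequencies(filepath, encoding="utf-8", gzipped=True):
    """Read the mapping back; used here only to check the round trip."""
    opener = gzip.open if gzipped else open
    with opener(filepath, "rb") as handle:
        return json.loads(handle.read().decode(encoding))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_dictionary.gz"
    export_frequencies(frequencies, target)
    restored = read_frequencies(target)

print(restored == frequencies)
```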
known(words: Iterable[Union[str, bytes]]) → Set[str]
The subset of words that appear in the dictionary of words
Parameters: words (list) – List of words to determine which are in the corpus
Returns: The set of those words from the input that are in the corpus
Return type: set
classmethod languages() → Iterable[str]
A list of all official languages supported by the library
Return type: list
split_words(text: Union[str, bytes]) → Iterable[str]
Split text into individual words using either a simple whitespace regex or the passed-in tokenizer
Parameters: text (str) – The text to split into individual words
Returns: A listing of all words in the provided text
Return type: list(str)
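The choice between a default regex and a caller-supplied tokenizer can be sketched as follows. The exact pattern the library uses is not stated here, so the `\w+` pattern and the lowercasing are assumptions, and simple_words and hyphen_aware are hypothetical names. The only thing taken from the documentation is the tokenizer's shape: a callable that takes a string and returns an iterable of strings.

```python
import re
from typing import Callable, Iterable, Optional


def simple_words(text: str) -> Iterable[str]:
    """Assumed default: lowercase the text and pull out runs of word characters."""
    return re.findall(r"\w+", text.lower())


def hyphen_aware(text: str) -> Iterable[str]:
    """Hypothetical custom tokenizer that keeps hyphenated words intact."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def split_words(
    text: str,
    tokenizer: Optional[Callable[[str], Iterable[str]]] = None,
) -> Iterable[str]:
    """Use the caller's tokenizer when one is given, else the simple regex."""
    return (tokenizer or simple_words)(text)


sentence = "A well-known Speling mistake"
print(split_words(sentence))
# ['a', 'well', 'known', 'speling', 'mistake']
print(split_words(sentence, tokenizer=hyphen_aware))
# ['a', 'well-known', 'speling', 'mistake']
```

With the library itself, a callable of this shape would be supplied through the tokenizer argument of the SpellChecker constructor shown in the class signature.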
unknown(words: Iterable[Union[str, bytes]]) → Set[str]
The subset of words that do not appear in the dictionary
Parameters: words (list) – List of words to determine which are not in the corpus
Returns: The set of those words from the input that are not in the corpus
Return type: set
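known and unknown partition the input into two sets. A minimal sketch follows, using a plain dict as the dictionary; folding the input to lower case is an assumption that mirrors the default case_sensitive=False, and the data is invented for the example.

```python
dictionary = {"the": 50, "quick": 4, "fox": 2}  # illustrative counts


def known(words):
    """Words from the input that appear in the dictionary (case-folded)."""
    return {w.lower() for w in words if w.lower() in dictionary}


def unknown(words):
    """Words from the input that do not appear in the dictionary."""
    return {w.lower() for w in words if w.lower() not in dictionary}


tokens = ["The", "quikc", "fox", "fox"]
print(known(tokens))    # contains 'the' and 'fox'; set order may vary
print(unknown(tokens))  # {'quikc'}
```

Both results are sets, so repeated input words such as the second "fox" collapse to a single entry.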
word_frequency
An encapsulation of the word frequency dictionary
Note
Not settable
Type: WordFrequency
word_usage_frequency(word: Union[str, bytes], total_words: Optional[int] = None) → float
Calculate the frequency of the provided word as seen across the entire dictionary
Parameters:
- word (str) – The word for which the word probability is calculated
- total_words (int) – The total number of words to use in the calculation; leave as the default to use the total of the whole word frequency dictionary
Returns: The probability that the word is the correct word
Return type: float
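The calculation amounts to a relative frequency: the word's count divided by a total. The sketch below shows that arithmetic on a made-up Counter. It is an illustration of the description above rather than the library's code, and the fallback to the sum of all counts when total_words is None is how the "use the default" wording is read here.

```python
from collections import Counter
from typing import Optional

counts = Counter({"the": 6, "cat": 3, "sat": 1})  # illustrative corpus


def word_usage_frequency(word: str, total_words: Optional[int] = None) -> float:
    """Relative frequency: occurrences of `word` over the chosen total."""
    total = sum(counts.values()) if total_words is None else total_words
    return counts[word] / total


print(word_usage_frequency("cat"))       # 3 / 10 = 0.3
print(word_usage_frequency("cat", 100))  # 3 / 100 = 0.03
print(word_usage_frequency("dog"))       # 0.0, unseen words count as 0
```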
WordFrequency
class spellchecker.WordFrequency(tokenizer=None, case_sensitive=False)
Store the dictionary as a word frequency list while allowing for different methods to load the data and update over time
add(word: Union[str, bytes], val: int = 1) → None
Add a word to the word frequency list
Parameters:
- word (str) – The word to add
- val (int) – The number of times to insert the word
dictionary
A counting dictionary of all words in the corpus and the number of times each has been seen
Note
Not settable
Type: Counter
items() → Generator[Tuple[str, int], None, None]
Iterator over the words in the dictionary and their counts
Yields:
- str – The next word in the dictionary
- int – The number of instances in the dictionary
Note
This is the same as dict.items()
keys() → Generator[str, None, None]
Iterator over the keys of the dictionary
Yields: str – The next key in the dictionary
Note
This is the same as spellchecker.words()
letters
The listing of all letters found within the corpus
Note
Not settable
Type: set
load_dictionary(filename: Union[pathlib.Path, str], encoding: str = 'utf-8') → None
Load in a pre-built word frequency list
Parameters:
- filename (str) – The filepath to the JSON (optionally gzipped) file to be loaded
- encoding (str) – The encoding of the dictionary
load_json(data: Dict[str, int]) → None
Load in a pre-built word frequency list
Parameters: data (dict) – The dictionary to be loaded
load_text(text: Union[str, bytes], tokenizer: Optional[Callable[[str], Iterable[str]]] = None) → None
Load text from which to generate a word frequency list
Parameters:
- text (str) – The text to be loaded
- tokenizer (function) – The function to use to tokenize a string
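Generating a word frequency list from text comes down to tokenizing and counting. The sketch below accumulates counts in a collections.Counter across several calls. The whitespace-splitting default and the module-level frequency counter are assumptions made for the illustration; the library keeps its own state and default tokenizer.

```python
from collections import Counter

frequency = Counter()


def load_text(text, tokenizer=None):
    """Tokenize `text` and add every token's occurrences to the counter."""
    # Assumed default: lowercase, then split on whitespace.
    split = tokenizer or (lambda s: s.lower().split())
    frequency.update(split(text))


load_text("the cat saw the other cat")
load_text("CAT;DOG", tokenizer=lambda s: s.lower().split(";"))
print(frequency.most_common(2))  # [('cat', 3), ('the', 2)]
```

Each call adds to the existing counts rather than replacing them, which matches the idea of a list that is updated over time.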
load_text_file(filename: Union[pathlib.Path, str], encoding: str = 'utf-8', tokenizer: Optional[Callable[[str], Iterable[str]]] = None) → None
Load in a text file from which to generate a word frequency list
Parameters:
- filename (str) – The filepath to the text file to be loaded
- encoding (str) – The encoding of the text file
- tokenizer (function) – The function to use to tokenize a string
load_words(words: Iterable[Union[str, bytes]]) → None
Load a list of words from which to generate a word frequency list
Parameters: words (list) – The list of words to be loaded
longest_word_length
The longest word length in the dictionary
Note
Not settable
Type: int
pop(key: Union[str, bytes], default: Optional[int] = None) → int
Remove the key and return the associated value, or default if the key is not found
Parameters:
- key (str) – The key to remove
- default (obj) – The value to return if key is not present
remove(word: Union[str, bytes]) → None
Remove a word from the word frequency list
Parameters: word (str) – The word to remove
remove_by_threshold(threshold: int = 5) → None
Remove all words at, or below, the provided threshold
Parameters: threshold (int) – The threshold at which a word is to be removed
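"At, or below" means the comparison is inclusive: a word seen exactly threshold times is removed as well. A minimal sketch on a made-up Counter, written as a free function rather than the library's method:

```python
from collections import Counter

frequency = Counter({"the": 120, "cat": 9, "teh": 5, "xqz": 1})


def remove_by_threshold(counts, threshold=5):
    """Drop every word seen `threshold` times or fewer, in place."""
    # Collect the keys first so the mapping is not mutated while iterating.
    for word in [w for w, n in counts.items() if n <= threshold]:
        del counts[word]


remove_by_threshold(frequency)
print(sorted(frequency))  # ['cat', 'the']
```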
remove_words(words: Iterable[Union[str, bytes]]) → None
Remove a list of words from the word frequency list
Parameters: words (list) – The list of words to remove
tokenize(text: Union[str, bytes]) → Generator[str, None, None]
Tokenize the provided string object into individual words
Parameters: text (str) – The string object to tokenize
Yields: str – The next word in the tokenized string
Note
This is the same as the spellchecker.split_words() unless a tokenizer function was provided.
total_words
The sum of all word occurrences in the word frequency dictionary
Note
Not settable
Type: int
unique_words
The total number of unique words in the word frequency list
Note
Not settable
Type: int
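The read-only summary attributes (letters, longest_word_length, total_words and unique_words) can all be derived from the counting dictionary. The sketch below computes each one from a made-up Counter to show how they relate; it is not how the library stores or updates them.

```python
from collections import Counter

dictionary = Counter({"apple": 3, "fig": 1, "banana": 2})  # illustrative data

total_words = sum(dictionary.values())                      # 3 + 1 + 2 = 6
unique_words = len(dictionary)                              # 3 distinct keys
longest_word_length = max(map(len, dictionary), default=0)  # len("banana") = 6
letters = set("".join(dictionary))                          # every character in any word

print(total_words, unique_words, longest_word_length, sorted(letters))
```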