pyspellchecker API

Here you can find the full developer API for the pyspellchecker project. pyspellchecker provides a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.

SpellChecker

The SpellChecker class encapsulates the basics needed for a simple spell checking algorithm. It is based on the work of Peter Norvig (https://norvig.com/spell-correct.html).

Parameters:

language (str) – The language of the dictionary to load or None for no dictionary. Supported languages are en, es, it, de, fr, pt, ru, lv, eu, nl and fa. Defaults to en. A list of languages may be provided and all languages will be loaded.
local_dictionary (str) – The path to a locally stored word frequency dictionary; if provided, no language will be loaded
distance (int) – The edit distance to use. Defaults to 2.
case_sensitive (bool) – Whether to use a case-sensitive dictionary; only available when not using a language dictionary.

Note

Correcting words with a case-sensitive dictionary can be slow.

Raises: ValueError – If the provided language dictionary does not exist, if case_sensitive is True with a language dictionary, or if both language and local_dictionary are specified.
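
A minimal usage sketch, assuming the package is installed and importable as spellchecker. It builds a checker with no language dictionary, loads a small hypothetical word list through word_frequency, and calls unknown() and correction(). The word list and the guarded import are illustrative choices only.

```python
try:
    from spellchecker import SpellChecker  # provided by the pyspellchecker package
except ImportError:
    SpellChecker = None  # package not installed; the demonstration is skipped

results = {}
if SpellChecker is not None:
    # language=None starts from an empty dictionary; distance=1 keeps the search small
    spell = SpellChecker(language=None, distance=1)
    # Hypothetical corpus, loaded through the word_frequency property
    spell.word_frequency.load_words(["hello", "world", "hello"])
    results["unknown"] = spell.unknown(["hello", "wrld"])
    results["correction"] = spell.correction("wrld")
    print(results)
```

With a real language dictionary (the default is en), the load_words() step is unnecessary.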

candidates(word: str | bytes) → set[str] | None

Generate possible spelling corrections for the provided word, up to an edit distance of two, expanding the search only when needed

Parameters: word (str) – The word for which to calculate candidate spellings
Returns: The set of words that are possible candidates or None if there are no candidates
Return type: set

correction(word: str | bytes) → str | None

The most probable correct spelling for the word

Parameters: word (str) – The word to correct
Returns: The most likely candidate or None if no correction is present
Return type: str
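
The idea of choosing the "most probable" candidate can be sketched without the library: given a set of candidates and their corpus counts, pick the one seen most often. The counts, the candidate set, and the helper name most_probable are hypothetical; the library's own selection logic may differ in detail.

```python
from collections import Counter

# Hypothetical corpus counts and candidate set for a typo such as "thex"
word_counts = Counter({"the": 120, "they": 40, "then": 25})
candidates = {"they", "then", "the"}


def most_probable(candidates, counts):
    """Return the candidate with the highest count, or None if there are none."""
    if not candidates:
        return None
    # Sorting first makes tie-breaking deterministic (alphabetical order wins)
    return max(sorted(candidates), key=lambda word: counts[word])


best = most_probable(candidates, word_counts)
print(best)
```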

property distance: int

The maximum edit distance to calculate

Note

Valid values are 1 or 2; if an invalid value is passed, defaults to 2

Type: int

edit_distance_1(word: str | bytes) → set[str]

Compute all strings that are one edit away from word using only the letters in the corpus

Parameters: word (str) – The word for which to calculate the edit distance
Returns: The set of strings that are edit distance one from the provided word
Return type: set
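
For intuition, the following is a standalone sketch of one-edit generation in the style of the Norvig article cited above: deletions, transpositions, replacements, and insertions, with new characters drawn only from a given letter set. The function name edits_one_away and the sample letters are assumptions for illustration, not the library's code.

```python
def edits_one_away(word: str, letters: set[str]) -> set[str]:
    """Return every string one edit away from `word`, using only `letters`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {left + ch + right[1:] for left, right in splits if right for ch in letters}
    inserts = {left + ch + right for left, right in splits for ch in letters}
    return deletes | transposes | replaces | inserts


# Hypothetical corpus letters, taken from the words "hello" and "world"
corpus_letters = set("helloworld")
edits = edits_one_away("wrld", corpus_letters)
print(len(edits))
```

Restricting the alphabet to letters seen in the corpus keeps the result set smaller than it would be with a full alphabet.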

edit_distance_2(word: str | bytes) → list[str]

Compute all strings that are two edits away from word using only the letters in the corpus

Parameters: word (str) – The word for which to calculate the edit distance
Returns: The list of strings that are edit distance two from the provided word
Return type: list

export(filepath: Path | str, encoding: str = 'utf-8', gzipped: bool = True) → None

Export the word frequency list for import in the future

Parameters:

filepath (str) – The filepath to the exported dictionary
encoding (str) – The encoding of the resulting output
gzipped (bool) – Whether to gzip the dictionary or not

known(words: Iterable[str | bytes]) → set[str]

The subset of words that appear in the dictionary of words

Parameters: words (list) – List of words to determine which are in the corpus
Returns: The set of those words from the input that are in the corpus
Return type: set

classmethod languages() → Iterable[str]

Returns: A list of all official languages supported by the library
Return type: list

split_words(text: str | bytes) → Iterable[str]

Split text into individual words using either a simple whitespace regex or the passed in tokenizer

Parameters: text (str) – The text to split into individual words
Returns: A listing of all words in the provided text
Return type: list(str)

unknown(words: Iterable[str | bytes]) → set[str]

The subset of words that do not appear in the dictionary

Parameters: words (list) – List of words to determine which are not in the corpus
Returns: The set of those words from the input that are not in the corpus
Return type: set
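
Conceptually, known() and unknown() partition the input words by dictionary membership. A standalone sketch of that partition follows; the dictionary contents and the helper name are hypothetical, and lowercasing the input is an assumption that mirrors a case-insensitive dictionary.

```python
# Hypothetical word frequency dictionary
dictionary = {"hello": 3, "world": 1}


def split_known_unknown(words, dictionary):
    """Partition words into (known, unknown) sets by dictionary membership."""
    lowered = {word.lower() for word in words}  # assumes case-insensitive matching
    known = {word for word in lowered if word in dictionary}
    return known, lowered - known


known, unknown = split_known_unknown(["Hello", "wrld", "world"], dictionary)
print(known, unknown)
```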

property word_frequency: WordFrequency

An encapsulation of the word frequency dictionary

Note

Not settable

Type: WordFrequency

word_usage_frequency(word: str | bytes, total_words: int | None = None) → float

Calculate the frequency of the provided word as seen across the entire dictionary

Parameters:

word (str) – The word for which the word probability is calculated
total_words (int) – The total number of words to use in the calculation; leave as the default (None) to use the total from the whole word frequency dictionary

Returns:

The probability that the word is the correct word

Return type:

float
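
The calculation described here amounts to a word's count divided by a total. The sketch below shows it with a plain Counter. The counts and the helper name usage_frequency are hypothetical, and the library's handling of edge cases such as an empty dictionary is not shown.

```python
from collections import Counter

# Hypothetical word counts
counts = Counter({"hello": 3, "world": 1})


def usage_frequency(word, counts, total_words=None):
    """Return count(word) / total, defaulting the total to the sum of all counts."""
    total = sum(counts.values()) if total_words is None else total_words
    return counts[word] / total  # a Counter returns 0 for unseen words


default_total = usage_frequency("hello", counts)
custom_total = usage_frequency("hello", counts, total_words=8)
print(default_total, custom_total)
```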

WordFrequency

class spellchecker.WordFrequency(tokenizer: Callable[[str], Iterable[str]] | None = None, case_sensitive: bool = False)

Store the dictionary as a word frequency list while allowing for different methods to load the data and update over time

add(word: str | bytes, val: int = 1) → None

Add a word to the word frequency list

Parameters:

word (str) – The word to add
val (int) – The number of times to insert the word

property dictionary: dict[str, int]

A counting dictionary of all words in the corpus and the number of times each has been seen

Note

Not settable

Type: Counter

items() → Generator[tuple[str, int], None, None]

Iterator over the words in the dictionary and their counts

Yields:

str – The next word in the dictionary
int – The number of instances in the dictionary

Note

This is the same as dict.items()

keys() → Iterator[str]

Iterator over the keys of the dictionary

Yields: str – The next key in the dictionary

Note

This is the same as spellchecker.words()

property letters: set[str]

The listing of all letters found within the corpus

Note

Not settable

Type: set

load_dictionary(filename: Path | str, encoding: str = 'utf-8') → None

Load in a pre-built word frequency list

Parameters:

filename (str) – The filepath to the JSON (optionally gzipped) file to be loaded
encoding (str) – The encoding of the dictionary
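
To illustrate what a gzipped JSON word frequency file can look like, this standard-library sketch writes a small word-to-count mapping and reads it back. The file name and counts are hypothetical, and the exact on-disk layout the library uses is assumed here rather than confirmed.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical word counts to persist
word_counts = {"hello": 3, "world": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_dictionary.json.gz"
    # Write the mapping as gzip-compressed JSON text
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(word_counts, handle)
    # Read it back the same way
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored)
```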

load_json(data: dict[str, int]) → None

Load in a pre-built word frequency list

Parameters: data (dict) – The dictionary to be loaded

load_text(text: str | bytes, tokenizer: Callable[[str], Iterable[str]] | None = None) → None

Load text from which to generate a word frequency list

Parameters:

text (str) – The text to be loaded
tokenizer (function) – The function to use to tokenize a string
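
A tokenizer here is any callable matching Callable[[str], Iterable[str]]. The sketch below defines one hypothetical tokenizer that keeps hyphenated words together, then counts its output with a Counter to show how text becomes a word frequency list. It does not call the library.

```python
import re
from collections import Counter
from typing import Iterable


def hyphen_aware_tokenizer(text: str) -> Iterable[str]:
    """Split text into lowercase words, keeping hyphenated words intact."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


# Count the tokens to form a small word frequency list
counts = Counter(hyphen_aware_tokenizer("Well-known words; well-known typos."))
print(counts)
```

A function of this shape could be passed as the tokenizer argument wherever one is accepted.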

load_text_file(filename: Path | str, encoding: str = 'utf-8', tokenizer: Callable[[str], Iterable[str]] | None = None) → None

Load in a text file from which to generate a word frequency list

Parameters:

filename (str) – The filepath to the text file to be loaded
encoding (str) – The encoding of the text file
tokenizer (function) – The function to use to tokenize a string

load_words(words: Iterable[str | bytes]) → None

Load a list of words from which to generate a word frequency list

Parameters: words (list) – The list of words to be loaded

property longest_word_length: int

The longest word length in the dictionary

Note

Not settable

Type: int

pop(key: str | bytes, default: int | None = None) → int | None

Remove the key and return the associated value or default if not found

Parameters:

key (str) – The key to remove
default (obj) – The value to return if key is not present

Returns:

The number of instances of key, or default if key is not in the dictionary

Return type:

int | None

remove(word: str | bytes) → None

Remove a word from the word frequency list

Parameters: word (str) – The word to remove

remove_by_threshold(threshold: int = 5) → None

Remove all words at, or below, the provided threshold

Parameters: threshold (int) – The threshold at which a word is to be removed
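
The "at, or below" rule means a word is dropped when its count is less than or equal to the threshold. A standalone sketch with hypothetical counts and a hypothetical helper name:

```python
# Hypothetical word counts, including one rare misspelling
counts = {"hello": 12, "world": 5, "wrold": 1}


def drop_rare_words(counts, threshold=5):
    """Keep only words whose count is strictly above the threshold."""
    return {word: n for word, n in counts.items() if n > threshold}


kept = drop_rare_words(counts)
print(kept)
```

Note that "world", with a count equal to the threshold, is removed along with "wrold".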

remove_words(words: Iterable[str | bytes]) → None

Remove a list of words from the word frequency list

Parameters: words (list) – The list of words to remove

tokenize(text: str | bytes) → Iterator[str]

Tokenize the provided string object into individual words

Parameters: text (str) – The string object to tokenize
Yields: str – The next word in the tokenized string

Note

This is the same as spellchecker.split_words() unless a tokenizer function was provided.

property total_words: int

The sum of all word occurrences in the word frequency dictionary

Note

Not settable

Type: int

property unique_words: int

The total number of unique words in the word frequency list

Note

Not settable

Type: int

words() → Iterator[str]

Iterator over the words in the dictionary

Yields: str – The next word in the dictionary

Note

This is the same as spellchecker.keys()