pyspellchecker API

Here you can find the full developer API for the pyspellchecker project. pyspellchecker provides a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.

SpellChecker

class spellchecker.SpellChecker(language: str | Iterable[str] | None = 'en', local_dictionary: Path | str | None = None, distance: int = 2, tokenizer: Callable[[str], Iterable[str]] | None = None, case_sensitive: bool = False)

The SpellChecker class encapsulates the basics needed for a simple spell-checking algorithm. It is based on the work of Peter Norvig (https://norvig.com/spell-correct.html).

Parameters:
  • language (str) – The language of the dictionary to load or None for no dictionary. Supported languages are en, es, it, de, fr, pt, ru, lv, eu, and nl. Defaults to en. A list of languages may be provided and all languages will be loaded.

  • local_dictionary (str) – The path to a locally stored word frequency dictionary; if provided, no language will be loaded

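  • distance (int) – The maximum edit distance to use, either 1 or 2. Defaults to 2.

  • tokenizer (function) – The function used to split text into individual words; if None, the default tokenizer is used.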

  • case_sensitive (bool) – Whether to use a case-sensitive dictionary; only available when not using a language dictionary.

Note

Correcting words with a case-sensitive dictionary can be slow.

candidates(word: str | bytes) → Set[str] | None

Generate possible spelling corrections for the provided word. Candidates up to an edit distance of two are considered, and the larger distance is searched only when needed.

Parameters:

word (str) – The word for which to calculate candidate spellings

Returns:

The set of words that are possible candidates or None if there are no candidates

Return type:

set | None

correction(word: str | bytes) → str | None

The most probable correct spelling for the word

Parameters:

word (str) – The word to correct

Returns:

The most likely candidate or None if no correction is present

Return type:

str | None

property distance: int

The maximum edit distance to calculate

Note

Valid values are 1 or 2; if an invalid value is passed, defaults to 2

Type:

int

edit_distance_1(word: str | bytes) → Set[str]

Compute all strings that are one edit away from word using only the letters in the corpus

Parameters:

word (str) – The word for which to calculate the edit distance

Returns:

The set of strings that are edit distance one from the provided word

Return type:

set

edit_distance_2(word: str | bytes) → List[str]

Compute all strings that are two edits away from word using only the letters in the corpus

Parameters:

word (str) – The word for which to calculate the edit distance

Returns:

The list of strings that are edit distance two from the provided word

Return type:

list

export(filepath: Path | str, encoding: str = 'utf-8', gzipped: bool = True) → None

Export the word frequency list for import in the future

Parameters:
  • filepath (str) – The filepath to the exported dictionary

  • encoding (str) – The encoding of the resulting output

  • gzipped (bool) – Whether to gzip the dictionary or not

known(words: Iterable[str | bytes]) → Set[str]

The subset of words that appear in the dictionary of words

Parameters:

words (list) – List of words to determine which are in the corpus

Returns:

The set of those words from the input that are in the corpus

Return type:

set

classmethod languages() → Iterable[str]

Returns a list of all official languages supported by the library

split_words(text: str | bytes) → Iterable[str]

Split text into individual words using either a simple regex or the tokenizer passed to the constructor

Parameters:

text (str) – The text to split into individual words

Returns:

A listing of all words in the provided text

Return type:

list(str)

unknown(words: Iterable[str | bytes]) → Set[str]

The subset of words that do not appear in the dictionary

Parameters:

words (list) – List of words to determine which are not in the corpus

Returns:

The set of those words from the input that are not in the corpus

Return type:

set

property word_frequency: WordFrequency

An encapsulation of the word frequency dictionary

Note

Not settable

Type:

WordFrequency

word_usage_frequency(word: str | bytes, total_words: int | None = None) → float

Calculate the frequency of the provided word as seen across the entire dictionary

Parameters:
  • word (str) – The word for which the word probability is calculated

  • total_words (int) – The total number of words to use in the calculation; leave as None to use the total from the whole word frequency list

Returns:

The probability that the word is the correct word

Return type:

float

WordFrequency

class spellchecker.WordFrequency(tokenizer: Callable[[str], Iterable[str]] | None = None, case_sensitive: bool = False)

Store the dictionary as a word frequency list while allowing for different methods to load the data and update over time

add(word: str | bytes, val: int = 1) → None

Add a word to the word frequency list

Parameters:
  • word (str) – The word to add

  • val (int) – The number of times to insert the word

property dictionary: Dict[str, int]

A counting dictionary of all words in the corpus and the number of times each has been seen

Note

Not settable

Type:

Counter

items() → Generator[Tuple[str, int], None, None]

Iterator over the words in the dictionary

Yields:

str – The next word in the dictionary

int – The number of instances in the dictionary

Note

This is the same as dict.items()

keys() → Iterator[str]

Iterator over the keys of the dictionary

Yields:

str – The next key in the dictionary

Note

This is the same as WordFrequency.words()

property letters: Set[str]

The listing of all letters found within the corpus

Note

Not settable

Type:

set

load_dictionary(filename: Path | str, encoding: str = 'utf-8') → None

Load in a pre-built word frequency list

Parameters:
  • filename (str) – The filepath to the JSON (optionally gzipped) file to be loaded

  • encoding (str) – The encoding of the dictionary

load_json(data: Dict[str, int]) → None

Load in a pre-built word frequency list

Parameters:

data (dict) – The dictionary to be loaded

load_text(text: str | bytes, tokenizer: Callable[[str], Iterable[str]] | None = None) → None

Load text from which to generate a word frequency list

Parameters:
  • text (str) – The text to be loaded

  • tokenizer (function) – The function to use to tokenize a string

load_text_file(filename: Path | str, encoding: str = 'utf-8', tokenizer: Callable[[str], Iterable[str]] | None = None) → None

Load in a text file from which to generate a word frequency list

Parameters:
  • filename (str) – The filepath to the text file to be loaded

  • encoding (str) – The encoding of the text file

  • tokenizer (function) – The function to use to tokenize a string

load_words(words: Iterable[str | bytes]) → None

Load a list of words from which to generate a word frequency list

Parameters:

words (list) – The list of words to be loaded

property longest_word_length: int

The longest word length in the dictionary

Note

Not settable

Type:

int

pop(key: str | bytes, default: int | None = None) → int | None

Remove the key and return the associated value or default if not found

Parameters:
  • key (str) – The key to remove

  • default (obj) – The value to return if key is not present

Returns:

The number of instances of key, or default (None unless specified) if key is not in the dictionary

Return type:

int | None

remove(word: str | bytes) → None

Remove a word from the word frequency list

Parameters:

word (str) – The word to remove

remove_by_threshold(threshold: int = 5) → None

Remove all words at, or below, the provided threshold

Parameters:

threshold (int) – The threshold at which a word is to be removed

remove_words(words: Iterable[str | bytes]) → None

Remove a list of words from the word frequency list

Parameters:

words (list) – The list of words to remove

tokenize(text: str | bytes) → Iterator[str]

Tokenize the provided string object into individual words

Parameters:

text (str) – The string object to tokenize

Yields:

str – The next word in the tokenized string

Note

This is the same as SpellChecker.split_words() unless a tokenizer function was provided.

property total_words: int

The sum of all word occurrences in the word frequency dictionary

Note

Not settable

Type:

int

property unique_words: int

The total number of unique words in the word frequency list

Note

Not settable

Type:

int

words() → Iterator[str]

Iterator over the words in the dictionary

Yields:

str – The next word in the dictionary

Note

This is the same as WordFrequency.keys()