pyspellchecker API
Here you can find the full developer API for the pyspellchecker project. pyspellchecker provides a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.
SpellChecker
- class spellchecker.SpellChecker(language: str | Iterable[str] | None = 'en', local_dictionary: Path | str | None = None, distance: int = 2, tokenizer: Callable[[str], Iterable[str]] | None = None, case_sensitive: bool = False)[source]
The SpellChecker class encapsulates the basics needed for a simple spell-checking algorithm. It is based on the work by Peter Norvig (https://norvig.com/spell-correct.html).
- Parameters:
language (str) – The language of the dictionary to load, or None for no dictionary. Supported languages are en, es, it, de, fr, pt, ru, lv, eu, and nl. Defaults to en. If a list of languages is provided, all of them are loaded.
local_dictionary (str) – The path to a locally stored word frequency dictionary; if provided, no language will be loaded
distance (int) – The edit distance to use. Defaults to 2.
tokenizer (function) – The function to use to tokenize a string
case_sensitive (bool) – Flag to use a case sensitive dictionary or not, only available when not using a language dictionary.
Note
Correcting words with a case-sensitive dictionary can be slow.
- candidates(word: str | bytes) Set[str] | None [source]
Generate possible spelling corrections for the provided word, up to an edit distance of two, and only when needed
- Parameters:
word (str) – The word for which to calculate candidate spellings
- Returns:
The set of words that are possible candidates or None if there are no candidates
- Return type:
set
- correction(word: str | bytes) str | None [source]
The most probable correct spelling for the word
- Parameters:
word (str) – The word to correct
- Returns:
The most likely candidate or None if no correction is present
- Return type:
str
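As a rough illustration of how a "most probable" correction can be picked from a candidate set, the sketch below ranks candidates by their count in a toy frequency table. The names pick_correction and toy_counts are hypothetical, and this is a stdlib-only stand-in, not the library's implementation.

```python
from collections import Counter
from typing import Optional, Set


def pick_correction(candidates: Optional[Set[str]], counts: Counter) -> Optional[str]:
    """Return the candidate with the highest count, or None if there are none."""
    if not candidates:
        return None
    # Sort first so that ties are broken alphabetically and the result is stable.
    return max(sorted(candidates), key=lambda word: counts[word])


# Hypothetical frequency table used only for this illustration.
toy_counts = Counter({"the": 50, "they": 12, "then": 9})
best = pick_correction({"they", "then", "the"}, toy_counts)
```

Here best is the candidate with the largest count in toy_counts; an empty or missing candidate set yields None, matching the documented return value of correction().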
- property distance: int
The maximum edit distance to calculate
Note
Valid values are 1 or 2; if an invalid value is passed, defaults to 2
- Type:
int
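The fallback described in the note can be pictured with a small property setter. This is a minimal sketch, assuming only what the note states (1 and 2 are valid, anything else becomes 2); DistanceHolder is a hypothetical class, not part of pyspellchecker.

```python
class DistanceHolder:
    """Hypothetical holder that mimics the documented fallback behaviour."""

    def __init__(self, distance: int = 2) -> None:
        self._distance = 2
        self.distance = distance  # route through the setter for validation

    @property
    def distance(self) -> int:
        return self._distance

    @distance.setter
    def distance(self, value: int) -> None:
        # Only 1 and 2 are accepted; any other value falls back to 2.
        self._distance = value if value in (1, 2) else 2
```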
- edit_distance_1(word: str | bytes) Set[str] [source]
Compute all strings that are one edit away from word using only the letters in the corpus
- Parameters:
word (str) – The word for which to calculate the edit distance
- Returns:
The set of strings that are edit distance one from the provided word
- Return type:
set
- edit_distance_2(word: str | bytes) List[str] [source]
Compute all strings that are two edits away from word using only the letters in the corpus
- Parameters:
word (str) – The word for which to calculate the edit distance
- Returns:
The list of strings that are edit distance two from the provided word
- Return type:
list
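To make "one edit away" and "two edits away" concrete, here is a sketch in the style of the Norvig article linked above: deletes, transposes, replaces and inserts, restricted to a given letter set. The functions edits_one and edits_two are hypothetical stand-ins written for this illustration; they are not the library's own code.

```python
from typing import List, Set


def edits_one(word: str, letters: Set[str]) -> Set[str]:
    """All strings one edit away from `word`, drawing only on `letters`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits_two(word: str, letters: Set[str]) -> List[str]:
    """Apply `edits_one` to every one-edit string; duplicates are kept."""
    return [e2 for e1 in edits_one(word, letters) for e2 in edits_one(e1, letters)]


letters = set("act")          # toy corpus alphabet
one = edits_one("cta", letters)
two = edits_two("cta", letters)
```

Because the two-edit result is built by expanding every one-edit string, it grows quickly with word length and alphabet size, which is one reason a larger edit distance costs more.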
- export(filepath: Path | str, encoding: str = 'utf-8', gzipped: bool = True) None [source]
Export the word frequency list for import in the future
- Parameters:
filepath (str) – The filepath to the exported dictionary
encoding (str) – The encoding of the resulting output
gzipped (bool) – Whether to gzip the dictionary or not
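The sketch below shows one way a word frequency mapping could be written out as JSON, gzip-compressed by default. It assumes the exported file uses the same JSON (optionally gzipped) layout that load_dictionary() is documented to read; export_frequencies is a hypothetical helper, not the library's method.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Union


def export_frequencies(freqs: Dict[str, int], filepath: Union[Path, str],
                       encoding: str = "utf-8", gzipped: bool = True) -> None:
    """Write `freqs` as JSON, gzip-compressed unless `gzipped` is False."""
    payload = json.dumps(freqs, sort_keys=True)
    if gzipped:
        # Text mode ("wt") lets gzip handle the encoding for us.
        with gzip.open(filepath, "wt", encoding=encoding) as handle:
            handle.write(payload)
    else:
        with open(filepath, "w", encoding=encoding) as handle:
            handle.write(payload)
```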
- known(words: Iterable[str | bytes]) Set[str] [source]
The subset of words that appear in the dictionary of words
- Parameters:
words (list) – List of words to determine which are in the corpus
- Returns:
The set of those words from the input that are in the corpus
- Return type:
set
- classmethod languages() Iterable[str] [source]
A list of all official languages supported by the library
- Return type:
list
- split_words(text: str | bytes) Iterable[str] [source]
Split text into individual words using either a simple whitespace regex or the passed in tokenizer
- Parameters:
text (str) – The text to split into individual words
- Returns:
A listing of all words in the provided text
- Return type:
list(str)
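The "tokenizer if given, otherwise a simple regex" choice can be sketched as follows. The fallback pattern and the lowercasing are assumptions made for this illustration, and split_into_words is a hypothetical name; the library's actual pattern may differ.

```python
import re
from typing import Callable, Iterable, List, Optional


def split_into_words(text: str,
                     tokenizer: Optional[Callable[[str], Iterable[str]]] = None) -> List[str]:
    """Use `tokenizer` when given, otherwise fall back to a simple regex."""
    if tokenizer is not None:
        return list(tokenizer(text))
    # Assumed fallback: runs of word characters, allowing inner apostrophes.
    return re.findall(r"\w+(?:'\w+)*", text.lower())


default_split = split_into_words("Don't panic, Arthur!")
custom_split = split_into_words("A b", str.split)
```

Passing a callable such as str.split bypasses the regex entirely, so the custom tokenizer is also responsible for any case folding.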
- unknown(words: Iterable[str | bytes]) Set[str] [source]
The subset of words that do not appear in the dictionary
- Parameters:
words (list) – List of words to determine which are not in the corpus
- Returns:
The set of those words from the input that are not in the corpus
- Return type:
set
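Conceptually, known() and unknown() sort the input words into those found in the dictionary and those not found. The sketch below shows that idea with plain sets; it assumes case-insensitive matching (the default case_sensitive=False) and ignores any extra filtering the library may apply. partition_words is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple


def partition_words(words: Iterable[str], dictionary: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Split `words` into (known, unknown) after lowercasing each one."""
    lowered = {word.lower() for word in words}
    known = {word for word in lowered if word in dictionary}
    return known, lowered - known


known, unknown = partition_words(["The", "quikc", "fox"], {"the", "quick", "fox"})
```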
- property word_frequency: WordFrequency
An encapsulation of the word frequency dictionary
Note
Not settable
- Type:
WordFrequency
- word_usage_frequency(word: str | bytes, total_words: int | None = None) float [source]
Calculate the frequency of the provided word as seen across the entire dictionary
- Parameters:
word (str) – The word for which the word probability is calculated
total_words (int) – The total number of words to use in the calculation; leave as the default (None) to use the whole word frequency
- Returns:
The probability that the word is the correct word
- Return type:
float
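A word's usage frequency can be read as its count divided by a total word count. The following is a minimal sketch under that assumption, using a toy Counter; usage_frequency is a hypothetical function and not the library's implementation.

```python
from collections import Counter
from typing import Optional


def usage_frequency(word: str, counts: Counter, total_words: Optional[int] = None) -> float:
    """Relative frequency of `word`: its count divided by the total word count."""
    total = sum(counts.values()) if total_words is None else total_words
    # Guard against an empty table to avoid dividing by zero.
    return counts[word] / total if total else 0.0


counts = Counter({"the": 6, "cat": 3, "sat": 1})
cat_freq = usage_frequency("cat", counts)
the_freq_custom_total = usage_frequency("the", counts, total_words=12)
```

Supplying total_words changes only the denominator, which is useful when comparing words against a corpus size other than the loaded dictionary.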
WordFrequency
- class spellchecker.WordFrequency(tokenizer: Callable[[str], Iterable[str]] | None = None, case_sensitive: bool = False)[source]
Store the dictionary as a word frequency list while allowing for different methods to load the data and update over time
- add(word: str | bytes, val: int = 1) None [source]
Add a word to the word frequency list
- Parameters:
word (str) – The word to add
val (int) – The number of times to insert the word
- property dictionary: Dict[str, int]
A counting dictionary of all words in the corpus and the number of times each has been seen
Note
Not settable
- Type:
Counter
- items() Generator[Tuple[str, int], None, None] [source]
Iterator over the words in the dictionary
- Yields:
str – The next word in the dictionary
int – The number of instances in the dictionary
Note
This is the same as dict.items()
- keys() Iterator[str] [source]
Iterator over the keys of the dictionary
- Yields:
str – The next key in the dictionary
Note
This is the same as spellchecker.words()
- property letters: Set[str]
The listing of all letters found within the corpus
Note
Not settable
- Type:
set
- load_dictionary(filename: Path | str, encoding: str = 'utf-8') None [source]
Load in a pre-built word frequency list
- Parameters:
filename (str) – The filepath to the json (optionally gzipped) file to be loaded
encoding (str) – The encoding of the dictionary
- load_json(data: Dict[str, int]) None [source]
Load in a pre-built word frequency list from an in-memory dictionary
- Parameters:
data (dict) – The dictionary to be loaded
- load_text(text: str | bytes, tokenizer: Callable[[str], Iterable[str]] | None = None) None [source]
Load text from which to generate a word frequency list
- Parameters:
text (str) – The text to be loaded
tokenizer (function) – The function to use to tokenize a string
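Building a word frequency list from raw text amounts to tokenizing and counting. The sketch below assumes whitespace splitting as the fallback and lowercases every token; count_words is a hypothetical helper, not the library's code.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def count_words(text: str,
                tokenizer: Optional[Callable[[str], Iterable[str]]] = None) -> Counter:
    """Tokenize `text` and count how often each lowercased token occurs."""
    tokens = tokenizer(text) if tokenizer is not None else text.split()
    return Counter(token.lower() for token in tokens)


freqs = count_words("the cat and The hat")
dashed = count_words("a-b-a", tokenizer=lambda s: s.split("-"))
```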
- load_text_file(filename: Path | str, encoding: str = 'utf-8', tokenizer: Callable[[str], Iterable[str]] | None = None) None [source]
Load in a text file from which to generate a word frequency list
- Parameters:
filename (str) – The filepath to the text file to be loaded
encoding (str) – The encoding of the text file
tokenizer (function) – The function to use to tokenize a string
- load_words(words: Iterable[str | bytes]) None [source]
Load a list of words from which to generate a word frequency list
- Parameters:
words (list) – The list of words to be loaded
- property longest_word_length: int
The longest word length in the dictionary
Note
Not settable
- Type:
int
- pop(key: str | bytes, default: int | None = None) int | None [source]
Remove the key and return the associated value or default if not found
- Parameters:
key (str) – The key to remove
default (obj) – The value to return if key is not present
- Returns:
The number of instances of key, or default if the key is not in the dictionary
- Return type:
int | None
- remove(word: str | bytes) None [source]
Remove a word from the word frequency list
- Parameters:
word (str) – The word to remove
- remove_by_threshold(threshold: int = 5) None [source]
Remove all words at, or below, the provided threshold
- Parameters:
threshold (int) – The threshold at which a word is to be removed
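"At, or below" means the comparison is inclusive: a word whose count equals the threshold is removed too. A minimal sketch of that rule on a plain Counter follows; drop_rare_words is a hypothetical name used only here.

```python
from collections import Counter


def drop_rare_words(counts: Counter, threshold: int = 5) -> None:
    """Delete, in place, every word whose count is at or below `threshold`."""
    # Collect the keys first so the mapping is not mutated while iterating.
    for word in [w for w, n in counts.items() if n <= threshold]:
        del counts[word]


counts = Counter({"the": 40, "teh": 2, "cat": 5, "dog": 6})
drop_rare_words(counts)
remaining = set(counts)
```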
- remove_words(words: Iterable[str | bytes]) None [source]
Remove a list of words from the word frequency list
- Parameters:
words (list) – The list of words to remove
- tokenize(text: str | bytes) Iterator[str] [source]
Tokenize the provided string object into individual words
- Parameters:
text (str) – The string object to tokenize
- Yields:
str – The next word in the tokenized string
Note
This is the same as spellchecker.split_words() unless a tokenizer function was provided.
- property total_words: int
The sum of all word occurrences in the word frequency dictionary
Note
Not settable
- Type:
int
- property unique_words: int
The total number of unique words in the word frequency list
Note
Not settable
- Type:
int
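The read-only summary properties (total_words, unique_words, letters, longest_word_length) can each be derived from the counting dictionary. The sketch below shows one plausible derivation on a toy Counter; it is an assumption about how such values relate to the counts, not the library's code.

```python
from collections import Counter

# Toy word counts used only for this illustration.
counts = Counter({"spell": 3, "checker": 2, "api": 1})

total_words = sum(counts.values())                        # every occurrence counted
unique_words = len(counts)                                # distinct keys only
letters = set("".join(counts))                            # characters used by any word
longest_word_length = max(map(len, counts), default=0)    # 0 for an empty table
```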