meguru_tokenizer.vocab module

class meguru_tokenizer.vocab.BaseVocab(unk: str = '<unk>', pad: str = '<pad>', bos: str = '<s>', eos: str = '</s>', mask: str = '<mask>')

Bases: object

abstract idx2word(idx: int)

abstract word2idx(word: str)

class meguru_tokenizer.vocab.Vocab(unk: str = '<unk>', pad: str = '<pad>', bos: str = '<s>', eos: str = '</s>', mask: str = '<mask>')

Bases: meguru_tokenizer.vocab.BaseVocab

add_vocab(word: str)

add a word to the vocabulary source, from which the vocabulary list is built

Parameters: word (str) – a word

add_vocabs(words: List[str])

add words to the vocabulary source, from which the vocabulary list is built

Parameters: words (List[str]) – list of words

Example

>>> from meguru_tokenizer.vocab import Vocab
>>> vocab = Vocab()
>>> words = "a quick brown fox".split()
>>> vocab.add_vocabs(words)

build_vocab(min_freq: Optional[int] = None, vocab_size: Optional[int] = None)

build the vocabulary list from the added words

Parameters

min_freq (Optional[int]) – minimum frequency for a word to be included in the vocabulary
vocab_size (Optional[int]) – maximum vocabulary size

Note

once the vocabulary is built, the source data it was built from is removed to free memory.
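
The interaction of min_freq and vocab_size can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the helper `build_index`, the choice to place special tokens first, the inclusive `>=` comparison for min_freq, and counting special tokens toward vocab_size are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

# Default special tokens, taken from the constructor signature above.
SPECIALS = ["<unk>", "<pad>", "<s>", "</s>", "<mask>"]


def build_index(
    words: List[str],
    min_freq: Optional[int] = None,
    vocab_size: Optional[int] = None,
) -> Dict[str, int]:
    """Hypothetical helper: map words to indices after frequency filtering."""
    counts = Counter(words)
    # most_common() sorts by count, descending; ties keep first-seen order.
    ranked = [
        w
        for w, c in counts.most_common()
        if w not in SPECIALS and (min_freq is None or c >= min_freq)
    ]
    if vocab_size is not None:
        # Assumption: the size limit includes the special tokens.
        ranked = ranked[: max(vocab_size - len(SPECIALS), 0)]
    return {w: i for i, w in enumerate(SPECIALS + ranked)}


corpus = "a b a c a b d".split()
index = build_index(corpus, min_freq=2)   # keeps "a" (3 times) and "b" (2 times)
capped = build_index(corpus, vocab_size=6)  # keeps only the most frequent word, "a"
```

In this sketch the two limits compose: min_freq drops rare words first, then vocab_size truncates what remains. Whether Vocab.build_vocab applies them in the same order is not stated in this reference.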

dump_vocab(export_path: pathlib.Path)

dump the vocabulary to export_path

Parameters: export_path (Path) – path to write the vocabulary to

idx2word(idx: int)

index to word

Parameters: idx (int) – index of the word
Returns: the word paired with the index; if the index is not found, “<unk>” is returned
Return type: str

load_vocab(load_path: pathlib.Path)

load the vocabulary from load_path

Parameters: load_path (Path) – path to read the vocabulary from

word2idx(word: str)

word to index

Parameters: word (str) – a word
Returns: the index paired with the word; if the word is not found, the index of “<unk>” is returned
Return type: int
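
The fallback behavior described for word2idx and idx2word can be sketched with a small stand-in class. `TinyVocab` is hypothetical and is not the library's Vocab; in particular, placing “<unk>” at index 0 and treating out-of-range indices as unknown are assumptions.

```python
from typing import Dict, List

UNK = "<unk>"


class TinyVocab:
    """Hypothetical stand-in showing a two-way lookup with an <unk> fallback."""

    def __init__(self, words: List[str]) -> None:
        # dict.fromkeys removes duplicates while keeping first-seen order.
        self.i2w: List[str] = [UNK] + [w for w in dict.fromkeys(words) if w != UNK]
        self.w2i: Dict[str, int] = {w: i for i, w in enumerate(self.i2w)}

    def word2idx(self, word: str) -> int:
        # Unknown words map to the index of <unk>.
        return self.w2i.get(word, self.w2i[UNK])

    def idx2word(self, idx: int) -> str:
        # Indices outside the table map to <unk>.
        if 0 <= idx < len(self.i2w):
            return self.i2w[idx]
        return UNK


vocab = TinyVocab(["hello", "world"])
ids = [vocab.word2idx(w) for w in ["hello", "there", "world"]]
words = [vocab.idx2word(i) for i in ids]
```

Because “there” was never added, it encodes to the “<unk>” index and decodes back to “<unk>”, so a round trip through the vocabulary is lossy for out-of-vocabulary words.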