meguru_tokenizer.vocab module

class meguru_tokenizer.vocab.BaseVocab(unk: str = '<unk>', pad: str = '<pad>', bos: str = '<s>', eos: str = '</s>', mask: str = '<mask>')[source]

Bases: object

abstract idx2word(idx: int)[source]
abstract word2idx(word: str)[source]
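
A minimal sketch of the contract these two abstract methods imply is given below. The names BaseVocabSketch and FixedVocab are hypothetical and not part of meguru_tokenizer. The sketch uses abc.ABC so that a missing method fails at instantiation; the library's base class is listed as deriving from object, so it may not enforce this in the same way.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class BaseVocabSketch(ABC):
    """Hypothetical stand-in for a vocabulary base class (not the library's code)."""

    @abstractmethod
    def idx2word(self, idx: int) -> str:
        """Map an index to its word."""

    @abstractmethod
    def word2idx(self, word: str) -> int:
        """Map a word to its index."""


class FixedVocab(BaseVocabSketch):
    """Hypothetical subclass backed by a fixed list of words."""

    def __init__(self, words: Iterable[str]) -> None:
        self._words: List[str] = list(words)

    def idx2word(self, idx: int) -> str:
        return self._words[idx]

    def word2idx(self, word: str) -> int:
        return self._words.index(word)


fixed = FixedVocab(["<unk>", "hello", "world"])
```

A subclass only becomes instantiable once both methods are implemented, which is the point of marking them abstract.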
class meguru_tokenizer.vocab.Vocab(unk: str = '<unk>', pad: str = '<pad>', bos: str = '<s>', eos: str = '</s>', mask: str = '<mask>')[source]

Bases: meguru_tokenizer.vocab.BaseVocab

add_vocab(word: str)[source]

Add a word to the source from which the vocabulary list is constructed.

Parameters

word (str) – a word

add_vocabs(words: List[str])[source]

Add a list of words to the source from which the vocabulary list is constructed.

Parameters

words (List[str]) – list of words

Example

>>> vocab = Vocab()
>>> sentence = "a short example sentence"
>>> words = sentence.split()
>>> vocab.add_vocabs(words)
build_vocab(min_freq: Optional[int] = None, vocab_size: Optional[int] = None)[source]

Build the vocabulary list from the added words.

Parameters
  • min_freq (Optional[int]) – minimum frequency of a word in the vocabulary

  • vocab_size (Optional[int]) – maximum vocabulary size

Note

Once the vocabulary is built, the source of the vocabulary is removed to free memory.
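
The effect of min_freq and vocab_size can be illustrated with a small standalone sketch. It is not the library's implementation: build_vocab_sketch is a hypothetical helper, and three behaviors are assumptions (the special tokens come first, min_freq is inclusive, and vocab_size counts the special tokens).

```python
from collections import Counter
from typing import List, Optional

# Default special tokens, taken from the Vocab signature.
SPECIAL_TOKENS = ["<unk>", "<pad>", "<s>", "</s>", "<mask>"]


def build_vocab_sketch(
    counts: Counter,
    min_freq: Optional[int] = None,
    vocab_size: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: turn word counts into an ordered vocabulary list."""
    # Most frequent words first; drop words below min_freq (assumed inclusive).
    words = [
        word
        for word, freq in counts.most_common()
        if word not in SPECIAL_TOKENS and (min_freq is None or freq >= min_freq)
    ]
    if vocab_size is not None:
        # Assumption: the size limit also counts the special tokens.
        words = words[: max(vocab_size - len(SPECIAL_TOKENS), 0)]
    return SPECIAL_TOKENS + words


counts = Counter("the cat sat on the mat the cat".split())
vocab_list = build_vocab_sketch(counts, min_freq=2)
small_list = build_vocab_sketch(counts, vocab_size=6)
```

Here vocab_list keeps only the words seen at least twice, and small_list is cut to six entries in total.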

dump_vocab(export_path: pathlib.Path)[source]

Dump the vocabulary to export_path.

Parameters

export_path (Path) – path to dump the vocabulary to

idx2word(idx: int)[source]

index to word

Parameters

idx (int) – index of the word

Returns

the word paired with the index; if the index is not found, “<unk>” is returned

Return type

str

load_vocab(load_path: pathlib.Path)[source]

Load the vocabulary from load_path.

Parameters

load_path (Path) – path to load the vocabulary from
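
The on-disk format used by dump_vocab and load_vocab is not described here. As an illustration only, the sketch below round-trips a word list through a file under the assumption of one word per line in UTF-8; dump_words and load_words are hypothetical helpers, not library functions.

```python
import tempfile
from pathlib import Path
from typing import List


def dump_words(words: List[str], export_path: Path) -> None:
    """Hypothetical helper: write one word per line (assumed format)."""
    export_path.write_text("\n".join(words) + "\n", encoding="utf-8")


def load_words(load_path: Path) -> List[str]:
    """Hypothetical helper: read the words back in the same order."""
    # Assumes no word contains a line break.
    return load_path.read_text(encoding="utf-8").splitlines()


words = ["<unk>", "<pad>", "<s>", "</s>", "<mask>", "the", "cat"]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "vocab.txt"
    dump_words(words, path)
    restored = load_words(path)
```

Keeping the line order stable matters in any such format, because a word's position is its index.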

word2idx(word: str)[source]

word to index

Parameters

word (str) – a word

Returns

the index paired with the word; if the word is not found, the index of “<unk>” is returned

Return type

int
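
The fallback behavior described for word2idx and idx2word can be sketched with a list and a dictionary. This is an illustration under assumptions, not the library's code: the function names and the two lookup tables are hypothetical, and “<unk>” is assumed to sit at index 0.

```python
from typing import Dict, List

UNK = "<unk>"

# Hypothetical lookup tables; a position in the list is the word's index.
idx_to_word: List[str] = [UNK, "<pad>", "<s>", "</s>", "<mask>", "the", "cat"]
word_to_idx: Dict[str, int] = {word: idx for idx, word in enumerate(idx_to_word)}


def word2idx_sketch(word: str) -> int:
    """Return the word's index, or the index of <unk> for unknown words."""
    return word_to_idx.get(word, word_to_idx[UNK])


def idx2word_sketch(idx: int) -> str:
    """Return the word at idx, or <unk> when idx is out of range."""
    if 0 <= idx < len(idx_to_word):
        return idx_to_word[idx]
    return UNK


known = word2idx_sketch("cat")
unknown = word2idx_sketch("dog")
```

With these tables, a known word maps to its own index, while an unseen word maps to the index of “<unk>”.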