meguru_tokenizer.vocab module
- class meguru_tokenizer.vocab.BaseVocab(unk: str = '<unk>', pad: str = '<pad>', bos: str = '<s>', eos: str = '</s>', mask: str = '<mask>')
  Bases: object
- class meguru_tokenizer.vocab.Vocab(unk: str = '<unk>', pad: str = '<pad>', bos: str = '<s>', eos: str = '</s>', mask: str = '<mask>')
  Bases: meguru_tokenizer.vocab.BaseVocab
- add_vocab(word: str)
  Add a word to the vocabulary counter used to construct the vocabulary list.
  - Parameters
    word (str) – a word
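As a rough illustration of the documented contract (not the library's actual implementation), `add_vocab` and `add_vocabs` can be thought of as accumulating word frequencies in a counter that `build_vocab` later consumes. The `FrequencyVocab` class below is a hypothetical stand-in written for this sketch:

```python
from collections import Counter
from typing import List

# Hypothetical stand-in for meguru_tokenizer.vocab.Vocab, sketching the
# documented behavior: each add_vocab/add_vocabs call accumulates word
# frequencies; build_vocab would later consume these counts.
class FrequencyVocab:
    def __init__(self):
        self.freq = Counter()  # word -> occurrence count

    def add_vocab(self, word: str):
        self.freq[word] += 1

    def add_vocabs(self, words: List[str]):
        self.freq.update(words)

vocab = FrequencyVocab()
vocab.add_vocabs("the cat sat on the mat".split())
vocab.add_vocab("cat")
print(vocab.freq["the"], vocab.freq["cat"])  # 2 2
```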
- add_vocabs(words: List[str])
  Add words to the vocabulary counter used to construct the vocabulary list.
  - Parameters
    words (List[str]) – a list of words
  Example
  >>> words = sentence.split()
  >>> vocab.add_vocabs(words)
- build_vocab(min_freq: Optional[int] = None, vocab_size: Optional[int] = None)
  Build the vocabulary list from the added words.
  - Parameters
    min_freq (Optional[int]) – minimum frequency a word needs in order to be kept in the vocabulary
    vocab_size (Optional[int]) – maximum vocabulary size
  Note
  Once the vocabulary is built, the source word counts are discarded to free memory.
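The `build_vocab` contract described above can be sketched as follows. This is a hypothetical, self-contained re-implementation written for illustration (the function name `build_vocab`, the `SPECIALS` list, and the `stoi`/`itos` return values are assumptions, not the library's API): keep words at or above `min_freq`, truncate to `vocab_size`, assign ids after the special tokens, and clear the counter to free memory.

```python
from collections import Counter
from typing import Optional

# Special tokens matching the constructor defaults documented above.
SPECIALS = ['<unk>', '<pad>', '<s>', '</s>', '<mask>']

# Hypothetical stand-in for Vocab.build_vocab: filter by min_freq,
# cap at vocab_size, then drop the source counts to free memory.
def build_vocab(freq: Counter,
                min_freq: Optional[int] = None,
                vocab_size: Optional[int] = None):
    # Most frequent words first (Counter.most_common sorts by count).
    words = [w for w, c in freq.most_common()
             if min_freq is None or c >= min_freq]
    if vocab_size is not None:
        words = words[:vocab_size]              # cap the vocabulary size
    itos = SPECIALS + words                     # id -> token
    stoi = {w: i for i, w in enumerate(itos)}   # token -> id
    freq.clear()                                # discard source counts
    return stoi, itos

freq = Counter("the cat sat on the mat the".split())
stoi, itos = build_vocab(freq, min_freq=2)
print(itos)  # specials followed by ['the'] (only 'the' occurs >= 2 times)
```

Note how `freq` is empty afterwards, mirroring the memory-freeing behavior the docstring describes.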