meguru_tokenizer.base_tokenizer module

class meguru_tokenizer.base_tokenizer.Tokenizer(normalize: bool, lower: bool, language: str = 'unk')[source]

Bases: abc.ABC

Base Tokenizer

tokenizer

the underlying tokenizer backend, e.g. MeCab or Sudachi
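Concrete tokenizers subclass Tokenizer and implement the abstract methods encode, decode, and vocab_size. The sketch below is a minimal, hypothetical whitespace-based subclass; WhitespaceTokenizer and its internal word/id dictionaries are illustrative only and are not part of this package, and it assumes the base __init__ accepts the normalize, lower, and language arguments documented above.

from typing import List

from meguru_tokenizer.base_tokenizer import Tokenizer


class WhitespaceTokenizer(Tokenizer):
    """Illustrative subclass: splits on whitespace and maps words to ids."""

    def __init__(self):
        # assumed base-class arguments, per the documented signature above
        super().__init__(normalize=False, lower=True, language="en")
        self._word2id = {"<unk>": 0}
        self._id2word = {0: "<unk>"}

    def tokenize(self, sentence: str) -> List[str]:
        # naive whitespace split; real subclasses would call MeCab, Sudachi, etc.
        return sentence.split()

    def encode(self, sentence: str) -> List[int]:
        # grow the vocabulary on the fly and return the word ids
        ids = []
        for word in self.tokenize(sentence):
            if word not in self._word2id:
                self._id2word[len(self._word2id)] = word
                self._word2id[word] = len(self._word2id)
            ids.append(self._word2id[word])
        return ids

    def decode(self, tokens: List[int]) -> str:
        # map ids back to words; unknown ids fall back to <unk>
        return " ".join(self._id2word.get(t, "<unk>") for t in tokens)

    def vocab_size(self) -> int:
        return len(self._word2id)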

abstract decode(tokens: List[int])[source]

decode a sentence

Parameters

tokens (List[int]) – tokens

Returns

a sentence

Return type

str

Example

>>> tokenizer.decode([2, 3, 1, 4])
"おはようございます。"
abstract encode(sentence: str)[source]

encode a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

List[int]

Example

>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
languages = []
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

words

Return type

List[str]

Example

>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
tokenize_list(sentences: List[str])[source]

tokenize a list of sentences

Parameters

sentences (List[str]) – sentence list

Returns

list of listed words

Return type

List[List[str]]

Examples

>>> tokenizer.tokenize_list(["おはようございます。"])
[["おはよう", "ござい", "ます", "。"]]
abstract vocab_size()[source]

vocabulary size

Returns

vocab_size

Return type

int
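As a hedged illustration, continuing the hypothetical WhitespaceTokenizer sketched above (the returned value depends entirely on the concrete tokenizer's vocabulary; the numbers below are illustrative only):

>>> tokenizer = WhitespaceTokenizer()
>>> tokenizer.encode("good morning everyone")
[1, 2, 3]
>>> tokenizer.vocab_size()
4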