meguru_tokenizer.base_tokenizer module

class meguru_tokenizer.base_tokenizer.Tokenizer(normalize: bool, lower: bool, language: str = 'unk')[source]

Bases: abc.ABC

Base Tokenizer

tokenizer

the underlying tokenizer backend, e.g. MeCab or Sudachi
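Concrete tokenizers subclass Tokenizer and implement the abstract methods encode, decode, and vocab_size. The sketch below is a minimal, hypothetical whitespace-based subclass; WhitespaceTokenizer and its internal word/id dictionaries are illustrative only and are not part of this package, and it assumes the base __init__ accepts the normalize, lower, and language arguments documented above.

from typing import List

from meguru_tokenizer.base_tokenizer import Tokenizer


class WhitespaceTokenizer(Tokenizer):
    """Illustrative subclass: splits on whitespace and maps words to ids."""

    def __init__(self):
        # assumed base-class arguments, per the documented signature above
        super().__init__(normalize=False, lower=True, language="en")
        self._word2id = {"<unk>": 0}
        self._id2word = {0: "<unk>"}

    def tokenize(self, sentence: str) -> List[str]:
        # naive whitespace split; real subclasses would call MeCab, Sudachi, etc.
        return sentence.split()

    def encode(self, sentence: str) -> List[int]:
        # grow the vocabulary on the fly and return the word ids
        ids = []
        for word in self.tokenize(sentence):
            if word not in self._word2id:
                self._id2word[len(self._word2id)] = word
                self._word2id[word] = len(self._word2id)
            ids.append(self._word2id[word])
        return ids

    def decode(self, tokens: List[int]) -> str:
        # map ids back to words; unknown ids fall back to <unk>
        return " ".join(self._id2word.get(t, "<unk>") for t in tokens)

    def vocab_size(self) -> int:
        return len(self._word2id)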

abstract decode(tokens: List[int])[source]

decode a sentence

Parameters

tokens (List[int]) – tokens

Returns

a sentence

Return type

str

Example

>>> tokenizer.decode([2, 3, 1, 4])
"おはようございます。"
abstract encode(sentence: str)[source]

encode a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

List[int]

Example

>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
languages = []
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

words

Return type

List[str]

Example

>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
tokenize_list(sentences: List[str])[source]

tokenize a list of sentences

Parameters

sentences (List[str]) – sentence list

Returns

list of listed words

Return type

List[List[str]]

Examples

>>> tokenizer.tokenize_list(["おはようございます。"])
[["おはよう", "ござい", "ます", "。"]]
abstract vocab_size()[source]

vocabulary size

Returns

vocab_size

Return type

int
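As a hedged illustration, continuing the hypothetical WhitespaceTokenizer sketched above (the returned value depends entirely on the concrete tokenizer's vocabulary; the numbers below are illustrative only):

>>> tokenizer = WhitespaceTokenizer()
>>> tokenizer.encode("good morning everyone")
[1, 2, 3]
>>> tokenizer.vocab_size()
4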