meguru_tokenizer package
Subpackages
Submodules
Module contents
class meguru_tokenizer.Tokenizer(normalize: bool, lower: bool, language: str = 'unk')
Bases: abc.ABC
Base tokenizer class.
tokenizer
The underlying tokenizer backend, e.g. MeCab or Sudachi.
abstract decode(tokens: List[int])
Decode a sequence of token ids back into a sentence.
- Parameters: tokens (List[int]) – token ids
- Returns: a sentence
- Return type: str
Example
>>> tokenizer.decode([2, 3, 1, 4])
"おはようございます。"
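A hedged sketch of how a concrete subclass might implement decode; the id2word table below is an assumption for illustration, not part of the documented API.

from typing import List

def decode(self, tokens: List[int]) -> str:
    # Hypothetical: self.id2word maps each token id back to its surface form.
    # Japanese text is joined without spaces; other languages may need " ".join.
    return "".join(self.id2word[t] for t in tokens)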
abstract encode(sentence: str)
Encode a sentence into a sequence of token ids.
- Parameters: sentence (str) – a sentence
- Returns: tokens
- Return type: List[int]
Example
>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
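Correspondingly, a subclass's encode might map words to ids through a word2id table; both the table and the unknown-id fallback below are assumptions for illustration.

from typing import List

def encode(self, sentence: str) -> List[int]:
    # Hypothetical: self.word2id maps surface forms to ids; unseen words
    # fall back to an assumed <unk> id of 1.
    return [self.word2id.get(w, 1) for w in self.tokenize(sentence)]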
languages = []
tokenize(sentence: str)
Tokenize a sentence into words.
- Parameters: sentence (str) – a sentence
- Returns: words
- Return type: List[str]
Example
>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
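Putting the three methods together, here is a minimal end-to-end sketch of a concrete subclass; the class name, whitespace backend, and vocabulary handling are illustrative assumptions, not part of meguru_tokenizer.

from typing import List
from meguru_tokenizer import Tokenizer

class WhitespaceToyTokenizer(Tokenizer):
    # Toy subclass for illustration: splits on whitespace.
    def __init__(self, vocab: List[str]):
        super().__init__(normalize=True, lower=True, language="en")
        # Hypothetical lookup tables; a real subclass (e.g. Sudachi-backed)
        # would build these from a corpus or a pretrained vocabulary.
        self.word2id = {w: i for i, w in enumerate(vocab)}
        self.id2word = {i: w for i, w in enumerate(vocab)}

    def tokenize(self, sentence: str) -> List[str]:
        return sentence.lower().split()

    def encode(self, sentence: str) -> List[int]:
        return [self.word2id[w] for w in self.tokenize(sentence)]

    def decode(self, tokens: List[int]) -> str:
        return " ".join(self.id2word[t] for t in tokens)

tok = WhitespaceToyTokenizer(["good", "morning", "everyone"])
tok.encode("Good morning")   # -> [0, 1]
tok.decode([0, 1])           # -> "good morning"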