meguru_tokenizer.sudachi_tokenizer module¶
-
class meguru_tokenizer.sudachi_tokenizer.SudachiTokenizer(vocab: Optional[meguru_tokenizer.vocab.Vocab] = None, normalize: bool = True, sudachi_normalize: bool = False, lower: bool = False, language: str = 'unk', enable_gpu: bool = False)[source]¶
Bases: meguru_tokenizer.base_tokenizer.Tokenizer
Tokenizer that splits sentences with Sudachi, a Japanese morphological analyzer.
Example
>>> import pprint
>>> tokenizer = SudachiTokenizer(language="ja")
>>> sentences = ["銀座でランチをご一緒しましょう。", "締切間に合いますか?", "トークナイザを作りました。"]
>>> vocab = Vocab()
>>> for sentence in sentences:
...     vocab.add_vocabs(tokenizer.tokenize(sentence))
>>> vocab.build_vocab()
>>> tokenizer.vocab = vocab
>>> vocab.dump_vocab(Path("vocab.txt"))
>>> print("vocabs:")
>>> pprint.pprint(vocab.i2w)
vocabs:
{0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>',
 5: 'を', 6: '。', 7: '銀座', 8: 'で', 9: 'ランチ', 10: 'ご',
 11: '一緒', 12: 'し', 13: 'ましょう', 14: '締切', 15: '間に合い',
 16: 'ます', 17: 'か', 18: '?', 19: 'トークナイザ', 20: '作り',
 21: 'まし', 22: 'た'}
>>> print("tokenized sentence")
>>> pprint.pprint(tokenizer.tokenize_list(sentences))
[['銀座', 'で', 'ランチ', 'を', 'ご', '一緒', 'し', 'ましょう', '。'],
 ['締切', '間に合い', 'ます', 'か', '?'],
 ['トークナイザ', 'を', '作り', 'まし', 'た', '。']]
>>> print("encoded sentence")
>>> pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
[[7, 8, 9, 5, 10, 11, 12, 13, 6],
 [14, 15, 16, 17, 18],
 [19, 5, 20, 21, 22, 6]]
>>> encodes = []
>>> for sentence in sentences:
...     encodes.append(tokenizer.encode(sentence))
>>> print("decoded sentence")
>>> pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
['銀座 で ランチ を ご 一緒 し ましょう 。',
 '締切 間に合い ます か ?',
 'トークナイザ を 作り まし た 。']
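The encoded sentences above have different lengths; a common next step is to right-pad them to a uniform length before batching. The sketch below uses only the calls documented on this page plus a local pad_batch helper, which is illustrative and not part of meguru_tokenizer; it assumes the <pad> index 0 from the vocabulary shown above.
>>> def pad_batch(encoded, pad_id=0):
...     # illustrative helper: right-pad every id list with the <pad> index
...     max_len = max(len(ids) for ids in encoded)
...     return [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
>>> pad_batch([tokenizer.encode(sentence) for sentence in sentences])
[[7, 8, 9, 5, 10, 11, 12, 13, 6],
 [14, 15, 16, 17, 18, 0, 0, 0, 0],
 [19, 5, 20, 21, 22, 6, 0, 0, 0]]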
-
decode(tokens: List[int])[source]¶
decode a sentence
- Parameters
tokens (List[int]) – tokens
- Returns
a sentence
- Return type
str
Example
>>> tokenizer.decode([2, 3, 1, 4])
"おはようございます。"
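As the class-level example shows, decode joins the looked-up tokens with single spaces, so an encode/decode round trip returns a space-separated rendering rather than the original raw string. A minimal round-trip sketch, assuming the tokenizer and vocabulary built in the class-level example above:
>>> ids = tokenizer.encode("銀座でランチをご一緒しましょう。")
>>> tokenizer.decode(ids)
'銀座 で ランチ を ご 一緒 し ましょう 。'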
-
encode(sentence: str)[source]¶
encode a sentence
- Parameters
sentence (str) – a sentence
- Returns
tokens
- Return type
List[int]
Example
>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
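The vocabulary built in the class-level example reserves <unk> at index 3. Tokens that were never added to the vocabulary are presumably mapped to that index when encoding; both this fallback and the exact segmentation shown below are assumptions, not behaviour stated on this page.
>>> tokenizer.encode("ディナーをご一緒しましょう。")  # 'ディナー' is not in the vocabulary built above
[3, 5, 10, 11, 12, 13, 6]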
-
languages = ['ja']¶
-
tokenize(sentence: str)[source]¶
tokenize a sentence
- Parameters
sentence (str) – a sentence
- Returns
tokens
- Return type
Tuple[str]
Example
>>> tokenizer.tokenize("おはようございます。おやすみなさい", True)
["おはよう", "ござい", "ます", "おやすみ", "なさい"]
>>> tokenizer.tokenize("おはようございます。おやすみなさい", False)
[["おはよう", "ござい", "ます"], ["おやすみ", "なさい"]]
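encode appears to amount to a vocabulary lookup over the tokens that tokenize produces, so the two can be related directly. The sketch below inverts the documented i2w dict locally (the w2i mapping is built by hand and is not assumed to be an attribute of Vocab) and reuses the tokenizer and vocabulary from the class-level example:
>>> w2i = {word: idx for idx, word in vocab.i2w.items()}
>>> tokens = tokenizer.tokenize("トークナイザを作りました。")
>>> [w2i[token] for token in tokens]
[19, 5, 20, 21, 22, 6]
>>> tokenizer.encode("トークナイザを作りました。")
[19, 5, 20, 21, 22, 6]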
-