meguru_tokenizer.sudachi_tokenizer module

class meguru_tokenizer.sudachi_tokenizer.SudachiTokenizer(vocab: Optional[meguru_tokenizer.vocab.Vocab] = None, normalize: bool = True, sudachi_normalize: bool = False, lower: bool = False, language: str = 'unk', enable_gpu: bool = False)[source]

Bases: meguru_tokenizer.base_tokenizer.Tokenizer

Tokenizer that splits sentences with Sudachi, a Japanese morphological analyzer.

Example

>>> import pprint
>>> from pathlib import Path
>>> from meguru_tokenizer.sudachi_tokenizer import SudachiTokenizer
>>> from meguru_tokenizer.vocab import Vocab
>>> tokenizer = SudachiTokenizer(language="ja")
>>> sentences = ["銀座でランチをご一緒しましょう。", "締切間に合いますか?", "トークナイザを作りました。"]
>>> vocab = Vocab()
>>> for sentence in sentences:
...     vocab.add_vocabs(tokenizer.tokenize(sentence))
>>> vocab.build_vocab()
>>> tokenizer.vocab = vocab
>>> vocab.dump_vocab(Path("vocab.txt"))
>>> print("vocabs:")
>>> pprint.pprint(vocab.i2w)
vocabs:
{0: '<pad>',
 1: '<s>',
 2: '</s>',
 3: '<unk>',
 4: '<mask>',
 5: 'を',
 6: '。',
 7: '銀座',
 8: 'で',
 9: 'ランチ',
 10: 'ご',
 11: '一緒',
 12: 'し',
 13: 'ましょう',
 14: '締切',
 15: '間に合い',
 16: 'ます',
 17: 'か',
 18: '?',
 19: 'トークナイザ',
 20: '作り',
 21: 'まし',
 22: 'た'}
>>> print("tokenized sentence")
>>> pprint.pprint(tokenizer.tokenize_list(sentences))
[['銀座', 'で', 'ランチ', 'を', 'ご', '一緒', 'し', 'ましょう', '。'],
 ['締切', '間に合い', 'ます', 'か', '?'],
 ['トークナイザ', 'を', '作り', 'まし', 'た', '。']]
>>> print("encoded sentence")
>>> pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
[[7, 8, 9, 5, 10, 11, 12, 13, 6],
 [14, 15, 16, 17, 18],
 [19, 5, 20, 21, 22, 6]]
>>> encodes = []
>>> for sentence in sentences:
...     encodes.append(tokenizer.encode(sentence))
>>> print("decoded sentence")
>>> pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
['銀座 で ランチ を ご 一緒 し ましょう 。', '締切 間に合い ます か ?', 'トークナイザ を 作り まし た 。']
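The encoded sentences above have different lengths. The sketch below is an illustration only, not part of meguru_tokenizer: it right-pads each id sequence with the '<pad>' id (0 in the vocabulary dump above) so the batch can be fed to a model as one fixed-length array. pad_batch is a hypothetical helper; tokenizer and sentences are the objects built in the example.

# Illustration only; assumes `tokenizer` and `sentences` from the example above
# and that id 0 is '<pad>' as shown in vocab.i2w.
from typing import List

def pad_batch(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every token-id sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

encoded = [tokenizer.encode(sentence) for sentence in sentences]
padded = pad_batch(encoded)
assert len({len(ids) for ids in padded}) == 1  # all rows now share one length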
decode(tokens: List[int])[source]

decode tokens into a sentence

Parameters

tokens (List[int]) – tokens

Returns

a sentence

Return type

str

Example

>>> tokenizer.decode([2, 3, 1, 4])
'おはよう ござい ます 。'
encode(sentence: str)[source]

encode a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

List[int]

Example

>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
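As a usage note, the sketch below is an illustration only and assumes the tokenizer and vocab built in the class-level example. Because decode() joins tokens with spaces, an encode/decode round trip recovers the tokenized form of the sentence rather than the original string.

# Illustration only; `tokenizer` carries the vocab built in the class-level example.
sentence = "銀座でランチをご一緒しましょう。"
ids = tokenizer.encode(sentence)
restored = tokenizer.decode(ids)   # '銀座 で ランチ を ご 一緒 し ましょう 。'
assert restored.split() == list(tokenizer.tokenize(sentence))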
languages = ['ja']
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

Tuple[str]

Example

>>> tokenizer.tokenize("おはようございます。おやすみなさい")
["おはよう", "ござい", "ます", "おやすみ", "なさい"]
tokenize_list(sentences: List[str])[source]

tokenize sentences

Parameters

sentences (List[str]) – sentences

Returns

list of tokens

Return type

List[Tuple[str]]

Example

>>> tokenizer.tokenize_list(["おはようございます", "こんにちは"])
[["おはよう", "ござい", "ます"], ["こんにちは"]]
vocab_size()[source]

vocabulary size

Returns

vocab_size

Return type

int
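As a usage note, the sketch below is an illustration only and assumes PyTorch is installed: it uses vocab_size() to size an embedding table, with padding_idx=0 matching the '<pad>' id shown in the class-level example. The embedding dimension is arbitrary.

# Illustration only; assumes PyTorch is available and `tokenizer.vocab` is built.
import torch.nn as nn

embedding = nn.Embedding(
    num_embeddings=tokenizer.vocab_size(),
    embedding_dim=128,   # hypothetical dimension
    padding_idx=0,       # '<pad>' id from the class-level example
)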