meguru_tokenizer.sentencepiece_tokenizer module

class meguru_tokenizer.sentencepiece_tokenizer.SentencePieceTokenizer(normalize: bool = True, lower: bool = True, language: str = 'unk')[source]

Bases: meguru_tokenizer.base_tokenizer.Tokenizer

tokenizer that splits sentences with SentencePiece

Examples

>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> sentences = [
>>>     "Hello, I don't know how to use it?",
>>>     "Tensorflow is awesome!",
>>>     "it is good framework.",
>>> ]
>>> source_file = Path("test.txt")
>>> with source_file.open("w", encoding="utf-8") as f:
...     for s in sentences:
...         f.write(s + "\n")
>>> tokenizer.train_sp(source_file, vocab_size=37)
>>> print("vocabs:")
>>> with Path("m.vocab").open("r", encoding="utf-8") as f:
...     line = f.readline()
...     while line:
...         w, score = line.strip().split()
...         print(f"{w} {score}")
...         line = f.readline()
vocabs:
<pad> 0
<s> 0
</s> 0
<unk> 0
<mask> 0
▁ -1.85354
o -2.41476
...
>>> print("tokenized sentence")
>>> print(tokenizer.tokenize_list(sentences))
[['▁', 'h', 'e', 'l', 'lo', ',', '▁i', '▁', ...
>>> print("encoded sentence")
>>> print([tokenizer.encode(sentence) for sentence in sentences])
[[5, 31, 9, 22, 19, 25, 12, 5, 13, 6, 10, ...
>>> pretokens = [tokenizer.encode(sentence) for sentence in sentences]
>>> print("decode sentence")
>>> print([tokenizer.decode(tokens) for tokens in pretokens])
["hello, i don't know how to use it?", 'tensorflow is awesome!', 'it is good framework.']
>>> print("reload from dump file")
>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> tokenizer.load_sp_model("m")
>>> print("tokenized sentence")
>>> print(tokenizer.tokenize_list(sentences))
[['▁', 'h', 'e', 'l', 'lo', ',', '▁i', '▁', ...
>>> print("encoded sentence")
>>> print([tokenizer.encode(sentence) for sentence in sentences])
[[5, 31, 9, 22, 19, 25, 12, 5, 13, 6, 10, ...
>>> assert pretokens == [tokenizer.encode(sentence) for sentence in sentences]
decode(tokens: List[int])[source]

decode a sentence

Parameters

tokens (List[int]) – tokens

Returns

a sentence

Return type

str

Example

>>> tokenizer.decode([2, 3, 1, 4])
"おはようございます。"
encode(sentence: str)[source]

encode a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

List[int]

Example

>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
languages = ['ja', 'en', 'de']
load_sp_model(prefix: str)[source]

load a trained sentencepiece model from its dump files

Parameters

prefix (str) – model file prefix (e.g. "m" loads m.model)
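
Example

A minimal usage sketch, assuming m.model / m.vocab were previously dumped by train_sp with model_prefix="m" (as in the class example above); the round trip mirrors that example:

>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> tokenizer.load_sp_model("m")
>>> ids = tokenizer.encode("tensorflow is awesome!")
>>> tokenizer.decode(ids)
'tensorflow is awesome!'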
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

words

Return type

List[str]

Example

>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
tokenize_list(sentences: List[str])[source]

tokenize a list of sentences

Parameters

sentences (List[str]) – a list of sentences

Returns

list of word lists, one per sentence

Return type

List[List[str]]

Examples

>>> tokenizer.tokenize_list(["おはようございます。"])
[["おはよう", "ござい", "ます", "。"]]
train_sp(resource_file: str, model_prefix: str = 'm', vocab_size: int = 8000, character_coverage: float = 0.995, model_type='unigram', user_defined_symbols: Tuple[str] = '<mask>')[source]

train sentencepiece model

Parameters
  • resource_file (str) – text file used to train sentencepiece

  • model_prefix (str) – prefix of the dumped model files, e.g. "m" produces m.model and m.vocab

  • vocab_size (int) – vocabulary size, e.g. 8000, 16000

  • character_coverage (float) – character coverage in [0, 1], default 0.995

  • model_type (str) – [‘unigram’, ‘char’, ‘bpe’, ‘word’] ref. https://github.com/google/sentencepiece

  • user_defined_symbols (List[str]) – special tokens such as “<mask>”

Note

resource_file must contain one sentence per line. Pre-defined symbols: <unk>, <s>, </s>, <pad>
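
Example

A minimal training sketch, assuming a corpus file corpus.txt with one sentence per line; the parameter values below are illustrative:

>>> tokenizer = SentencePieceTokenizer(lower=True, language="en")
>>> tokenizer.train_sp(
...     "corpus.txt",              # one sentence per line
...     model_prefix="m",          # dumps m.model and m.vocab
...     vocab_size=8000,
...     character_coverage=0.9995,
...     model_type="unigram",
...     user_defined_symbols=["<mask>"],
... )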

vocab_size()[source]

vocabulary size

Returns

vocab_size

Return type

int
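
Example

A sketch assuming the vocab_size=37 model trained in the class example above; the exact value depends on the trained model:

>>> tokenizer.vocab_size()
37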

class meguru_tokenizer.sentencepiece_tokenizer.SentencePieceVocab(sp: sentencepiece.SentencePieceProcessor)[source]

Bases: meguru_tokenizer.vocab.BaseVocab

idx2word(idx: int)[source]

convert a token id into its surface piece (word)

word2idx(word: str)[source]

convert a surface piece (word) into its token id
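
Examples

A minimal sketch, assuming these methods wrap SentencePieceProcessor.piece_to_id / id_to_piece and that m.model was dumped by train_sp as in the class example:

>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file="m.model")
>>> vocab = SentencePieceVocab(sp)
>>> idx = vocab.word2idx("▁")  # piece -> id
>>> vocab.idx2word(idx)        # id -> piece
'▁'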