meguru_tokenizer.sentencepiece_tokenizer module

class meguru_tokenizer.sentencepiece_tokenizer.SentencePieceTokenizer(normalize: bool = True, lower: bool = True, language: str = 'unk')[source]

Bases: meguru_tokenizer.base_tokenizer.Tokenizer

tokenizer that splits sentences with SentencePiece

Examples

>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> sentences = [
>>>     "Hello, I don't know how to use it?",
>>>     "Tensorflow is awesome!",
>>>     "it is good framework.",
>>> ]
>>> source_file = Path("test.txt")
>>> with source_file.open("w", encoding="utf-8") as f:
...     for s in sentences:
...         f.write(s + "\n")
>>> tokenizer.train_sp(source_file, vocab_size=37)
>>> print("vocabs:")
>>> with Path("m.vocab").open("r", encoding="utf-8") as f:
...     line = f.readline()
...     while line:
...         w, score = line.strip().split()
...         print(f"{w} {score}")
...         line = f.readline()
vocabs:
<pad> 0
<s> 0
</s> 0
<unk> 0
<mask> 0
▁ -1.85354
o -2.41476
...
>>> print("tokenized sentence")
>>> print(tokenizer.tokenize_list(sentences))
[['▁', 'h', 'e', 'l', 'lo', ',', '▁i', '▁', ...
>>> print("encoded sentence")
>>> print([tokenizer.encode(sentence) for sentence in sentences])
[[5, 31, 9, 22, 19, 25, 12, 5, 13, 6, 10, ...
>>> pretokens = [tokenizer.encode(sentence) for sentence in sentences]
>>> print("decode sentence")
>>> print([tokenizer.decode(tokens) for tokens in pretokens])
["hello, i don't know how to use it?", 'tensorflow is awesome!', 'it is good framework.']
>>> print("reload from dump file")
>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> tokenizer.load_sp_model("m")
>>> print("tokenized sentence")
>>> print(tokenizer.tokenize_list(sentences))
[['▁', 'h', 'e', 'l', 'lo', ',', '▁i', '▁', ...
>>> print("encoded sentence")
>>> print([tokenizer.encode(sentence) for sentence in sentences])
[[5, 31, 9, 22, 19, 25, 12, 5, 13, 6, 10, ...
>>> assert pretokens == [tokenizer.encode(sentence) for sentence in sentences]
decode(tokens: List[int])[source]

decode a sentence

Parameters

tokens (List[int]) – tokens

Returns

a sentence

Return type

str

Example

>>> tokenizer.decode([2, 3, 1, 4])
"おはようございます。"
encode(sentence: str)[source]

encode a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

List[int]

Example

>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
languages = ['ja', 'en', 'de']
load_sp_model(prefix: str)[source]

load a trained sentencepiece model from its dump files

Parameters

prefix (str) – model file prefix (e.g. "m" loads m.model)
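
Example

A minimal usage sketch, assuming m.model / m.vocab were previously dumped by train_sp with model_prefix="m" (as in the class example above); the round trip mirrors that example:

>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> tokenizer.load_sp_model("m")
>>> ids = tokenizer.encode("tensorflow is awesome!")
>>> tokenizer.decode(ids)
'tensorflow is awesome!'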
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

words

Return type

List[str]

Example

>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
tokenize_list(sentences: List[str])[source]

tokenize a list of sentences

Parameters

sentences (List[str]) – a list of sentences

Returns

list of word lists, one per sentence

Return type

List[List[str]]

Examples

>>> tokenizer.tokenize_list(["おはようございます。"])
[["おはよう", "ござい", "ます", "。"]]
train_sp(resource_file: str, model_prefix: str = 'm', vocab_size: int = 8000, character_coverage: float = 0.995, model_type='unigram', user_defined_symbols: Tuple[str] = '<mask>')[source]

train sentencepiece model

Parameters
  • resource_file (str) – text file used to train sentencepiece

  • model_prefix (str) – prefix of the dumped model files, e.g. "m" produces m.model and m.vocab

  • vocab_size (int) – vocabulary size, e.g. 8000, 16000

  • character_coverage (float) – character coverage in [0, 1], default 0.995

  • model_type (str) – [‘unigram’, ‘char’, ‘bpe’, ‘word’] ref. https://github.com/google/sentencepiece

  • user_defined_symbols (List[str]) – special tokens such as “<mask>”

Note

resource_file must contain one sentence per line. Pre-defined symbols: <unk>, <s>, </s>, <pad>
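
Example

A minimal training sketch, assuming a corpus file corpus.txt with one sentence per line; the parameter values below are illustrative:

>>> tokenizer = SentencePieceTokenizer(lower=True, language="en")
>>> tokenizer.train_sp(
...     "corpus.txt",              # one sentence per line
...     model_prefix="m",          # dumps m.model and m.vocab
...     vocab_size=8000,
...     character_coverage=0.9995,
...     model_type="unigram",
...     user_defined_symbols=["<mask>"],
... )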

vocab_size()[source]

vocabulary size

Returns

vocab_size

Return type

int
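
Example

A sketch assuming the vocab_size=37 model trained in the class example above; the exact value depends on the trained model:

>>> tokenizer.vocab_size()
37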

class meguru_tokenizer.sentencepiece_tokenizer.SentencePieceVocab(sp: sentencepiece.SentencePieceProcessor)[source]

Bases: meguru_tokenizer.vocab.BaseVocab

idx2word(idx: int)[source]

convert a token id into its surface piece (word)

word2idx(word: str)[source]

convert a surface piece (word) into its token id
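
Examples

A minimal sketch, assuming these methods wrap SentencePieceProcessor.piece_to_id / id_to_piece and that m.model was dumped by train_sp as in the class example:

>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file="m.model")
>>> vocab = SentencePieceVocab(sp)
>>> idx = vocab.word2idx("▁")  # piece -> id
>>> vocab.idx2word(idx)        # id -> piece
'▁'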