meguru_tokenizer.sentencepiece_tokenizer module
class meguru_tokenizer.sentencepiece_tokenizer.SentencePieceTokenizer(normalize: bool = True, lower: bool = True, language: str = 'unk')

Bases: meguru_tokenizer.base_tokenizer.Tokenizer
Tokenizer that splits text with SentencePiece.
Examples
>>> from pathlib import Path
>>> from meguru_tokenizer.sentencepiece_tokenizer import SentencePieceTokenizer
>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> sentences = [
...     "Hello, I don't know how to use it?",
...     "Tensorflow is awesome!",
...     "it is good framework.",
... ]
>>> source_file = Path("test.txt")
>>> with source_file.open("w", encoding="utf-8") as f:
...     for s in sentences:
...         f.write(s + "\n")
>>> tokenizer.train_sp(source_file, vocab_size=37)
>>> print("vocabs:")
>>> with Path("m.vocab").open("r", encoding="utf-8") as f:
...     line = f.readline()
...     while line:
...         w, idx = line.strip().split()
...         print(f"{w} {idx}")
...         line = f.readline()
vocabs:
<pad> 0
<s> 0
</s> 0
<unk> 0
<mask> 0
▁ -1.85354
o -2.41476
...
>>> print("tokenized sentence")
>>> print(tokenizer.tokenize_list(sentences))
[['▁', 'h', 'e', 'l', 'lo', ',', '▁i', '▁', ...
>>> print("encoded sentence")
>>> print([tokenizer.encode(sentence) for sentence in sentences])
[[5, 31, 9, 22, 19, 25, 12, 5, 13, 6, 10, ...
>>> pretokens = [tokenizer.encode(sentence) for sentence in sentences]
>>> print("decode sentence")
>>> print([tokenizer.decode(tokens) for tokens in pretokens])
["hello, i don't know how to use it?", 'tensorflow is awesome!', 'it is good framework.']
>>> print("reload from dump file")
>>> tokenizer = SentencePieceTokenizer(lower=True, language="ja")
>>> tokenizer.load_sp_model("m")
>>> print("tokenized sentence")
>>> print(tokenizer.tokenize_list(sentences))
[['▁', 'h', 'e', 'l', 'lo', ',', '▁i', '▁', ...
>>> print("encoded sentence")
>>> print([tokenizer.encode(sentence) for sentence in sentences])
[[5, 31, 9, 22, 19, 25, 12, 5, 13, 6, 10, ...
>>> assert pretokens == [tokenizer.encode(sentence) for sentence in sentences]
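As the example shows, training writes the SentencePiece model files under the chosen model_prefix (the default "m" yields m.model and m.vocab), so a fresh SentencePieceTokenizer can be restored with load_sp_model("m") and, per the final assert, reproduces the same encodings.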
decode(tokens: List[int])

Decode tokens back into a sentence.
- Parameters
tokens (List[int]) – token ids to decode
- Returns
a sentence
- Return type
str
Example
>>> tokenizer.decode([2, 3, 1, 4])
'おはようございます。'
encode(sentence: str)

Encode a sentence into token ids.
- Parameters
sentence (str) – a sentence to encode
- Returns
tokens
- Return type
List[int]
Example
>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
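A minimal round-trip sketch, assuming a tokenizer that has already been trained or loaded as in the class-level example (the sample sentence is taken from that example):

>>> ids = tokenizer.encode("it is good framework.")
>>> tokenizer.decode(ids)
'it is good framework.'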
languages = ['ja', 'en', 'de']
tokenize(sentence: str)

Tokenize a sentence.
- Parameters
sentence (str) – a sentence
- Returns
words
- Return type
List[str]
Example
>>> tokenizer.tokenize("おはようございます。")
['おはよう', 'ござい', 'ます', '。']
tokenize_list(sentences: List[str])

Tokenize a list of sentences.

- Parameters
sentences (List[str]) – sentence list
- Returns
list of token lists, one per sentence
- Return type
List[List[str]]
Examples
>>> tokenizer.tokenize_list(["おはようございます。"])
[['おはよう', 'ござい', 'ます', '。']]
train_sp(resource_file: str, model_prefix: str = 'm', vocab_size: int = 8000, character_coverage: float = 0.995, model_type='unigram', user_defined_symbols: Tuple[str] = '<mask>')

Train a SentencePiece model.
- Parameters
resource_file (str) – file for training sentencepiece
vocab_size (int) – vocabulary size, e.g. 8000, 16000
character_coverage (float) – character coverage in [0, 1] (default 0.995)
model_type (str) – [‘unigram’, ‘char’, ‘bpe’, ‘word’] ref. https://github.com/google/sentencepiece
user_defined_symbols (List[str]) – special tokens such as “<mask>”
Note
The resource file is expected to contain one sentence per line. Pre-defined symbols: <unk>, <s>, </s>, <pad>
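A minimal training sketch, assuming the same kind of toy corpus as the class-level example; the file name corpus.txt, the language choice, and the small vocab_size are illustrative only:

>>> from pathlib import Path
>>> from meguru_tokenizer.sentencepiece_tokenizer import SentencePieceTokenizer
>>> corpus = Path("corpus.txt")  # one sentence per line, as noted above
>>> _ = corpus.write_text(
...     "Hello, I don't know how to use it?\nTensorflow is awesome!\nit is good framework.\n",
...     encoding="utf-8",
... )
>>> tokenizer = SentencePieceTokenizer(lower=True, language="en")
>>> tokenizer.train_sp(corpus, vocab_size=37, model_type="unigram")
>>> # a later run can reload the trained model via tokenizer.load_sp_model("m")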