meguru_tokenizer.whitespace_tokenizer module
class meguru_tokenizer.whitespace_tokenizer.LooseWhitespaceTokenizer(vocab: Optional[meguru_tokenizer.vocab.Vocab] = None, normalize: bool = True, lower: bool = True, language: str = 'unk')[source]

    Bases: meguru_tokenizer.whitespace_tokenizer.WhitespaceTokenizer

    Tokenizer that splits on whitespace without using NLTK's tokenizer (see the sketch below).

    languages = ['en', 'de', 'ja']
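
    Since the loose variant skips NLTK, punctuation stays attached to the surrounding word. A minimal sketch of the idea, assuming the normalize and lower options correspond to NFKC normalization and str.lower() (assumptions, not confirmed by this page):

        # Hypothetical sketch; the real LooseWhitespaceTokenizer may differ.
        import unicodedata
        from typing import List

        def loose_tokenize(sentence: str, normalize: bool = True, lower: bool = True) -> List[str]:
            if normalize:
                # Assumed Unicode normalization form.
                sentence = unicodedata.normalize("NFKC", sentence)
            if lower:
                sentence = sentence.lower()
            # Plain str.split(): splits on runs of whitespace, no NLTK involved.
            return sentence.split()

        print(loose_tokenize("Tensorflow is awesome!"))
        # ['tensorflow', 'is', 'awesome!']  <- punctuation stays attached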
class meguru_tokenizer.whitespace_tokenizer.WhitespaceTokenizer(vocab: Optional[meguru_tokenizer.vocab.Vocab] = None, normalize: bool = True, lower: bool = True, language: str = 'unk')[source]

    Bases: meguru_tokenizer.base_tokenizer.Tokenizer

    Tokenizer that splits on whitespace.
    Example

    >>> import pprint
    >>> from pathlib import Path
    >>> from meguru_tokenizer.vocab import Vocab
    >>> tokenizer = WhitespaceTokenizer(lower=True, language="en")
    >>> sentences = [
    ...     "Hello, I don't know how to use it?",
    ...     "Tensorflow is awesome!",
    ...     "it is good framework.",
    ... ]
    >>> vocab = Vocab()
    >>> for sentence in sentences:
    ...     vocab.add_vocabs(tokenizer.tokenize(sentence))
    >>> vocab.build_vocab()
    >>> tokenizer.vocab = vocab
    >>> vocab.dump_vocab(Path("vocab.txt"))
    >>> print("vocabs:")
    >>> pprint.pprint(vocab.i2w)
    {0: '<pad>', 1: '<s>', 2: '</s>', 3: '<unk>', 4: '<mask>', 5: 'it',
     6: 'is', 7: 'hello', 8: ',', 9: 'i', 10: 'do', 11: "n't", 12: 'know',
     13: 'how', 14: 'to', 15: 'use', 16: '?', 17: 'tensorflow', 18: 'awesome',
     19: '!', 20: 'good', 21: 'framework', 22: '.'}
    >>> print("tokenized sentence")
    >>> pprint.pprint(tokenizer.tokenize_list(sentences))
    [['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
     ['tensorflow', 'is', 'awesome', '!'],
     ['it', 'is', 'good', 'framework', '.']]
    >>> print("encoded sentence")
    >>> pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
    [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
    >>> encodes = []
    >>> for sentence in sentences:
    ...     encodes.append(tokenizer.encode(sentence))
    >>> print("decoded sentence")
    >>> pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
    ["hello , i do n't know how to use it ?",
     'tensorflow is awesome !',
     'it is good framework .']
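
    The encoded sentences above have different lengths; for batching they usually need to be padded with the <pad> id, which is 0 in the vocabulary shown above. A small illustrative helper (not part of meguru_tokenizer):

        from typing import List

        def pad_batch(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
            # Right-pad every sequence to the length of the longest one.
            max_len = max(len(ids) for ids in batch)
            return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

        encoded = [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16],
                   [17, 6, 18, 19],
                   [5, 6, 20, 21, 22]]
        print(pad_batch(encoded))
        # [[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16],
        #  [17, 6, 18, 19, 0, 0, 0, 0, 0, 0, 0],
        #  [5, 6, 20, 21, 22, 0, 0, 0, 0, 0, 0]]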
    decode(tokens: List[int])[source]

        Decode tokens into a sentence.

        Parameters:
            tokens (List[int]) – tokens

        Returns:
            a sentence

        Return type:
            str

        Example

        >>> tokenizer.decode([2, 3, 1, 4])
        "おはようございます。"
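
        Judging from the class example, decoding is an id-to-word lookup followed by a space join. A minimal sketch, assuming the i2w mapping shown above (not necessarily the library's actual implementation):

            from typing import Dict, List

            def decode(tokens: List[int], i2w: Dict[int, str]) -> str:
                # Look each id up in the id-to-word table and join with single
                # spaces, matching the space-separated output shown above.
                return " ".join(i2w[t] for t in tokens)

            i2w = {5: 'it', 6: 'is', 20: 'good', 21: 'framework', 22: '.'}
            print(decode([5, 6, 20, 21, 22], i2w))
            # it is good framework .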
    encode(sentence: str)[source]

        Encode a sentence into tokens.

        Parameters:
            sentence (str) – a sentence

        Returns:
            tokens

        Return type:
            List[int]

        Example

        >>> tokenizer.encode("おはようございます。")
        [2, 3, 1, 4]
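
        Encoding presumably tokenizes the sentence and then maps each word through a word-to-id table, with an <unk> fallback (id 3 in the vocabulary above). The w2i table and the encode_words helper below are hypothetical, for illustration only:

            from typing import Dict, List

            def encode_words(words: List[str], w2i: Dict[str, int], unk_id: int = 3) -> List[int]:
                # Hypothetical helper: map each tokenized word to its id;
                # out-of-vocabulary words fall back to the <unk> id.
                return [w2i.get(word, unk_id) for word in words]

            w2i = {'it': 5, 'is': 6, 'good': 20, 'framework': 21, '.': 22}
            print(encode_words(['it', 'is', 'good', 'framework', '.'], w2i))
            # [5, 6, 20, 21, 22]
            print(encode_words(['it', 'is', 'great'], w2i))
            # [5, 6, 3]  <- 'great' is out of vocabulary, mapped to <unk>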
    languages = ['en', 'de']
    tokenize(sentence: str)[source]

        Tokenize a sentence into words.

        Parameters:
            sentence (str) – a sentence

        Returns:
            words

        Return type:
            List[str]

        Example

        >>> tokenizer.tokenize("おはようございます。")
        ["おはよう", "ござい", "ます", "。"]
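
        The class example splits "don't" into 'do' and "n't", which is characteristic of NLTK's word_tokenize (and the Loose variant above is explicitly "without NLTK tokenize"). A plausible sketch of this step under that assumption:

            import nltk
            from typing import List

            # word_tokenize needs NLTK's 'punkt' models: nltk.download('punkt')

            def tokenize(sentence: str, lower: bool = True) -> List[str]:
                if lower:
                    sentence = sentence.lower()
                # NLTK splits off punctuation and contractions: "don't" -> 'do', "n't"
                return nltk.word_tokenize(sentence)

            print(tokenize("Hello, I don't know how to use it?"))
            # ['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?']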