meguru_tokenizer.whitespace_tokenizer module

class meguru_tokenizer.whitespace_tokenizer.LooseWhitespaceTokenizer(vocab: Optional[meguru_tokenizer.vocab.Vocab] = None, normalize: bool = True, lower: bool = True, language: str = 'unk')[source]

Bases: meguru_tokenizer.whitespace_tokenizer.WhitespaceTokenizer

Tokenizer that splits sentences on whitespace (without NLTK tokenization)

languages = ['en', 'de', 'ja']
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

words

Return type

List[str]

Example

>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
class meguru_tokenizer.whitespace_tokenizer.WhitespaceTokenizer(vocab: Optional[meguru_tokenizer.vocab.Vocab] = None, normalize: bool = True, lower: bool = True, language: str = 'unk')[source]

Bases: meguru_tokenizer.base_tokenizer.Tokenizer

Tokenizer that splits sentences on whitespace (with NLTK tokenization)

Example

>>> import pprint
>>> tokenizer = WhitespaceTokenizer(lower=True, language="en")
>>> sentences = [
...     "Hello, I don't know how to use it?",
...     "Tensorflow is awesome!",
...     "it is good framework.",
... ]
>>> vocab = Vocab()
>>> for sentence in sentences:
...     vocab.add_vocabs(tokenizer.tokenize(sentence))
>>> vocab.build_vocab()
>>> tokenizer.vocab = vocab
>>> vocab.dump_vocab(Path("vocab.txt"))
>>> print("vocabs:")
>>> pprint.pprint(vocab.i2w)
{0: '<pad>',
 1: '<s>',
 2: '</s>',
 3: '<unk>',
 4: '<mask>',
 5: 'it',
 6: 'is',
 7: 'hello',
 8: ',',
 9: 'i',
 10: 'do',
 11: "n't",
 12: 'know',
 13: 'how',
 14: 'to',
 15: 'use',
 16: '?',
 17: 'tensorflow',
 18: 'awesome',
 19: '!',
 20: 'good',
 21: 'framework',
 22: '.'}
>>> print("tokenized sentence")
>>> pprint.pprint(tokenizer.tokenize_list(sentences))
[['hello', ',', 'i', 'do', "n't", 'know', 'how', 'to', 'use', 'it', '?'],
 ['tensorflow', 'is', 'awesome', '!'],
 ['it', 'is', 'good', 'framework', '.']]
>>> print("encoded sentence")
>>> pprint.pprint([tokenizer.encode(sentence) for sentence in sentences])
[[7, 8, 9, 10, 11, 12, 13, 14, 15, 5, 16], [17, 6, 18, 19], [5, 6, 20, 21, 22]]
>>> encodes = []
>>> for sentence in sentences:
...     encodes.append(tokenizer.encode(sentence))
>>> print("decoded sentence")
>>> pprint.pprint([tokenizer.decode(tokens) for tokens in encodes])
["hello , i do n't know how to use it ?",
 'tensorflow is awesome !',
 'it is good framework .']
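
Out-of-vocabulary words are presumably mapped to the <unk> index (3 in the vocabulary above); the call below is an illustrative assumption rather than verified output:

>>> # "pytorch" is not in the vocabulary built above, so it is
>>> # assumed to encode as <unk> (index 3)
>>> tokenizer.encode("pytorch is awesome !")
[3, 6, 18, 19]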
decode(tokens: List[int])[source]

decode a sentence

Parameters

tokens (List[int]) – tokens

Returns

a sentence

Return type

str

Example

>>> tokenizer.decode([2, 3, 1, 4])
"おはよう ござい ます 。"
encode(sentence: str)[source]

encode a sentence

Parameters

sentence (str) – a sentence

Returns

tokens

Return type

List[int]

Example

>>> tokenizer.encode("おはようございます。")
[2, 3, 1, 4]
languages = ['en', 'de']
tokenize(sentence: str)[source]

tokenize a sentence

Parameters

sentence (str) – a sentence

Returns

words

Return type

List[str]

Example

>>> tokenizer.tokenize("おはようございます。")
["おはよう", "ござい", "ます", "。"]
tokenize_list(sentences: List[str])[source]

tokenize a list of sentences

Parameters

sentences (List[str]) – sentence list

Returns

list of listed words

Return type

List[List[str]]

Example

>>> tokenizer.tokenize_list(["おはようございます。"])
[["おはよう", "ござい", "ます", "。"]]
vocab_size()[source]

vocabulary size

Returns

vocab_size

Return type

int
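
For example, with the vocabulary built in the class-level example above (assuming vocab_size() simply counts the 23 vocabulary entries, special tokens included):

>>> tokenizer.vocab_size()  # 23 entries in the example vocabulary above
23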