meguru_tokenizer.process.noise_tf module¶
class meguru_tokenizer.process.noise_tf.Noiser(vocab: meguru_tokenizer.vocab.BaseVocab)[source]¶ Bases: object
Noising per tokenized sentence
Note
x is an np.array of shape [|S|,], where |S| is a variable sequence length
noisy(x: List[int], drop_prob: float, blank_prob: float, sub_prob: float, shuffle_dist: float)[source]¶ Add noise to a tokenized sentence.
- Parameters
x (List[int]) – list of word indices without any extra tokens (unk is allowed)
drop_prob (float) – drop rate in [0, 1)
blank_prob (float) – blank rate in [0, 1)
sub_prob (float) – substitution rate in [0, 1)
shuffle_dist (float) – shuffle distance in [0, inf)
- Returns
[None, ] noised x
- Return type
np.ndarray
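The library's implementation is not shown here, so the following is only a minimal numpy sketch of what such a pipeline typically does: drop, blank, substitute, then locally shuffle. The token ids (UNK_ID, BLANK_ID) and VOCAB_SIZE are hypothetical placeholders, and the shuffle uses the common "add noise to positions and argsort" trick, which may differ from Noiser.noisy.

```python
import numpy as np

UNK_ID = 1        # hypothetical <unk> token id
BLANK_ID = 3      # hypothetical <blank> token id
VOCAB_SIZE = 100  # hypothetical vocabulary size

def noisy(x, drop_prob, blank_prob, sub_prob, shuffle_dist, seed=0):
    """Illustrative sketch of the four noising steps; not the library's code."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)

    # word drop: remove each token independently with probability drop_prob
    keep = rng.random(len(x)) >= drop_prob
    if not keep.any():
        keep[0] = True  # never emit an empty sentence
    x = x[keep]

    # word blank: replace tokens with BLANK_ID with probability blank_prob
    blank = rng.random(len(x)) < blank_prob
    x = np.where(blank, BLANK_ID, x)

    # word substitute: swap tokens for a random vocabulary id with probability sub_prob
    sub = rng.random(len(x)) < sub_prob
    x = np.where(sub, rng.integers(0, VOCAB_SIZE, len(x)), x)

    # word shuffle: each token moves at most ~shuffle_dist positions
    order = np.argsort(np.arange(len(x)) + rng.random(len(x)) * shuffle_dist)
    return x[order]

print(noisy([4, 8, 15, 16, 23, 42], 0.1, 0.1, 0.1, 3.0))
```

With all rates set to 0 and shuffle_dist 0 the sketch returns the input unchanged, which is a useful sanity check when tuning the noise parameters.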
word_blank(x: numpy.ndarray, p: float)[source]¶ Blank words with probability p.
- Parameters
x (np.ndarray) – [None, ] encoded sentence
p (float) – blank rate; as p approaches 1 the blank rate is high, and as p approaches 0 the blank rate is low
- Returns
[None, ] blanked array
- Return type
np.ndarray
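A minimal sketch of word blanking, assuming a hypothetical BLANK_ID for the blank token (the library's actual id comes from its vocab). Note the output keeps the input's length; tokens are masked in place, not removed.

```python
import numpy as np

BLANK_ID = 3  # hypothetical <blank> token id

def word_blank(x: np.ndarray, p: float, seed: int = 0) -> np.ndarray:
    """Replace each token with BLANK_ID independently with probability p."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(x)) < p  # True where the token is blanked
    return np.where(mask, BLANK_ID, x)

x = np.array([4, 8, 15, 16, 23, 42])
print(word_blank(x, 0.3))  # same length, some entries replaced by BLANK_ID
```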
word_drop(x: numpy.ndarray, p: float)[source]¶ Drop words with probability p.
- Parameters
x (np.ndarray) – [None, ] encoded sentence
p (float) – drop rate; as p approaches 1 the drop rate is high, and as p approaches 0 the drop rate is low
- Returns
[None, ] dropped array
- Return type
np.ndarray
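A minimal sketch of word dropping, written as a standalone numpy function for illustration; the guard that keeps at least one token is an assumption, since the library's behavior on fully-dropped sentences is not documented. Unlike blanking, dropping shortens the sequence.

```python
import numpy as np

def word_drop(x: np.ndarray, p: float, seed: int = 0) -> np.ndarray:
    """Remove each token independently with probability p (output shrinks)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(x)) >= p  # True where the token survives
    if not keep.any():
        keep[0] = True  # assumption: never return an empty sentence
    return x[keep]

x = np.array([4, 8, 15, 16, 23, 42])
print(word_drop(x, 0.3))  # possibly shorter, values drawn from x
```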