meguru_tokenizer.process.noise_tf module

class meguru_tokenizer.process.noise_tf.Noiser(vocab: meguru_tokenizer.vocab.BaseVocab)[source]

Bases: object

Noising per tokenized sentence

Note

x is an np.ndarray of shape [|S|,], where |S| is a variable sequence length

noisy(x: List[int], drop_prob: float, blank_prob: float, sub_prob: float, shuffle_dist: float)[source]

Add noise to a tokenized sentence by dropping, blanking, substituting, and slightly shuffling words

Parameters
  • x (List[int]) – list of word-index without any extra token (unk is ok)

  • drop_prob (float) – drop rate [0, 1)

  • blank_prob (float) – blank rate [0, 1)

  • sub_prob (float) – substitute rate [0, 1)

  • shuffle_dist (float) – maximum shuffle distance [0, inf)

Returns

[None, ] noised x

Return type

np.ndarray
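The individual operations are documented below; as a rough, self-contained sketch of how they compose (not the library's actual implementation — the unk token id and vocab size here are hypothetical stand-ins for values the Noiser reads from its vocab):

```python
import numpy as np

UNK_ID = 1        # hypothetical unk token id (the real Noiser gets this from its vocab)
VOCAB_SIZE = 100  # hypothetical vocabulary size

def noisy(x, drop_prob, blank_prob, sub_prob, shuffle_dist, rng=None):
    """Drop, blank, substitute, then slightly shuffle a word-index sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(x)
    # drop each word independently with probability drop_prob
    x = x[rng.random(len(x)) >= drop_prob]
    # blank: overwrite surviving words with UNK_ID with probability blank_prob
    x = np.where(rng.random(len(x)) < blank_prob, UNK_ID, x)
    # substitute: overwrite words with a random vocab id with probability sub_prob
    x = np.where(rng.random(len(x)) < sub_prob,
                 rng.integers(0, VOCAB_SIZE, len(x)), x)
    # shuffle so that each word moves at most shuffle_dist positions
    order = np.argsort(np.arange(len(x)) + rng.uniform(0, shuffle_dist + 1, len(x)))
    return x[order]

print(noisy([5, 8, 13, 21], drop_prob=0.0, blank_prob=0.0,
            sub_prob=0.0, shuffle_dist=0.0))  # all-zero noise returns x unchanged
```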

word_blank(x: numpy.ndarray, p: float)[source]

blank words with probability p

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • p (float) –

    blank rate

    if p is close to 1, the blank rate is high
    if p is close to 0, the blank rate is low

Returns

[None, ] blank array

Return type

np.ndarray
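A minimal NumPy sketch of this behavior (the blank token id below is a hypothetical stand-in for the id the Noiser's vocab actually provides):

```python
import numpy as np

BLANK_ID = 1  # hypothetical blank/unk token id

def word_blank(x: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Replace each word with BLANK_ID independently with probability p."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = x.copy()
    out[rng.random(len(x)) < p] = BLANK_ID
    return out

x = np.array([5, 8, 13, 21, 34])
print(word_blank(x, 0.0))  # p=0 blanks nothing
```

Note that, unlike word_drop, blanking preserves the sequence length.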

word_drop(x: numpy.ndarray, p: float)[source]

drop words with probability p

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • p (float) –

    drop rate

    if p is close to 1, the drop rate is high
    if p is close to 0, the drop rate is low

Returns

drop array [None, ]

Return type

np.ndarray
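A minimal sketch of per-word dropping, assuming each word is kept or removed independently (so the output may be shorter than the input):

```python
import numpy as np

def word_drop(x: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Remove each word independently with probability p."""
    if rng is None:
        rng = np.random.default_rng(0)
    return x[rng.random(len(x)) >= p]

x = np.array([5, 8, 13, 21, 34])
print(word_drop(x, 0.0))  # p=0 keeps every word
```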

word_shuffle(x: numpy.ndarray, k: float)[source]

slight shuffle such that |sigma[i] - i| <= k, i.e. each word moves at most k positions from its original index

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • k (float) –

    maximum shuffle distance [0, inf)

    if k is close to 0, words rarely move from their original positions

Returns

shuffled array [None, ]

Return type

np.ndarray
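One common way to realize a distance-bounded shuffle is to add uniform noise in [0, k+1) to each position index and sort by the result; a sketch under that assumption (not necessarily this library's exact construction):

```python
import numpy as np

def word_shuffle(x: np.ndarray, k: float, rng=None) -> np.ndarray:
    """Shuffle so each word moves at most ~k positions: sort by index + U(0, k+1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = np.argsort(np.arange(len(x)) + rng.uniform(0, k + 1, len(x)))
    return x[order]

x = np.array([5, 8, 13, 21, 34])
print(word_shuffle(x, 0))  # k=0 preserves the original order
```

With k = 0 the perturbed keys i + U(0, 1) are still strictly increasing, so the order is unchanged; larger k allows larger local displacements while never reordering words more than k positions apart.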

word_substitute(x: numpy.ndarray, p: float)[source]

substitute words with probability p

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • p (float) –

    substitute rate

    if p is close to 1, the substitute rate is high
    if p is close to 0, the substitute rate is low

Returns

[None, ] substitute array

Return type

np.ndarray
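A minimal sketch of per-word substitution, drawing replacements uniformly from the vocabulary (the vocab size below is a hypothetical stand-in for what the Noiser's vocab provides):

```python
import numpy as np

VOCAB_SIZE = 100  # hypothetical vocabulary size

def word_substitute(x: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Replace each word with a uniformly random vocab id with probability p."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(len(x)) < p
    out = x.copy()
    out[mask] = rng.integers(0, VOCAB_SIZE, mask.sum())
    return out

x = np.array([5, 8, 13, 21, 34])
print(word_substitute(x, 0.0))  # p=0 substitutes nothing
```

Like word_blank, substitution preserves the sequence length; only the identities of the masked words change.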