meguru_tokenizer.process.noise_tf module

class meguru_tokenizer.process.noise_tf.Noiser(vocab: meguru_tokenizer.vocab.BaseVocab)[source]

Bases: object

Noising per tokenized sentence

Note

x is an np.ndarray of shape [|S|,], where |S| is a variable sequence length

noisy(x: List[int], drop_prob: float, blank_prob: float, sub_prob: float, shuffle_dist: float)[source]

Add noise to a tokenized sentence by dropping, blanking, substituting, and slightly shuffling words

Parameters
  • x (List[int]) – list of word-index without any extra token (unk is ok)

  • drop_prob (float) – drop rate [0, 1)

  • blank_prob (float) – blank rate [0, 1)

  • sub_prob (float) – substitute rate [0, 1)

  • shuffle_dist (float) – maximum shuffle distance [0, inf)

Returns

[None, ] noised x

Return type

np.ndarray
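The individual operations are documented below; as a rough, self-contained sketch of how they compose (not the library's actual implementation — the unk token id and vocab size here are hypothetical stand-ins for values the Noiser reads from its vocab):

```python
import numpy as np

UNK_ID = 1        # hypothetical unk token id (the real Noiser gets this from its vocab)
VOCAB_SIZE = 100  # hypothetical vocabulary size

def noisy(x, drop_prob, blank_prob, sub_prob, shuffle_dist, rng=None):
    """Drop, blank, substitute, then slightly shuffle a word-index sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(x)
    # drop each word independently with probability drop_prob
    x = x[rng.random(len(x)) >= drop_prob]
    # blank: overwrite surviving words with UNK_ID with probability blank_prob
    x = np.where(rng.random(len(x)) < blank_prob, UNK_ID, x)
    # substitute: overwrite words with a random vocab id with probability sub_prob
    x = np.where(rng.random(len(x)) < sub_prob,
                 rng.integers(0, VOCAB_SIZE, len(x)), x)
    # shuffle so that each word moves at most shuffle_dist positions
    order = np.argsort(np.arange(len(x)) + rng.uniform(0, shuffle_dist + 1, len(x)))
    return x[order]

print(noisy([5, 8, 13, 21], drop_prob=0.0, blank_prob=0.0,
            sub_prob=0.0, shuffle_dist=0.0))  # all-zero noise returns x unchanged
```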

word_blank(x: numpy.ndarray, p: float)[source]

blank words with probability p

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • p (float) –

    blank rate

    if p is close to 1, the blank rate is high
    if p is close to 0, the blank rate is low

Returns

[None, ] blank array

Return type

np.ndarray
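A minimal NumPy sketch of this behavior (the blank token id below is a hypothetical stand-in for the id the Noiser's vocab actually provides):

```python
import numpy as np

BLANK_ID = 1  # hypothetical blank/unk token id

def word_blank(x: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Replace each word with BLANK_ID independently with probability p."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = x.copy()
    out[rng.random(len(x)) < p] = BLANK_ID
    return out

x = np.array([5, 8, 13, 21, 34])
print(word_blank(x, 0.0))  # p=0 blanks nothing
```

Note that, unlike word_drop, blanking preserves the sequence length.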

word_drop(x: numpy.ndarray, p: float)[source]

drop words with probability p

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • p (float) –

    drop rate

    if p is close to 1, the drop rate is high
    if p is close to 0, the drop rate is low

Returns

drop array [None, ]

Return type

np.ndarray
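A minimal sketch of per-word dropping, assuming each word is kept or removed independently (so the output may be shorter than the input):

```python
import numpy as np

def word_drop(x: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Remove each word independently with probability p."""
    if rng is None:
        rng = np.random.default_rng(0)
    return x[rng.random(len(x)) >= p]

x = np.array([5, 8, 13, 21, 34])
print(word_drop(x, 0.0))  # p=0 keeps every word
```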

word_shuffle(x: numpy.ndarray, k: float)[source]

slight shuffle such that |sigma[i] - i| <= k, i.e. each word moves at most k positions from its original index

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • k (float) –

    maximum shuffle distance [0, inf)

    if k is close to 0, words rarely move from their original positions

Returns

shuffled array [None, ]

Return type

np.ndarray
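One common way to realize a distance-bounded shuffle is to add uniform noise in [0, k+1) to each position index and sort by the result; a sketch under that assumption (not necessarily this library's exact construction):

```python
import numpy as np

def word_shuffle(x: np.ndarray, k: float, rng=None) -> np.ndarray:
    """Shuffle so each word moves at most ~k positions: sort by index + U(0, k+1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    order = np.argsort(np.arange(len(x)) + rng.uniform(0, k + 1, len(x)))
    return x[order]

x = np.array([5, 8, 13, 21, 34])
print(word_shuffle(x, 0))  # k=0 preserves the original order
```

With k = 0 the perturbed keys i + U(0, 1) are still strictly increasing, so the order is unchanged; larger k allows larger local displacements while never reordering words more than k positions apart.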

word_substitute(x: numpy.ndarray, p: float)[source]

substitute words with probability p

Parameters
  • x (np.ndarray) – [None, ] encoded sentence

  • p (float) –

    substitute rate

    if p is close to 1, the substitute rate is high
    if p is close to 0, the substitute rate is low

Returns

[None, ] substitute array

Return type

np.ndarray
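A minimal sketch of per-word substitution, drawing replacements uniformly from the vocabulary (the vocab size below is a hypothetical stand-in for what the Noiser's vocab provides):

```python
import numpy as np

VOCAB_SIZE = 100  # hypothetical vocabulary size

def word_substitute(x: np.ndarray, p: float, rng=None) -> np.ndarray:
    """Replace each word with a uniformly random vocab id with probability p."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(len(x)) < p
    out = x.copy()
    out[mask] = rng.integers(0, VOCAB_SIZE, mask.sum())
    return out

x = np.array([5, 8, 13, 21, 34])
print(word_substitute(x, 0.0))  # p=0 substitutes nothing
```

Like word_blank, substitution preserves the sequence length; only the identities of the masked words change.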