
Add an example to MLMMaskGenerator's docstring to show the best practice

Open chenmoneygithub opened this issue 2 years ago • 6 comments

MLMMaskGenerator is a helpful tool for generating masks for input sequences. The current docstring only has two basic examples; it would be better for users if we provided an example of how it should be used in a real MLM workflow.

Basically the example will have the following steps (a rough end-to-end sketch follows the list):

  1. Prepare data: generate random strings or load a small dataset from TFDS (TensorFlow Datasets).
  2. Instantiate a tokenizer and call tokenize on the data.
  3. Apply MLMMaskGenerator to the tokenized data.
  4. Feed the masked data to a dummy MLM model by calling model.fit().
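
For reference, a rough, runnable sketch of these four steps using only keras_nlp layers could look like the following. It stands in random token ids for steps 1-2 (no real text or tokenizer), uses the current layer names (MaskedLMMaskGenerator and MaskedLMHead; older releases call them MLMMaskGenerator and MLMHead), and all sizes are arbitrary demo values, not recommendations:

import tensorflow as tf
from tensorflow import keras
import keras_nlp

VOCAB_SIZE, SEQ_LEN, NUM_MASKS = 100, 16, 4

# Steps 1-2: stand-in for tokenized text, i.e. a batch of random token ids.
token_ids = tf.random.uniform([64, SEQ_LEN], maxval=VOCAB_SIZE, dtype=tf.int64)

# Step 3: mask out tokens for the MLM objective.
masker = keras_nlp.layers.MaskedLMMaskGenerator(
    vocabulary_size=VOCAB_SIZE,
    mask_selection_rate=0.15,
    mask_token_id=0,
    mask_selection_length=NUM_MASKS,
)
masked = masker(token_ids)

# Step 4: a dummy MLM model that predicts the ids of the masked tokens.
inputs = {
    "token_ids": keras.Input([SEQ_LEN], dtype="int64"),
    "mask_positions": keras.Input([NUM_MASKS], dtype="int64"),
}
x = keras_nlp.layers.TokenAndPositionEmbedding(VOCAB_SIZE, SEQ_LEN, 32)(
    inputs["token_ids"])
x = keras_nlp.layers.TransformerEncoder(intermediate_dim=64, num_heads=2)(x)
outputs = keras_nlp.layers.MaskedLMHead(
    vocabulary_size=VOCAB_SIZE, activation="softmax")(
    x, mask_positions=inputs["mask_positions"])

model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(
    x={"token_ids": masked["token_ids"],
       "mask_positions": masked["mask_positions"]},
    y=masked["mask_ids"],
    sample_weight=masked["mask_weights"],  # ignore padded mask slots
    batch_size=32,
)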

chenmoneygithub avatar Apr 15 '22 20:04 chenmoneygithub

I would like to take this issue. Please assign me!

adhadse avatar Apr 16 '22 04:04 adhadse

Awesome, thanks!

chenmoneygithub avatar Apr 16 '22 04:04 chenmoneygithub

@chenmoneygithub Sorry for the late PR, I got stuck with college work. I am stuck on the dummy MLM model. Can you please help me out?

adhadse avatar Apr 28 '22 11:04 adhadse

@adhadse Definitely! What is the problem you see?

chenmoneygithub avatar Apr 29 '22 00:04 chenmoneygithub

The problem I am facing is that most MLM model implementations I have seen are built on BERT, so what do we mean by a dummy MLM model?

  • Just inputs and outputs with no internal workings?
  • If it is for demo purposes, should it be a functional or subclassed model?

adhadse avatar Apr 29 '22 03:04 adhadse

Got it. Basically the purpose here is to showcase how the mask generator fits into a workflow, so we can use any model that runs. In fact, the MLM part is a bit complex because it requires some gathering (a sketch of the gathering step is below). We can either 1) use pseudo-code that briefly summarizes what the training part looks like, or 2) use MLMHead to show a real example.
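
For concreteness, here is a minimal sketch of that gathering step (all shapes and values here are hypothetical; `encoded` stands in for the output of any encoder):

import tensorflow as tf

batch, seq_len, dim = 4, 16, 32
encoded = tf.random.normal([batch, seq_len, dim])               # encoder outputs
mask_positions = tf.constant([[1, 5], [0, 3], [2, 2], [7, 9]])  # [batch, num_masks]

# Gather the encoded vectors at the masked positions; projecting these to
# vocabulary logits and taking a sparse cross-entropy against the true mask
# ids completes the MLM training step.
masked_vectors = tf.gather(encoded, mask_positions, batch_dims=1)  # [batch, num_masks, dim]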

chenmoneygithub avatar Apr 29 '22 05:04 chenmoneygithub

Hi @chenmoneygithub, @mattdangerw, here is a short and concise example demonstrating the use of MaskedLMMaskGenerator in a real MLM workflow.

import tensorflow as tf
from tensorflow import keras
import keras_nlp

max_value_1 = 17
max_value_2 = 25
OOV_TOKEN = "<UNK>"

Creating random data and vocabulary

train_data = tf.strings.as_string(
    tf.random.uniform(shape=[3, 5], minval=1, maxval=max_value_1, dtype=tf.int64))
test_data = tf.strings.as_string(
    tf.random.uniform(shape=[3, 5], minval=1, maxval=max_value_2, dtype=tf.int64))
# Join each row into one space-separated "sentence" so the tokenizer
# receives a rank-1 batch of strings.
data = tf.strings.reduce_join(
    tf.concat([train_data, test_data], 0), axis=-1, separator=" ")

# Token ids are list indices; OOV sits at id 0 so it can double as the mask token.
vocabulary = [OOV_TOKEN] + [str(i) for i in range(1, max_value_1 + 1)]
vocab_size = len(vocabulary)

Instantiating tokenizer

word_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocabulary,
    oov_token=OOV_TOKEN,
    lowercase=False,
    strip_accents=False,
)

Using MaskedLMMaskGenerator

masker = keras_nlp.layers.MaskedLMMaskGenerator(
    vocabulary_size=vocab_size,
    mask_selection_rate=0.5,
    mask_token_id=0,  # id of OOV_TOKEN, reused here as the mask token
    mask_selection_length=1,
)

masked_output = masker(word_tokenizer(data))
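
masked_output is a dict with four entries: token_ids (the inputs with masks applied), mask_positions, mask_ids (the original ids of the masked tokens), and mask_weights (1 for real masks, 0 for padding).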

Passing the masked outputs to MaskedLMHead layer

encoded_tokens = tf.random.normal([6, 5, 5])  # random encodings standing in for a real backbone: [batch, seq_len, hidden]
mask_preds = keras_nlp.layers.MaskedLMHead(vocabulary_size=vocab_size, activation="softmax")(
    encoded_tokens, mask_positions=masked_output["mask_positions"])
loss = keras.losses.sparse_categorical_crossentropy(masked_output["mask_ids"], mask_preds)
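
If desired, the mask_weights entry from the same output dict can zero out the loss at padded mask slots (sequences where fewer than mask_selection_length tokens were selected):

loss = loss * tf.cast(masked_output["mask_weights"], loss.dtype)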

Let me know the required changes if any!

prajakta-1527 avatar Mar 14 '23 11:03 prajakta-1527

I can take this.

abuelnasr0 avatar Mar 24 '23 10:03 abuelnasr0