keras-nlp
Add an example to MLMMaskGenerator's docstring to show the best practice
MLMMaskGenerator is a helpful tool for generating masks for input sequences. The current docstring only has two basic examples, and it would be better for users if we provided an example of how it should be used in a real MLM workflow.
Basically, the example will:
- Prepare data: we can generate random strings or load a small dataset from TFDS (TensorFlow Datasets).
- Instantiate a tokenizer and call tokenize on the data.
- Apply `MLMMaskGenerator` on the tokenized data.
- Feed the masked data to a dummy MLM model by calling `model.fit()`.
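For concreteness, the masking step on its own could look roughly like this (a sketch with toy token ids and illustrative parameter values, not the final docstring code):

```python
import tensorflow as tf
import keras_nlp

# Toy "tokenized" batch: random token ids standing in for real tokenizer output.
token_ids = tf.random.uniform(shape=(2, 8), minval=1, maxval=100, dtype=tf.int64)

masker = keras_nlp.layers.MLMMaskGenerator(
    vocabulary_size=100,
    mask_selection_rate=0.15,
    mask_token_id=0,
    mask_selection_length=4,
)
masked = masker(token_ids)
# `masked` is a dict containing the masked token ids, the positions that were
# masked ("mask_positions"), and the original ids at those positions ("mask_ids").
```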
I would like to take this issue. Please assign me!
Awesome, thanks!
@chenmoneygithub Sorry for the late PR, I got stuck with college work. I am stuck on the dummy MLM model. Can you please help me out?
@adhadse Definitely! What is the problem you see?
The problem I am facing is that most MLM model implementations I have come across are based on BERT, so what are we referring to as a dummy MLM model?
- Just inputs and outputs with no internal workings?
- If it is for demo purposes, should it be a functional or a class-based model?
Got it. Basically, the purpose here is to showcase how the mask generator can fit into a workflow, so we can just use any model that runs. In fact, the MLM part is a bit complex because it requires some gathering. We can either 1) use pseudo-code that briefly summarizes what the training part looks like, or 2) use MaskedLMHead to show a real example.
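For reference, the "gathering" is roughly the following (an illustrative sketch, not keras-nlp internals): pull the encoder outputs out at the masked positions before projecting them to vocabulary logits.

```python
import tensorflow as tf

batch, seq_len, hidden, vocab = 2, 8, 16, 100

# Stand-ins for an encoder's outputs and the positions chosen by the mask generator.
encoded_tokens = tf.random.normal([batch, seq_len, hidden])
mask_positions = tf.constant([[1, 5], [2, 6]])  # (batch, masks_per_sequence)
mask_ids = tf.constant([[7, 3], [11, 9]])       # true token ids at those positions

# The "gathering": pull out the hidden states at the masked positions.
masked_hidden = tf.gather(encoded_tokens, mask_positions, batch_dims=1, axis=1)
# masked_hidden has shape (batch, masks_per_sequence, hidden).

# Project to the vocabulary and compute the MLM loss only on the masked positions.
logits = tf.keras.layers.Dense(vocab)(masked_hidden)
loss = tf.keras.losses.sparse_categorical_crossentropy(mask_ids, logits, from_logits=True)
```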
Hi @chenmoneygithub, @mattdangerw, this is a short and concise example demonstrating the use of MaskedLMMaskGenerator in a real MLM workflow.
```python
import tensorflow as tf
from tensorflow import keras
import keras_nlp

max_value_1 = 17
max_value_2 = 25
OOV_TOKEN = "<UNK>"

# Create random string data and a small vocabulary.
train_data = tf.strings.as_string(
    tf.random.uniform(shape=[3, 5], minval=1, maxval=max_value_1, dtype=tf.int64)
)
test_data = tf.strings.as_string(
    tf.random.uniform(shape=[3, 5], minval=1, maxval=max_value_2, dtype=tf.int64)
)
data = tf.concat([train_data, test_data], 0)
vocabulary = {str(i): i for i in range(1, max_value_1 + 1)}
vocabulary[OOV_TOKEN] = 0
vocab_size = len(vocabulary)

# Instantiate the tokenizer.
word_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=list(vocabulary.keys()),
    oov_token=OOV_TOKEN,
    lowercase=False,
    strip_accents=False,
    split=False,
)

# Use MaskedLMMaskGenerator on the tokenized data.
masker = keras_nlp.layers.MaskedLMMaskGenerator(
    vocabulary_size=vocab_size,
    mask_selection_rate=0.5,
    mask_token_id=0,
    mask_selection_length=1,
)
masked_output = masker(word_tokenizer(data))

# Pass the masked outputs to a MaskedLMHead layer and compute the MLM loss.
encoded_tokens = tf.random.normal([6, 5, 5])  # Random encodings standing in for an encoder's output.
mask_preds = keras_nlp.layers.MaskedLMHead(
    vocabulary_size=18,
    activation="softmax",
)(encoded_tokens, mask_positions=masked_output["mask_positions"])
loss = keras.losses.sparse_categorical_crossentropy(
    masked_output["mask_ids"], mask_preds
)
```
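If it helps to close the loop with `model.fit()`, the masked outputs could be fed to a dummy functional model along these lines (a self-contained sketch; the sizes, random placeholder data, and layer choices are only illustrative):

```python
import tensorflow as tf
from tensorflow import keras
import keras_nlp

VOCAB, SEQ_LEN, MASKS = 100, 8, 2

# Placeholder data standing in for tokenizer + MaskedLMMaskGenerator outputs.
token_ids = tf.random.uniform((4, SEQ_LEN), maxval=VOCAB, dtype=tf.int64)
mask_positions = tf.random.uniform((4, MASKS), maxval=SEQ_LEN, dtype=tf.int64)
mask_ids = tf.random.uniform((4, MASKS), maxval=VOCAB, dtype=tf.int64)

# Dummy MLM model: an embedding "encoder" plus MaskedLMHead over the masked positions.
inputs = {
    "token_ids": keras.Input(shape=(SEQ_LEN,), dtype="int64"),
    "mask_positions": keras.Input(shape=(MASKS,), dtype="int64"),
}
x = keras.layers.Embedding(VOCAB, 16)(inputs["token_ids"])
preds = keras_nlp.layers.MaskedLMHead(
    vocabulary_size=VOCAB, activation="softmax"
)(x, mask_positions=inputs["mask_positions"])

dummy_mlm = keras.Model(inputs, preds)
dummy_mlm.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
dummy_mlm.fit(
    {"token_ids": token_ids, "mask_positions": mask_positions}, mask_ids, epochs=1
)
```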
Let me know if any changes are required!
I can take this.