tokenize two text sequences
Is your feature request related to a problem? Please describe.
I am not sure if it is possible with the API. I would like to follow this example exactly in keras-nlp; the example is written with HF Transformers. There, we tokenize the question and context as follows:
tokenized_examples = tokenizer(
    examples["question"],
    examples["context"],
    truncation="only_second",
    max_length=max_length,
    stride=doc_stride,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
)
Is it possible to tokenize the question and context as above in keras-nlp? If not, what can be done instead to get the same effect?
Describe the solution you'd like
Describe alternatives you've considered
Additional context
cc @mattdangerw. Mentioning Matt, who is one of the authors of this example and probably the right person to ask.
Sorry I missed this. But for posterity: yes, this is well supported.
The easiest way to do this with our pre-trained modeling flows is to simply pass multiple sequences to a classifier object. https://keras.io/examples/nlp/semantic_similarity_with_keras_nlp/ shows an example.
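A minimal sketch of that approach (the preset name and toy data here are just illustrative):

import tensorflow as tf
import keras_nlp

# Hypothetical toy data; a question-answering setup would use a real dataset.
questions = ["What is Keras?", "What is TensorFlow?"]
contexts = ["Keras is a deep learning API.", "TensorFlow is an ML platform."]
labels = [1, 0]

# Each element is ((segment_1, segment_2), label); the classifier's built-in
# preprocessor packs both text segments into a single sequence with the
# appropriate special tokens and segment ids.
train_ds = tf.data.Dataset.from_tensor_slices(
    ((questions, contexts), labels)
).batch(2)

classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=2,
)
classifier.fit(train_ds, epochs=1)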
If you want to do your own preprocessing, you can combine any Tokenizer with a MultiSegmentPacker, which will pack two sequences into a single sequence.
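Roughly like this, assuming a BERT preset (sequence length and inputs are illustrative):

import keras_nlp

tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")

# "waterfall" truncation fills the length budget left to right, so the
# context segment gets trimmed first, roughly like HF's "only_second".
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=384,
    start_value=tokenizer.cls_token_id,
    end_value=tokenizer.sep_token_id,
    pad_value=tokenizer.pad_token_id,
    truncate="waterfall",
)

# Tokenize each segment separately, then pack them into one sequence.
token_ids, segment_ids = packer(
    (tokenizer("What is Keras?"), tokenizer("Keras is a deep learning API."))
)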
Or, for even more customizability, write your own preprocessing function on top of a Tokenizer with tf.data.
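For instance, a hypothetical preprocess function mapped over a dataset; the feature names here are chosen to match what the BERT backbone expects as inputs:

import tensorflow as tf
import keras_nlp

tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=384,
    start_value=tokenizer.cls_token_id,
    end_value=tokenizer.sep_token_id,
    pad_value=tokenizer.pad_token_id,
)

def preprocess(question, context):
    # Tokenize both segments, pack them, and build the model input dict.
    token_ids, segment_ids = packer((tokenizer(question), tokenizer(context)))
    return {
        "token_ids": token_ids,
        "segment_ids": segment_ids,
        "padding_mask": token_ids != tokenizer.pad_token_id,
    }

questions = ["What is Keras?"]
contexts = ["Keras is a deep learning API."]
ds = (
    tf.data.Dataset.from_tensor_slices((questions, contexts))
    .batch(8)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
)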