
Tokenize two text sequences

Open pure-rgb opened this issue 1 year ago • 1 comment

Is your feature request related to a problem? Please describe.

I am not sure if this is possible with the API. I would like to follow this example exactly in keras-nlp; the original is written with HF Transformers. There, the question and context are tokenized together as follows:

# From the HF question-answering example; `tokenizer`, `examples`,
# `max_length`, and `doc_stride` are defined earlier in that example.
tokenized_examples = tokenizer(
    examples["question"],
    examples["context"],
    truncation="only_second",        # truncate only the context (second segment)
    max_length=max_length,
    stride=doc_stride,               # overlap between overflowing windows
    return_overflowing_tokens=True,  # split long contexts into multiple windows
    return_offsets_mapping=True,     # character offsets for locating answer spans
    padding="max_length",
)

Is it possible to tokenize a question and context together like this in keras-nlp? Otherwise, what can be done instead to get the same effect?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

pure-rgb avatar Jan 21 '24 20:01 pure-rgb

cc @mattdangerw. Mentioning Matt, who is one of the authors of this example and probably the right person to ask.

pure-rgb avatar Jan 21 '24 20:01 pure-rgb

Sorry I missed this. But for posterity: yes, this is well supported.

The easiest way to do this with our pre-trained modeling flows is to just pass multiple sequences to a classifier object. https://keras.io/examples/nlp/semantic_similarity_with_keras_nlp/ shows an example.
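Roughly like this (a minimal sketch along the lines of that guide; the preset name, batch size, and the questions/contexts/labels arrays are placeholders, not from the original example):

import keras_nlp
import tensorflow as tf

# Placeholder data; pairs of sequences are passed together as a tuple.
questions = ["What is Keras?", "Who wrote Keras?"]
contexts = ["Keras is a deep learning API.", "Keras was created by F. Chollet."]
labels = [1, 0]

train_ds = tf.data.Dataset.from_tensor_slices(
    ((questions, contexts), labels)
).batch(2)

# The classifier's built-in preprocessor tokenizes both segments and
# packs them into a single input with segment ids.
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=2
)
classifier.fit(train_ds)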

If you want to do your own preprocessing, you can combine any Tokenizer with a MultiSegmentPacker, which will combine two sequences into a single sequence.
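For instance (a minimal sketch; the BERT preset name and special-token attributes are assumptions, and note this truncates to a single window rather than returning overflowing windows the way the HF call above does):

import keras_nlp

tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=384,
    start_value=tokenizer.cls_token_id,  # [CLS] at the start
    end_value=tokenizer.sep_token_id,    # [SEP] after each segment
)

# Pack question + context into one sequence; segment_ids is 0 for the
# question tokens and 1 for the context tokens.
token_ids, segment_ids = packer(
    (tokenizer("What is Keras?"), tokenizer("Keras is a deep learning API."))
)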

Or, for something even more customizable, write your own preprocessing function on top of a Tokenizer with tf.data.
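Something like the following sketch (assumptions: the same BERT preset as above, hypothetical in-memory question/context lists, and a pad token id used to build the padding mask):

import keras_nlp
import tensorflow as tf

# Hypothetical data; swap in your own question/context pairs.
questions = ["What is Keras?"]
contexts = ["Keras is a deep learning API written in Python."]

tokenizer = keras_nlp.models.BertTokenizer.from_preset("bert_base_en_uncased")
packer = keras_nlp.layers.MultiSegmentPacker(
    sequence_length=384,
    start_value=tokenizer.cls_token_id,
    end_value=tokenizer.sep_token_id,
    pad_value=tokenizer.pad_token_id,
)

def preprocess(question, context):
    # Tokenize each segment separately, then pack them into one sequence.
    token_ids, segment_ids = packer((tokenizer(question), tokenizer(context)))
    return {
        "token_ids": token_ids,
        "segment_ids": segment_ids,
        "padding_mask": token_ids != tokenizer.pad_token_id,
    }

ds = tf.data.Dataset.from_tensor_slices((questions, contexts))
ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).batch(2)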

mattdangerw avatar Mar 29 '24 19:03 mattdangerw