Batch encoding text pairs in HuggingFaceTokenizer.
Description
Currently, HuggingFaceTokenizer.batchEncode only supports batch encoding of arrays/lists of single text inputs, while text pair inputs are only supported by encode(String text, String textPair). To run batch inference on a set of text pairs, their encodings have to be padded to the same length.

The current "out-of-the-box" solution is to create a tokenizer with padding configured via .optPadToMaxLength(), as discussed in https://github.com/deepjavalibrary/djl/issues/1996#issuecomment-1243256576. This leads to unnecessarily high inference latency for batches of text pairs whose longest inputs tokenize to sequences notably shorter than maxLength.
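
For illustration, a minimal sketch of this workaround might look like the following. The model name, the sample inputs, and the builder options other than .optPadToMaxLength() (such as optTokenizerName and optMaxLength) are assumptions for the example, not a prescribed setup:

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

import java.io.IOException;
import java.util.List;

public class PadToMaxLengthWorkaround {

    public static void main(String[] args) throws IOException {
        // Illustrative question/context pairs (assumed inputs).
        List<String> questions =
                List.of("Where is DJL developed?", "Which language is DJL written in?");
        List<String> contexts =
                List.of("DJL is an open-source library developed by AWS.",
                        "DJL is written in Java.");

        // Pad every encoding up to maxLength so the per-pair encodings all have
        // the same length and can be stacked into one batch tensor.
        try (HuggingFaceTokenizer tokenizer =
                HuggingFaceTokenizer.builder()
                        .optTokenizerName("bert-base-uncased")
                        .optMaxLength(512)
                        .optPadToMaxLength()
                        .build()) {

            long[][] batchIds = new long[questions.size()][];
            long[][] batchAttention = new long[questions.size()][];
            for (int i = 0; i < questions.size(); i++) {
                // encode(text, textPair) is the only pair-aware entry point,
                // so the pairs have to be encoded one at a time.
                Encoding enc = tokenizer.encode(questions.get(i), contexts.get(i));
                batchIds[i] = enc.getIds();
                batchAttention[i] = enc.getAttentionMask();
            }
            // Every row is now padded to 512 tokens, even though these pairs
            // tokenize to far fewer tokens; this is the latency problem.
        }
    }
}
```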
An implementation of batchEncode for text pairs (for example, batchEncode(QAInput[] inputs) or batchEncode(String[] texts, String[] textPairs)) would resolve this when called on a tokenizer created with padding == PaddingStrategy.LONGEST.
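
A rough sketch of how such overloads could be declared is shown below. Neither method exists in DJL today; the interface name, the QAInput-based overload, and the Encoding[] return type are assumptions drawn from the examples above:

```java
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.modality.nlp.qa.QAInput;

/*
 * Hypothetical shape of the proposed API; none of these methods exist in DJL yet.
 * The semantics would mirror encode(String, String) applied batch-wise, with the
 * tokenizer's padding strategy (e.g. longest-in-batch) applied across the whole
 * batch rather than per pair.
 */
public interface TextPairBatchEncoder {

    /** Encodes aligned arrays of texts and text pairs as one padded batch. */
    Encoding[] batchEncode(String[] texts, String[] textPairs);

    /** Convenience overload for question answering inputs. */
    Encoding[] batchEncode(QAInput[] inputs);
}
```

With PaddingStrategy.LONGEST, each returned Encoding would only be padded to the length of the longest pair in that particular batch, so the model input width tracks the actual data rather than maxLength.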
This would benefit DJL users who want to optimize the latency of a QA system by batching inputs, which is efficient for processing a large list of short question/context pairs on GPUs.
References
- This option is implemented in the Python interface of HuggingFace's PreTrainedTokenizerFast: https://github.com/huggingface/transformers/blob/470799b3a67c6e078b9cb3a38dc0395a70e1a4a2/src/transformers/tokenization_utils.py#L671