
Batch Tokenization Support

Open tjwald opened this issue 10 months ago • 3 comments

Is your feature request related to a problem? Please describe.
Most AI systems use batching for performance reasons, which requires padding all tokenized sentences to the same length and outputting a mask indicating which values are padding. In my project I had to implement this myself. The issues are mostly performance and API compatibility with the ecosystem. With my solution there are megabytes of allocations:

[Image: memory allocation profile]

The Int64[] allocations come from the widening required because the ONNX model needs an Int64 tensor as input. The Int32[] allocations are the actual tokens. The string allocations are token strings that are never used and are thrown away. The other allocations are internal, and I don't know what they are.
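For reference, a minimal sketch of the kind of batching loop that produces these allocations (assuming Microsoft.ML.Tokenizers' Tokenizer.EncodeToIds and an ONNX model with Int64 inputs; the helper and its names are illustrative, not library code):

using Microsoft.ML.Tokenizers;

static (long[] InputIds, long[] AttentionMask) BatchEncode(
    Tokenizer tokenizer, IReadOnlyList<string> texts, int maxTokenCount, long padTokenId)
{
    // One id list per sentence (the Int32[] allocations), plus token-string
    // allocations inside the tokenizer that are thrown away.
    var encoded = new List<IReadOnlyList<int>>(texts.Count);
    foreach (var text in texts)
        encoded.Add(tokenizer.EncodeToIds(text));

    // Widen int -> long because the ONNX input tensor is Int64
    // (the Int64[] allocations in the profile above).
    var inputIds = new long[texts.Count * maxTokenCount];
    var attentionMask = new long[texts.Count * maxTokenCount];
    for (int row = 0; row < encoded.Count; row++)
    {
        var ids = encoded[row];
        for (int col = 0; col < maxTokenCount; col++)
        {
            bool isPad = col >= ids.Count;
            inputIds[row * maxTokenCount + col] = isPad ? padTokenId : ids[col];
            attentionMask[row * maxTokenCount + col] = isPad ? 0 : 1;
        }
    }
    return (inputIds, attentionMask);
}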

Describe the solution you'd like
Enable a zero-allocation solution via an API like the following:

class Tokenizer
{
     ...
     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask)
              where T : INumber<T>;

     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask, Tensor<T> tokenTypeIds)
              where T : INumber<T>;
}

Maybe instead of Tensor<T> you want to use TensorSpan<T>?

Here the string allocations would be removed when token strings are not needed, and the other internal allocations optimized. This API would let me pool the tensors and remove the int-to-long casting for my models.
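To illustrate how the proposal would be consumed (TensorPool is hypothetical here, standing in for whatever pooling the caller does; the point is that the caller owns the buffers and chooses the element type):

ReadOnlySpan<string> texts = new[] { "first sentence", "second sentence" };
int maxTokenCount = 256;

// Rent Int64 tensors shaped [batch, maxTokenCount] so no widening copy is
// needed before feeding the ONNX session.
Tensor<long> inputIds = TensorPool.Rent<long>(texts.Length, maxTokenCount);
Tensor<long> inputMask = TensorPool.Rent<long>(texts.Length, maxTokenCount);

tokenizer.BatchTokenize(texts, maxTokenCount, inputIds, inputMask);
// ... run inference, then return the tensors to the pool.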

Describe alternatives you've considered
I have implemented my own batch tokenizer: https://github.com/tjwald/high-perf-ML/blob/develop/ML.Infra/Tokenization/PretrainedTokenizer.cs.

Additional context
Continuing the tokenization part of this ticket: https://github.com/microsoft/semantic-kernel/issues/9793.

tjwald avatar Jan 23 '25 07:01 tjwald

Other implementation from AI Dev Gallery: https://github.com/microsoft/ai-dev-gallery/blob/main/AIDevGallery/Samples/SharedCode/TokenizerExtensions.cs

luisquintanilla avatar Feb 12 '25 21:02 luisquintanilla

After implementing other scenarios, I think that the API should take a TensorSpan - it is more flexible.

tjwald avatar Mar 18 '25 16:03 tjwald

Note that the tokenizers library supports .NET Core and netstandard2.0, and tensors are not available on netstandard. We need to figure out another way to help with batching that makes it easy to bridge the result to a tensor/span.
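For example, one netstandard-friendly shape could write into caller-provided flat Span<T> buffers with an explicit stride, which callers on newer TFMs could then wrap in a TensorSpan<T>. A sketch only, not a committed design:

public abstract class Tokenizer
{
    // Writes up to maxTokenCount ids per text into destinationIds and 1/0
    // padding flags into destinationMask; both buffers are row-major with
    // stride maxTokenCount, so destination length is texts.Length * maxTokenCount.
    public abstract void BatchEncodeToIds(
        ReadOnlySpan<string> texts,
        int maxTokenCount,
        Span<int> destinationIds,
        Span<int> destinationMask);
}

Span<T> works on netstandard2.0 via System.Memory, and the same flat memory can back a tensor on platforms where tensors exist.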

tarekgh avatar Mar 18 '25 18:03 tarekgh

@luisquintanilla @tarekgh Any updates here?

tjwald avatar Sep 06 '25 18:09 tjwald

Not yet. For now, you can continue using your own batching implementation. We’ve had to prioritize other work, which pushed this item down the list. My understanding is that this isn’t blocking you, correct?

tarekgh avatar Sep 06 '25 19:09 tarekgh

No blocker, just losing performance :)

tjwald avatar Sep 06 '25 19:09 tjwald