machinelearning icon indicating copy to clipboard operation
machinelearning copied to clipboard

Track Tokenizers design feedback

Open tarekgh opened this issue 1 year ago • 0 comments

This issue to track investigating and address the feedback we got regarding the tokenizers design

@stephentoub Feedback

If we’re able to make such breaking changes, we should also be reconsidering other aspects of the library then I think, in particular for perf, for example:

  • Token is class, which means allocation per token, plus the design effectively forces the string Value of the Token to be materialized even if it’s never used.

  • I don’t see any way to get just a token count without materializing the list of tokens, even though just the count is a commonly needed thing in these scenarios. Presumably such an API could get away with a lot less overhead / allocation.

  • Should there be support for spans baked in?

  • [x] https://github.com/dotnet/machinelearning/pull/7024

tarekgh avatar Feb 01 '24 23:02 tarekgh