
[Tokenizers] Port CLIP Tokenizer

Open ericstj opened this issue 1 year ago • 1 comment

Port the CLIP tokenizer, which uses byte-level BPE. This tokenizer enables scenarios such as Stable Diffusion.
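For readers unfamiliar with the technique, here is a minimal Python sketch of the core byte-level BPE merge loop. It is an illustration only, not the CLIP or ML.NET implementation: the function name `byte_level_bpe` and the toy `merge_ranks` table are hypothetical, and it leaves out pre-tokenization, special tokens, and the vocabulary ID lookup that a real port would need.

```python
def byte_level_bpe(word, merge_ranks):
    """Split `word` into UTF-8 bytes, then greedily apply ranked merges.

    `merge_ranks` maps a pair of byte strings to its priority
    (lower rank = learned earlier = merged first). Hypothetical toy format.
    """
    # Start from single bytes, so any input text is representable.
    symbols = [bytes([b]) for b in word.encode("utf-8")]

    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) rank.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge applies any more

        # Merge every non-overlapping occurrence of the chosen pair.
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

    return symbols


# Toy merge table (assumed for illustration, not taken from CLIP's merges file).
toy_ranks = {(b"l", b"o"): 0, (b"lo", b"w"): 1}
tokens = byte_level_bpe("lower", toy_ranks)
```

With this toy table, "lower" first merges `l` + `o`, then `lo` + `w`, and stops when no remaining pair is in the table.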

May be dependent on https://github.com/dotnet/machinelearning/issues/6992.

Reference:
https://huggingface.co/docs/transformers/main/en/model_doc/clip
https://github.com/huggingface/transformers/blob/0549000c5bf6c7249f411917f2a6f0b6d0f06da1/src/transformers/models/codegen/tokenization_codegen.py#L98
https://onnxruntime.ai/docs/tutorials/csharp/stable-diffusion-csharp.html#tokenization-with-onnx-runtime-extensions

Paper:
https://arxiv.org/abs/2103.00020
https://arxiv.org/pdf/2103.00020.pdf

ericstj avatar Feb 08 '24 18:02 ericstj

Note: the ONNX sample doesn't require a separate tokenizer.

@LittleLittleCloud might need this for a solution that works with TorchSharp.

ericstj avatar Mar 18 '24 20:03 ericstj