
Whisper Tokenizer support

Open · MithrilMan opened this issue 11 months ago · 1 comment

Is your feature request related to a problem? Please describe. Whisper tokenizer support needed

Describe the solution you'd like Would be nice to have support for the Whisper tokenizer.

Describe alternatives you've considered I'm new to tokenizers, so I'm not sure whether what I'm doing right now is correct, but I'm trying to use a BpeTokenizer, passing the vocab and merges files plus the special tokens. This isn't straightforward: for example, I'm reading https://huggingface.co/onnx-community/whisper-large-v3-turbo/blob/main/special_tokens_map.json, and I also need to read the vocab file to find the max id, so I know where to start mapping special tokens to id numbers.

The linked repository even has a tokenizer.json that I suppose already contains everything, without the need to pass vocab and merges, but I don't see a way to use it out of the box (I haven't found a constructor that accepts a tokenizer.json file).
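For what it's worth, the max-id bookkeeping described above can be sketched roughly like this. This is only a sketch: `MapSpecialTokens` is a hypothetical helper (not part of any library), and it assumes vocab.json is a flat JSON object of token → id, as in the linked repository.

```csharp
using System.Text.Json;

// Hypothetical helper: read vocab.json (token -> id) and compute the next
// free id, so special tokens from special_tokens_map.json can be assigned
// consecutive ids after the regular vocabulary.
static Dictionary<string, int> MapSpecialTokens(string vocabPath, IEnumerable<string> specialTokens)
{
    var vocab = JsonSerializer.Deserialize<Dictionary<string, int>>(
        File.ReadAllText(vocabPath))!;
    int nextId = vocab.Values.Max() + 1;

    var map = new Dictionary<string, int>();
    foreach (string token in specialTokens)
    {
        map[token] = nextId++;
    }
    return map;
}
```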

MithrilMan avatar Dec 23 '24 20:12 MithrilMan

@luisquintanilla @tarekgh

ericstj avatar Jan 07 '25 16:01 ericstj

Here is how to load and use this tokenizer:

  • Reference the updated tokenizers package in your csproj:
  <ItemGroup>
    <PackageReference Include="Microsoft.ML.Tokenizers" Version="2.0.0-preview.25503.2" />
  </ItemGroup>
  • Then use the following code to create the tokenizer:

// Can add more tokens if your scenario needs it
Dictionary<string, int> specialTokens = new()
{
    { "<|endoftext|>",          50257 },
    { "<|startoftranscript|>",  50258 },
    { "<|notimestamps|>",       50364 }
};

// vocab.json and merges.txt are downloaded from https://huggingface.co/onnx-community/whisper-large-v3-turbo/tree/main

BpeOptions bpeOptions = new BpeOptions("vocab.json", "merges.txt")
{
    ByteLevel = true,
    ContinuingSubwordPrefix = "",
    EndOfWordSuffix = "",
    BeginningOfSentenceToken = "<|startoftranscript|>",
    EndOfSentenceToken = "<|endoftext|>",
    SpecialTokens = specialTokens,
};

BpeTokenizer tokenizer = BpeTokenizer.Create(bpeOptions);

  • Now you can call the tokenizer:
string text = "Hello, World!";
var tokens = tokenizer.EncodeToTokens(text, out string? normalizedText, considerPreTokenization: false);

foreach (var token in tokens)
{
    Console.WriteLine($"Token: [{token.Id}, '{token.Value}',  ({token.Offset.Start.Value}, {token.Offset.End.Value})]");
}

This should produce output like:

Token: [50258, '<|startoftranscript|>',  (0, 0)]
Token: [15947, 'Hello',  (0, 5)]
Token: [11, ',',  (5, 6)]
Token: [3937, 'ĠWorld',  (6, 12)]
Token: [0, '!',  (12, 13)]
Token: [50257, '<|endoftext|>',  (13, 13)]
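If you also want the raw ids, or to map ids back to text, the base Tokenizer API exposes EncodeToIds and Decode. A small sketch (verify the exact overloads against the preview package version you reference):

```csharp
// Round-trip: encode the text to ids, then decode the ids back to a string.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
string decoded = tokenizer.Decode(ids);

Console.WriteLine(string.Join(", ", ids));
Console.WriteLine(decoded);
```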

I am closing this issue; feel free to reply with any questions. Thanks for your report.

tarekgh avatar Oct 04 '25 01:10 tarekgh