Vocab.txt for ONNX?

Open chuckbeasley opened this issue 1 year ago • 2 comments

I want to try this out with ONNX from .NET 9 rather than Python. How can I get the vocab.txt file for this model? Its vocabulary is larger than the one I'm currently using for BERT tokenization.

Dec 24 '24 15:12 chuckbeasley

Was searching for this as well.

I haven't tested this yet, but it looks like you can extract the vocabulary with the tokenizer from the transformers package.

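from transformers import AutoTokenizer

def create_vocab_file():
    # Download (or load from the local cache) the tokenizer published with the model.
    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

    # get_vocab() returns a dict mapping each token string to its integer id.
    vocab = tokenizer.get_vocab()

    # Sort by id so that line N of the file holds the token with id N.
    # This only lines up if the ids are contiguous and start at 0.
    sorted_tokens = sorted(vocab.items(), key=lambda x: x[1])

    with open('vocab.txt', 'w', encoding='utf-8') as f:
        for token, _ in sorted_tokens:
            f.write(f"{token}\n")

if __name__ == "__main__":
    create_vocab_file()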
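Since the script above is untested, here is a small sketch of how the output could be sanity-checked without downloading anything. The helper names `write_vocab` and `read_vocab` and the `[unused_gap_N]` placeholder are my own inventions for illustration, not part of transformers. The idea is to write one token per line by id, fill any gaps in the id range so that line numbers keep matching ids, and then read the file back to confirm the mapping. One caveat to verify on your side: as far as I know, ModernBERT's tokenizer is BPE-based rather than WordPiece, so a plain token list may not be enough for a WordPiece-style BERT tokenizer in .NET.

```python
import os
import tempfile


def write_vocab(vocab, path):
    """Write one token per line so that line N (0-based) holds the token with id N.

    `vocab` is a dict of token -> id, the shape returned by tokenizer.get_vocab().
    Returns the number of lines written.
    """
    if not vocab:
        raise ValueError("vocab is empty")
    id_to_token = {idx: token for token, idx in vocab.items()}
    size = max(id_to_token) + 1
    # newline="\n" avoids "\r\n" translation on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for idx in range(size):
            # Hypothetical placeholder for ids that have no token, so later
            # tokens do not shift to the wrong line.
            f.write(id_to_token.get(idx, f"[unused_gap_{idx}]") + "\n")
    return size


def read_vocab(path):
    """Read the file back as token -> line index (0-based)."""
    with open(path, encoding="utf-8", newline="\n") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


if __name__ == "__main__":
    # Toy vocabulary with a deliberate gap at id 3.
    toy = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 4}
    path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
    write_vocab(toy, path)
    round_trip = read_vocab(path)
    # Every original token should come back with the id it started with.
    print(all(round_trip[token] == idx for token, idx in toy.items()))
```

With the real tokenizer, the equivalent call would be `write_vocab(tokenizer.get_vocab(), "vocab.txt")`, followed by the same round-trip check against `tokenizer.get_vocab()`.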

Jan 13 '25 19:01 dkapellusch

Did this work? I'm not familiar with Python and I'm working with .NET.

Jan 27 '25 22:01 chuckbeasley