ModernBERT
Vocab.txt for ONNX?
I want to try this out with ONNX, but I'm using .NET 9 rather than Python. How can I get the vocab.txt file? The vocabulary is larger than the one I'm currently using for BERT tokenization.
Was searching for this as well.
I haven't yet tested this, but it seems that you may be able to extract it from the transformers package.
from transformers import AutoTokenizer

def create_vocab_file():
    tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
    # get_vocab() returns a dict mapping token string -> token id
    vocab = tokenizer.get_vocab()
    # Sort by id so that the line order in the file follows the token ids
    sorted_tokens = sorted(vocab.items(), key=lambda x: x[1])
    with open('vocab.txt', 'w', encoding='utf-8') as f:
        for token, _ in sorted_tokens:
            f.write(f"{token}\n")

if __name__ == "__main__":
    create_vocab_file()
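One thing worth checking before loading the file elsewhere: a vocab.txt reader normally treats the zero-based line number as the token id, so the export is only faithful if the ids run from 0 to n-1 with no gaps. Below is a minimal sketch of that check, using a made-up toy vocabulary instead of the real tokenizer so it runs without any download. The names `write_vocab`, `read_vocab` and `toy_vocab` are hypothetical helpers for illustration, not part of transformers. Also, as far as I know ModernBERT ships a BPE tokenizer defined in tokenizer.json rather than a WordPiece vocabulary, so a vocab.txt on its own may not be enough for a BERT-style WordPiece tokenizer.

```python
import os
import tempfile


def write_vocab(vocab, path):
    """Write one token per line, ordered by id.

    Raises ValueError if the ids are not exactly 0..n-1, because a
    line-number-based reader would then assign the wrong ids.
    """
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    ids = [idx for _, idx in ordered]
    if ids != list(range(len(ids))):
        raise ValueError("token ids are not contiguous from 0")
    # newline="\n" keeps line endings identical on every platform
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for token, _ in ordered:
            f.write(token + "\n")


def read_vocab(path):
    """Read the file back, using the zero-based line number as the id."""
    with open(path, encoding="utf-8", newline="\n") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


# Toy vocabulary (an assumption for illustration, not ModernBERT's real one)
toy_vocab = {"[PAD]": 0, "hello": 2, "[UNK]": 1, "Ġworld": 3}

path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
write_vocab(toy_vocab, path)
round_trip = read_vocab(path)
```

With the real tokenizer you would pass `tokenizer.get_vocab()` in place of `toy_vocab`. If `write_vocab` raises, the ids have gaps and a plain line-per-token file will not reproduce them.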
Did this work? I'm not familiar with Python, and I'm working with .NET.