DNABERT
Can you use the pre-trained BERT models, but add novel tokens to the vocabulary?
Can you use the pre-trained BERT models, but add novel tokens to the vocabulary during fine-tuning? Any tips on what's needed for this?
Or, during fine-tuning, MUST you use the same vocab.txt file that was used in pre-training?
I want to add some of the IUPAC symbols, for example the symbol Y, which means "T or C". That would expand my vocabulary a lot.
But I don't have the resources to retrain.
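For reference, here is a minimal sketch of how adding tokens before fine-tuning typically looks with the standard Hugging Face transformers API, which the DNABERT codebase is derived from (the exact imports may differ in DNABERT's pinned fork). The checkpoint path and the token list are placeholders, not values from this repo:

```python
from transformers import BertTokenizer, BertForMaskedLM

# Placeholder path to a local DNABERT checkpoint directory
tokenizer = BertTokenizer.from_pretrained("path/to/dnabert-checkpoint")
model = BertForMaskedLM.from_pretrained("path/to/dnabert-checkpoint")

# Register new whole tokens; add_tokens() skips tokens that already
# exist and returns the number actually added.
new_tokens = ["Y", "R", "N"]  # example IUPAC ambiguity codes
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embedding matrices so the new ids have rows.
# The new rows are randomly initialized and only become meaningful
# through fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

One caveat, assuming the usual k-mer setup: DNABERT tokenizes sequences into k-mers, so a bare "Y" token would only match a lone Y in the k-merized input. Covering k-mers that contain ambiguity codes means adding far more tokens (all 15 IUPAC codes would grow a 6-mer vocabulary from 4^6 = 4096 toward 15^6 ≈ 11.4M), and every new embedding starts random, so the fine-tuning data has to be enough to learn them.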
Related, but I believe it discusses training from scratch: https://github.com/jerryji1993/DNABERT/issues/81