DNABERT
Can you use the pre-trained BERT models, but add novel tokens to the vocabulary?
Can you use the pre-trained BERT models, but add novel tokens to the vocabulary during fine-tuning? Any tips on what's needed for this?
Or, during fine-tuning, MUST you use the same vocab.txt file that was used in pre-training?
I want to add some of the IUPAC symbols, for example the symbol Y, which means "T or C". That would expand my vocabulary a lot.
But I don't have the resources to retrain.
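For reference, here is a minimal sketch of how adding tokens before fine-tuning typically looks with the standard Hugging Face transformers API, which the DNABERT codebase is derived from (the exact imports may differ in DNABERT's pinned fork). The checkpoint path and the token list are placeholders, not values from this repo:

```python
from transformers import BertTokenizer, BertForMaskedLM

# Placeholder path to a local DNABERT checkpoint directory
tokenizer = BertTokenizer.from_pretrained("path/to/dnabert-checkpoint")
model = BertForMaskedLM.from_pretrained("path/to/dnabert-checkpoint")

# Register new whole tokens; add_tokens() skips tokens that already
# exist and returns the number actually added.
new_tokens = ["Y", "R", "N"]  # example IUPAC ambiguity codes
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embedding matrices so the new ids have rows.
# The new rows are randomly initialized and only become meaningful
# through fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```

One caveat, assuming the usual k-mer setup: DNABERT tokenizes sequences into k-mers, so a bare "Y" token would only match a lone Y in the k-merized input. Covering k-mers that contain ambiguity codes means adding far more tokens (all 15 IUPAC codes would grow a 6-mer vocabulary from 4^6 = 4096 toward 15^6 ≈ 11.4M), and every new embedding starts random, so the fine-tuning data has to be enough to learn them.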
Related, but I believe it discusses training from scratch: https://github.com/jerryji1993/DNABERT/issues/81