DNABERT Model name 'dna6' was not found in tokenizers model name list

Hi there,

I am running DNABERT's run_finetune.py as instructed by the README. It works well on my workstation, but when I run the same code on the server, it reports the following error:

OSError: Model name 'dna6' was not found in tokenizers model name list (dna3, dna4, dna5, dna6). We assumed 'dna6' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

I wonder why "Model name 'dna6' was not found in tokenizers model name list (dna3, dna4, dna5, dna6)"? It seems so strange, because dna6 is definitely in the list.

Thanks for the answer!

Sep 23 '21 03:09 BioSenior

I have exactly the same problem: it runs fine on my workstation but not on the server, where it gives the same error!

Sep 23 '21 16:09 ksenia007

I think I've figured out the issue! For me, the error message was generated here and came from loading the vocab files. By default, the vocab_files entries are in fact links to the files, and the server would not allow me to download files from running code. If you download the vocab files separately and then provide the path to the file instead of dna6, it seems to work!

Sep 29 '21 00:09 ksenia007

It works! Thank you very much!

Sep 29 '21 12:09 BioSenior

Hello @ksenia007 and @BioSenior 👍

Thank you for sharing. I get the same error message when I run 'python run_pretrain.py' as described in the README: OSError: Model name 'dna6' was not found in tokenizers model name list (dna3, dna4, dna5, dna6).

If I search for vocab files in my DNABERT directory, I find the following:

[ ~DNABERT]$ find . -name vocab*
./examples/ft/6/vocab.txt
./examples/ft/6/pre/vocab.txt
./examples/ft/6/pre_2_old/vocab.txt
./examples/ft/6-bk/vocab.txt
./src/transformers/dnabert-config/bert-config-6/vocab.txt
./src/transformers/dnabert-config/bert-config-4/vocab.txt
./src/transformers/dnabert-config/bert-config-5/vocab.txt
./src/transformers/dnabert-config/bert-config-3/vocab.txt

Could you please advise what to change in the commands to get past this error?

Sep 29 '21 21:09 ryao-mdanderson

@ryao-mdanderson I am not sure if you have the same problem. However, I believe that if you specify just dna6 as the tokenizer name, it tries to download vocab.txt from these links rather than read the files in the source folder. In my case, I downloaded the vocab.txt file into my data folder using wget, and then passed path/to/directory/vocab.txt as tokenizer_name.
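
The workaround described above can be sketched in Python. This is a minimal sketch, not part of DNABERT: the helper name local_vocab_path and its parameters are hypothetical, and the directory layout (src/transformers/dnabert-config/bert-config-<k>/vocab.txt) is assumed from the find output posted earlier in this thread. It only checks that a local vocab.txt exists and returns its path, so that the path can be passed as tokenizer_name instead of dna6.

```python
from pathlib import Path


def local_vocab_path(repo_root, kmer):
    """Return the path of a local vocab.txt for the given k-mer size.

    Hypothetical helper. The layout below is an assumption taken from the
    `find` output in this thread, not from DNABERT documentation.
    """
    candidate = (
        Path(repo_root)
        / "src"
        / "transformers"
        / "dnabert-config"
        / f"bert-config-{kmer}"
        / "vocab.txt"
    )
    if not candidate.is_file():
        # Fail early with a clear message instead of letting the tokenizer
        # fall back to treating the value as a model name.
        raise FileNotFoundError(f"No vocab.txt found at {candidate}")
    return candidate


# Example (paths are placeholders):
#   vocab = local_vocab_path("/path/to/DNABERT", 6)
#   then pass str(vocab) as tokenizer_name instead of "dna6"
```

If the file was downloaded somewhere else (for example with wget into a data folder, as in the comment above), the same idea applies: confirm the file exists, then pass its full path.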

Sorry if that does not help in your case!

Sep 30 '21 00:09 ksenia007

@ksenia007 👍

Thank you very much, I got it. Since I am running the code on a compute cluster node that does not have internet access, I followed your suggestion and changed tokenizer_name to path/to/directory/vocab.txt. It works now.

Sep 30 '21 03:09 ryao-mdanderson

@ksenia007, @ryao-mdanderson, @jerryji1993 I am getting the same error with the run_pretrain.py script. I tried the same solution, but it didn't work at all. The error is given below:

<class 'transformers.tokenization_dna.DNATokenizer'>
Traceback (most recent call last):
  File "examples/run_pretrain.py", line 885, in <module>
    main()
  File "examples/run_pretrain.py", line 789, in main
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
  File "/home/smrutip/smruti/DNABERT/src/transformers/tokenization_utils.py", line 377, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/home/smrutip/smruti/DNABERT/src/transformers/tokenization_utils.py", line 479, in _from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'dna6' was not found in tokenizers model name list (dna3, dna4, dna5, dna6). We assumed 'dna6' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

Can you please help me regarding this?
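
One thing that may be worth checking: the error above still names 'dna6', which suggests that the value reaching from_pretrained was still the model name and not a file path. The following is a small diagnostic sketch under that assumption. The function describe_tokenizer_name is hypothetical and is not part of DNABERT or transformers; it only reports whether a value points at a local vocab file, at a directory containing vocab.txt, or at neither.

```python
from pathlib import Path


def describe_tokenizer_name(value, vocab_filename="vocab.txt"):
    """Classify a tokenizer_name value before training starts.

    Hypothetical helper for debugging. Returns one of:
      "file"      - value is an existing file (e.g. a downloaded vocab.txt)
      "directory" - value is a directory that contains vocab_filename
      "name"      - value is neither, so it would have to be resolved some
                    other way (for example as a model name needing a download)
    """
    path = Path(value)
    if path.is_file():
        return "file"
    if path.is_dir() and (path / vocab_filename).is_file():
        return "directory"
    return "name"


# Example (path is a placeholder):
#   print(describe_tokenizer_name("path/to/directory/vocab.txt"))
```

If this returns "name" for the value given on the command line, the path is probably misspelled or is relative to a different working directory than the one the script runs in.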

Apr 11 '23 19:04 smruti241
