
Error when running zero_shot_fitness_tranception.py: No such file or directory

Open · BoyangLi-NKU opened this issue 9 months ago · 1 comment

Hi @pascalnotin, I encountered an error while running zero_shot_fitness_tranception.py via zero_shot_fitness_subs.sh:

Traceback (most recent call last):
  File "home/ProteinNPT-master/scripts/zero_shot_fitness_tranception.py", line 116, in <module>
    main()
  File "home/ProteinNPT-master/scripts/zero_shot_fitness_tranception.py", line 46, in main
    tokenizer = get_tranception_tokenizer()
  File "/root/miniconda3/envs/proteinnpt_env/lib/python3.10/site-packages/proteinnpt/utils/tranception/model_pytorch.py", line 915, in get_tranception_tokenizer
    tokenizer = PreTrainedTokenizerFast(
  File "/root/miniconda3/envs/proteinnpt_env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)

The get_tranception_tokenizer function is:

def get_tranception_tokenizer(tokenizer_path=None):
    #Tranception Alphabet: "vocab":{"[UNK]":0,"[CLS]":1,"[SEP]":2,"[PAD]":3,"[MASK]":4,"A":5,"C":6,"D":7,"E":8,"F":9,"G":10,"H":11,"I":12,"K":13,"L":14,"M":15,"N":16,"P":17,"Q":18,"R":19,"S":20,"T":21,"V":22,"W":23,"Y":24}
    if tokenizer_path is None:
        dir_path = os.path.dirname(os.path.abspath(__file__))
        tokenizer_path = os.path.join(dir_path, "utils", "tokenizers", "Basic_tokenizer")

    print(tokenizer_path)

    tokenizer = PreTrainedTokenizerFast(
        tokenizer_file=tokenizer_path, 
        unk_token="[UNK]", 
        sep_token="[SEP]", 
        pad_token="[PAD]", 
        cls_token="[CLS]",
        mask_token="[MASK]"
    )
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    tokenizer.tok_to_idx = tokenizer.vocab
    tokenizer.padding_idx = tokenizer.tok_to_idx["[PAD]"]
    tokenizer.mask_idx = tokenizer.tok_to_idx["[MASK]"]
    tokenizer.cls_idx = tokenizer.tok_to_idx["[CLS]"]
    tokenizer.eos_idx = tokenizer.tok_to_idx["[SEP]"]
    tokenizer.prepend_bos = True
    tokenizer.append_eos = True
    return tokenizer

The print statement I added outputs /root/miniconda3/envs/proteinnpt_env/lib/python3.10/site-packages/proteinnpt/utils/tranception/utils/tokenizers/Basic_tokenizer, but that file does not exist in my environment.
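
To double-check, this is roughly how I confirmed the file is missing from the installed package (the module path is taken from the traceback above):

import os
import proteinnpt.utils.tranception.model_pytorch as model_pytorch

# Reproduce the path resolution done inside get_tranception_tokenizer
dir_path = os.path.dirname(os.path.abspath(model_pytorch.__file__))
expected_path = os.path.join(dir_path, "utils", "tokenizers", "Basic_tokenizer")
print(expected_path)                  # .../proteinnpt/utils/tranception/utils/tokenizers/Basic_tokenizer
print(os.path.exists(expected_path))  # False in my environment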

As a stopgap, I assume I could pass tokenizer_path explicitly and point it at a copy of Basic_tokenizer (rough sketch below), but I would rather fix the installation itself. Any suggestions, or are there any settings I need to correct? Thanks!
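
For reference, the kind of workaround I have in mind (the tokenizer file path here is a placeholder, not a file that currently exists in my environment):

from proteinnpt.utils.tranception.model_pytorch import get_tranception_tokenizer

# Placeholder path: a Basic_tokenizer file copied from the ProteinNPT source tree
tokenizer = get_tranception_tokenizer(tokenizer_path="/path/to/Basic_tokenizer")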

BoyangLi-NKU avatar May 10 '24 09:05 BoyangLi-NKU