[Bug] TypeError: expected str, bytes or os.PathLike object, not NoneType
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
Hi,
When I run the Mini-InternVL2-4B-DA-Medical and 2B models, I hit the following issue; it seems the vocab file cannot be found. Could you kindly take a look? Thanks!
Reproduction
I copied the script from https://huggingface.co/OpenGVLab/Mini-InternVL2-4B-DA-Medical and ran it.
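For reference, here is a minimal snippet reconstructed from the traceback below that should reproduce the failure (the model path and the LlamaTokenizer call are taken from the traceback; the rest is an assumption about the surrounding script):

```python
# Minimal reproduction sketch, reconstructed from the traceback below.
from transformers import LlamaTokenizer

path = 'OpenGVLab/Mini-InternVL2-4B-DA-Medical'
# Raises TypeError: the checkpoint ships no SentencePiece vocab file for the
# slow LlamaTokenizer, so self.vocab_file ends up as None when opened.
tokenizer = LlamaTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```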
Environment
Dependencies installed with pip install -r requirements.txt, using the requirements file provided by the GitHub repo.
Error traceback
Traceback (most recent call last):
File "/home/yucheng/code/InternVL/test_4B_DA_Medical.py", line 91, in <module>
tokenizer = LlamaTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
return cls._from_pretrained(
File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 206, in get_spm_processor
with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
Hi,
Please try
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
Hi, in the script for the Medical-4B model, that line is already the same as the one you provided, and I still hit the issue.
If you want to load a model using multiple GPUs, please refer to the Multiple GPUs section.
    path = 'OpenGVLab/Mini-InternVL2-4B-DA-Medical'
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        use_flash_attn=True,
        trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
Is your model path a local directory? If so, please check whether all files, including the tokenizer files, were downloaded completely.
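If the path is local, one way to rule out an incomplete download is to fetch a full snapshot with huggingface_hub (a sketch; the model id is the one from this thread):

```python
# Sketch: download a complete local copy of the checkpoint so no tokenizer
# files are missing; snapshot_download fetches any files absent from the cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id='OpenGVLab/Mini-InternVL2-4B-DA-Medical')
print(local_dir)  # pass this directory as `path` when loading
```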
@chenyucheng0221 @yuecao0119 I am also facing the same issue with 'OpenGVLab/Mini-InternVL2-4B-DA-BDD'. I tried tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False), but no luck.
Thanks
In the line tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False), you can revise it to tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True). That worked for me when using InternVL2-4B-Medical in a Windows 10 environment.
You are right. InternVL2-4B-medical requires a fast tokenizer.
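For anyone landing here, a working tokenizer load based on the fix above (dropping use_fast=False so the fast tokenizer is used; the model id is the 4B medical checkpoint from this thread):

```python
# Working load per the comments above: omit use_fast=False so the fast
# tokenizer that ships with the checkpoint is used instead of the slow
# SentencePiece-based one, which needs a vocab file this repo doesn't have.
from transformers import AutoTokenizer

path = 'OpenGVLab/Mini-InternVL2-4B-DA-Medical'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```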