
[Bug] TypeError: expected str, bytes or os.PathLike object, not NoneType

[Open] JadeCityYC opened this issue 9 months ago · 6 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Hi,

When I run the Mini-InternVL2-DA-Medical 4B and 2B models, I hit the following issue; it seems the vocab file cannot be found. Could you kindly take a look? Thanks!

Reproduction

I copied the script from https://huggingface.co/OpenGVLab/Mini-InternVL2-4B-DA-Medical and ran it.

Environment

pip install -r requirements.txt, using the requirements file provided by the GitHub repo.

Error traceback

Traceback (most recent call last):
  File "/home/yucheng/code/InternVL/test_4B_DA_Medical.py", line 91, in <module>
    tokenizer = LlamaTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
  File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
  File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/yucheng/miniconda/envs/internvl/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 206, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType

JadeCityYC avatar Mar 05 '25 03:03 JadeCityYC
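For context, the failure is easy to reproduce outside transformers: the slow LlamaTokenizer stores the path to the SentencePiece vocab file (tokenizer.model) in self.vocab_file; if the model repo ships no such file, that attribute stays None, and the open() call in get_spm_processor fails exactly as in the traceback. A minimal sketch:

```python
# Minimal reproduction of the error, independent of transformers.
# vocab_file ends up None when no tokenizer.model is found in the repo.
vocab_file = None

try:
    with open(vocab_file, "rb") as f:
        f.read()
except TypeError as e:
    # Matches the last line of the traceback above.
    print(e)  # expected str, bytes or os.PathLike object, not NoneType
```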

Hi,

Please try

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

yuecao0119 avatar Mar 06 '25 07:03 yuecao0119

> Hi,
>
> Please try
>
> tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

Hi, in the script for the medical-4B model, that line is already identical to the one you provided, and I still hit the issue.

If you want to load a model using multiple GPUs, please refer to the Multiple GPUs section.

path = 'OpenGVLab/Mini-InternVL2-4B-DA-Medical'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

JadeCityYC avatar Mar 06 '25 15:03 JadeCityYC

Is your model directory a local directory? You can check whether the file is downloaded completely.

yuecao0119 avatar Mar 11 '25 05:03 yuecao0119
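If the directory is a local snapshot, an incomplete download is a plausible cause. A hedged sketch for checking that the expected files are present (the file names below are typical for Hub model repos; check the actual file list on the model page):

```python
import os
import tempfile

def missing_files(model_dir, required):
    """Return the required file names that are absent from model_dir."""
    present = set(os.listdir(model_dir))
    return [name for name in required if name not in present]

# Example with a throwaway directory standing in for a partial snapshot:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "tokenizer_config.json"), "w").close()
    required = ["tokenizer_config.json", "tokenizer.json", "config.json"]
    print(missing_files(d, required))  # ['tokenizer.json', 'config.json']
```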

@chenyucheng0221 @yuecao0119 I am also facing the same issue with 'OpenGVLab/Mini-InternVL2-4B-DA-BDD'. I tried tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False), but no luck.

Thanks

rishabh-akridata avatar Mar 19 '25 07:03 rishabh-akridata

In the line tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False), you can drop use_fast=False and use tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True). That worked for me with InternVL2-4B-medical in a Windows 10 environment.

Zzz0251 avatar Mar 23 '25 12:03 Zzz0251

> tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False) , in this line, you can revise as tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True) , it is useful for me when using InternVL2-4B-medical in Windows10 environment.

You are right. InternVL2-4B-medical requires a fast tokenizer.

yuecao0119 avatar Apr 12 '25 08:04 yuecao0119
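This resolution is consistent with the traceback: a fast tokenizer loads from tokenizer.json, while the slow one needs the SentencePiece tokenizer.model, which (assumption, based on this thread) the 4B repos do not ship. A sketch mimicking that file-selection logic, with pick_tokenizer_file as a hypothetical stand-in for what happens inside from_pretrained:

```python
import os
import tempfile

def pick_tokenizer_file(model_dir, use_fast=True):
    """Return the tokenizer file a loader would use, or None if missing."""
    fast = os.path.join(model_dir, "tokenizer.json")
    slow = os.path.join(model_dir, "tokenizer.model")
    if use_fast and os.path.exists(fast):
        return fast
    # Forcing the slow path with no tokenizer.model yields None,
    # which reproduces the TypeError reported in this issue.
    return slow if os.path.exists(slow) else None

# A repo that ships only the fast tokenizer file:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "tokenizer.json"), "w").close()
    print(pick_tokenizer_file(d, use_fast=True) is not None)  # True
    print(pick_tokenizer_file(d, use_fast=False) is None)     # True
```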