Converting c4ai-command-r-v01 model fails due to missing "trust_remote_code=True" in _set_vocab_gpt2
Defect
With llama.cpp b2450 and transformers 4.38.2, converting the c4ai-command-r-v01 model fails with an exception.
Steps to reproduce
- Download https://huggingface.co/CohereForAI/c4ai-command-r-v01
- Execute the following in the llama.cpp working directory:
python .\convert-hf-to-gguf.py --outfile "./models/c4ai-command-r-v01.gguf" "path/to/repository"
Error message
Loading model: c4ai-command-r-v01
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
Traceback (most recent call last):
File "%USERPROFILE%\miniconda3\envs\llama.cpp\lib\site-packages\transformers\dynamic_module_utils.py", line 595, in resolve_trust_remote_code
signal.signal(signal.SIGALRM, _raise_timeout_error)
AttributeError: module 'signal' has no attribute 'SIGALRM'. Did you mean: 'SIGABRT'?
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Arbeit\windows_llama.cpp\vendor\llama.cpp\convert-hf-to-gguf.py", line 2073, in <module>
main()
File "D:\Arbeit\windows_llama.cpp\vendor\llama.cpp\convert-hf-to-gguf.py", line 2060, in main
model_instance.set_vocab()
File "D:\Arbeit\windows_llama.cpp\vendor\llama.cpp\convert-hf-to-gguf.py", line 73, in set_vocab
self._set_vocab_gpt2()
File "D:\Arbeit\windows_llama.cpp\vendor\llama.cpp\convert-hf-to-gguf.py", line 226, in _set_vocab_gpt2
tokenizer = AutoTokenizer.from_pretrained(dir_model)
File "%USERPROFILE%\miniconda3\envs\llama.cpp\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 797, in from_pretrained
trust_remote_code = resolve_trust_remote_code(
File "%USERPROFILE%\miniconda3\envs\llama.cpp\lib\site-packages\transformers\dynamic_module_utils.py", line 611, in resolve_trust_remote_code
raise ValueError(
ValueError: The repository for D:\Arbeit\windows_manage_large_language_models\source\c4ai-command-r-v01 contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/D:\Arbeit\windows_manage_large_language_models\source\c4ai-command-r-v01.
Please pass the argument `trust_remote_code=True` to allow custom code to be run.
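The first traceback occurs because transformers' interactive trust prompt relies on `signal.SIGALRM`, which is POSIX-only and does not exist on Windows; when the prompt fails, transformers falls back to raising the `ValueError` above. A minimal check (standard library only) illustrates the platform difference:

```python
import signal
import sys

# signal.SIGALRM exists only on POSIX platforms. On Windows the
# attribute is missing, which is why transformers' interactive
# trust_remote_code prompt raises AttributeError before the
# ValueError shown above is raised instead.
has_sigalrm = hasattr(signal, "SIGALRM")
print(sys.platform, has_sigalrm)
```

On Linux and macOS this prints `True`; on Windows it prints `False`.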
Workaround
Adding `trust_remote_code=True` to the `AutoTokenizer.from_pretrained` call in https://github.com/ggerganov/llama.cpp/blob/b2450/convert-hf-to-gguf.py#L226 fixes the issue:
tokenizer = AutoTokenizer.from_pretrained(dir_model, trust_remote_code=True)
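Hard-coding `trust_remote_code=True` silently executes arbitrary code from the model repository. A less permissive sketch (not present in convert-hf-to-gguf.py b2450; the flag name `--trust-remote-code` is an assumption) would gate the behaviour behind an explicit opt-in:

```python
import argparse

# Hypothetical CLI wiring (not in upstream convert-hf-to-gguf.py b2450):
# expose an explicit opt-in flag instead of hard-coding trust_remote_code=True.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust-remote-code",
    action="store_true",
    help="allow custom code from the model repository to be executed",
)

# The parsed flag would then be forwarded to the tokenizer call:
#   AutoTokenizer.from_pretrained(dir_model, trust_remote_code=args.trust_remote_code)
args = parser.parse_args(["--trust-remote-code"])
print(args.trust_remote_code)  # True only when the user opted in
```

With this wiring, omitting the flag keeps the current safe default (`False`) and the conversion fails with a clear message instead of running untrusted code.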
Question
Why is remote code trusted for _set_vocab_qwen, but not for _set_vocab_gpt2?
Context
https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForPreTraining.from_pretrained.trust_remote_code