text-generation-inference
Safetensors conversion fails for LLaMa 13B and 30B
System Info
2023-06-15T04:27:53.010592Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: 5ce89059f8149eaf313c63e9ded4199670cd74bb
Docker label: sha-5ce8905
nvidia-smi:
Thu Jun 15 04:27:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07 Driver Version: 515.65.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:00.0 Off | Off |
| N/A 34C P0 86W / 400W | 25302MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:16:00.0 Off | Off |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:49:00.0 Off | Off |
| N/A 31C P0 73W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4D:00.0 Off | Off |
| N/A 31C P0 71W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C5:00.0 Off | Off |
| N/A 34C P0 91W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | Off |
| N/A 34C P0 92W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:E3:00.0 Off | Off |
| N/A 33C P0 96W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E7:00.0 Off | Off |
| N/A 35C P0 89W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I used the official script to convert the LLaMa weights (7B–65B) and got the following error when launching 13B and 30B only.
I ran this command
CUDA_VISIBLE_DEVICES=2,3 TRANSFORMERS_CACHE=~/ckpts/cache/ FLASH_ATTENTION=1 text-generation-launcher --model-id ~/ckpts/llama-hf/30b/ --num-shard 2 --port XXXXX
2023-06-15T02:42:46.121030Z INFO text_generation_launcher: Args { model_id: "~/ckpts/llama-hf/30b/", revision: None, sharded: None, num_shard: Some(2), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 10208, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-15T02:42:46.121084Z INFO text_generation_launcher: Sharding model on 2 processes
2023-06-15T02:42:46.121251Z INFO text_generation_launcher: Starting download process.
2023-06-15T02:42:49.059320Z WARN download: text_generation_launcher: No safetensors weights found for model ~/ckpts/llama-hf/30b/ at revision None. Converting PyTorch weights to safetensors.
2023-06-15T02:42:49.059564Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00001-of-00007.bin to ~/ckpts/llama-hf/30b/model-00001-of-00007.safetensors.
2023-06-15T02:46:40.369256Z INFO download: text_generation_launcher: Convert: [1/7] -- Took: 0:03:51.309267
2023-06-15T02:46:40.369344Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00002-of-00007.bin to ~/ckpts/llama-hf/30b/model-00002-of-00007.safetensors.
2023-06-15T02:50:24.190971Z INFO download: text_generation_launcher: Convert: [2/7] -- Took: 0:03:43.820986
2023-06-15T02:50:24.191183Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00003-of-00007.bin to ~/ckpts/llama-hf/30b/model-00003-of-00007.safetensors.
2023-06-15T02:54:06.621353Z INFO download: text_generation_launcher: Convert: [3/7] -- Took: 0:03:42.429557
2023-06-15T02:54:06.621511Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00004-of-00007.bin to ~/ckpts/llama-hf/30b/model-00004-of-00007.safetensors.
2023-06-15T02:57:45.265631Z INFO download: text_generation_launcher: Convert: [4/7] -- Took: 0:03:38.643727
2023-06-15T02:57:45.265740Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00005-of-00007.bin to ~/ckpts/llama-hf/30b/model-00005-of-00007.safetensors.
2023-06-15T03:01:38.281861Z INFO download: text_generation_launcher: Convert: [5/7] -- Took: 0:03:53.015641
2023-06-15T03:01:38.281982Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00006-of-00007.bin to ~/ckpts/llama-hf/30b/model-00006-of-00007.safetensors.
2023-06-15T03:03:43.722396Z INFO download: text_generation_launcher: Convert: [6/7] -- Took: 0:02:05.440018
2023-06-15T03:03:43.722865Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00007-of-00007.bin to ~/ckpts/llama-hf/30b/model-00007-of-00007.safetensors.
2023-06-15T03:04:36.163221Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
utils.convert_files(local_pt_files, local_st_files)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
convert_file(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 65, in convert_file
check_file_size(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 20, in check_file_size
raise RuntimeError(
RuntimeError: The file size different is more than 1%:
- ~/ckpts/llama-hf/30b/pytorch_model-00007-of-00007.bin: 5900895281
- ~/ckpts/llama-hf/30b/model-00007-of-00007.safetensors: 5687891896
Error: DownloadError
Expected behavior
The launch works well for LLaMa 7B and 65B, so it's confusing why it won't work for 13B and 30B.
Facing the same issue as of yesterday.
Hmm, the check_file_size is a pretty rough sanity check; the file might actually be OK, but it's hard to tell without looking at it.
You can try deactivating the check? Remove line 20?
Can you point to the actual model on the Hub you're using too? Because we don't have any issue with "official" checkpoints.
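For context, the check in question is essentially a size comparison between the original pickle shard and the converted safetensors file. The sketch below is an approximation reconstructed from the error message in the log above, not the exact code in convert.py (the tolerance argument and exact formula are assumptions):
from pathlib import Path

def check_file_size(source_file: Path, target_file: Path, tolerance: float = 0.01):
    # Compare the original .bin shard with the converted .safetensors file
    # and give up if the sizes diverge by more than the tolerance.
    source_size = source_file.stat().st_size
    target_size = target_file.stat().st_size
    if abs(source_size - target_size) / source_size > tolerance:
        raise RuntimeError(
            f"The file size different is more than {int(tolerance * 100)}%:\n"
            f" - {source_file}: {source_size}\n"
            f" - {target_file}: {target_size}"
        )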
In my case, we're using https://huggingface.co/ausboss/llama-30b-supercot. Initially we had the issue with https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor, however after a restart it miraculously worked.
Line 65 is the line that actually needs to be removed in order to skip the check for the safetensor file size. And here is a temporary patch for Docker for anyone that wants to disable it:
Dockerfile:
FROM ghcr.io/huggingface/text-generation-inference:0.8.2
RUN sed -i '65d' /opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py
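To apply it, build the patched image (for example, docker build -t tgi-patched . from the directory containing that Dockerfile, where tgi-patched is just an example tag) and run that tag instead of the upstream image.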
@Narsil
I'm using the official LLaMa checkpoints that I converted using the official script.
The docker image I'm using is ghcr.io/huggingface/text-generation-inference:latest
and here's the environment I used for converting the llama ckpts:
- python 3.10
- torch 2.1
- transformers 4.30.1
- accelerate 0.20.3
- sentencepiece 0.1.99
- protobuf 3.20.0
The funny thing is the code works well with 7B and 65B models but fails for 13B and 30B
Hmm, the check_file_size is a pretty rough sanity check; the file might actually be OK, but it's hard to tell without looking at it. You can try deactivating the check? Remove line 20?
Throws an InvalidHeaderDeserialization error. Loading the same files using LlamaForCausalLM works fine in a notebook.
The funny thing is the code works well with 7B and 65B models but fails for 13B and 30B
I converted 30B like 3 times today for quantization purposes without a hitch (I'm not on Docker, though).
Throws an InvalidHeaderDeserialization error. Loading the same files using LlamaForCausalLM works fine in a notebook.
You're trying to load pickle files (which would work for LlamaForCausalLM), it seems.
To reproduce the safetensors error with supercot:
git lfs install
git clone https://huggingface.co/ausboss/llama-30b-supercot
docker run -v ./llama-30b-supercot:/usr/src/llama-30b-supercot --gpus all --rm ghcr.io/huggingface/text-generation-inference:sha-5ce8905 --model-id "/usr/src/llama-30b-supercot" --quantize bitsandbytes --trust-remote-code
Note that the same behavior happens with the current latest tag as well as 0.8.2. It always fails for us on the first tensor, which is the smallest and is a pickle.
Interesting news. When I use a custom image (one that I built on my own instead of using the ghcr ones) I get the same sanity check error (which is obvious). However, in my custom-built image, I could easily disable L65, and got no error when loading the model.
We just tried the latest docker image on the llama-30b-supercot model and we still get this error on the very first bin, stopping the conversion:
2023-06-19T08:20:19.093178Z INFO download: text_generation_launcher: Convert /usr/src/llama-30b-supercot/pytorch_model-00001-of-00243.bin to /usr/src/llama-30b-supercot/model-00001-of-00243.safetensors.
2023-06-19T08:20:19.513324Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
Error: DownloadError
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
utils.convert_files(local_pt_files, local_st_files)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
convert_file(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 65, in convert_file
check_file_size(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 20, in check_file_size
raise RuntimeError(
RuntimeError: The file size different is more than 5%:
- /usr/src/llama-30b-supercot/pytorch_model-00001-of-00243.bin: 537
- /usr/src/llama-30b-supercot/model-00001-of-00243.safetensors: 48
After applying the patch to remove L65 and skip the difference check, the safetensors get created but then this error occurs when the model is loading:
2023-06-19T09:12:29.311089Z INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-19T09:12:37.267433Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 237, in get_model
return FlashLlamaSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 185, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 216, in load_weights
with safe_open(
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
rank=0
2023-06-19T09:12:37.267510Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 237, in get_model
return FlashLlamaSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 185, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 216, in load_weights
with safe_open(
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
rank=1
2023-06-19T09:12:38.394993Z ERROR text_generation_launcher: Shard 0 failed to start:
Error: ShardCannotStart
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 237, in get_model
return FlashLlamaSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 185, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 216, in load_weights
with safe_open(
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
This behavior is the same as we had with the previous docker images. Any idea what it could be and what we can try?
Can you clean the cache and re-try? Maybe the file was corrupted.
- /usr/src/llama-30b-supercot/pytorch_model-00001-of-00243.bin: 537
- /usr/src/llama-30b-supercot/model-00001-of-00243.safetensors: 48
This is probably an empty file, or one containing a super small tensor...
It's empty actually...
We managed to get llama-30b-supercot to work; here are some key findings in case they are useful:
Contents of the pytorch_model-00001-of-00243.bin file (of course it's a binary file, but still):
PK%=pytorch_model-00001-of-00243/data.pklFB9ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ�}q.P��PK$(pytorch_model-00001-of-00243/versionFB$ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ3
PўgUPK��%pytorch_model-00001-of-00243/data.pklPKўgU$�pytorch_model-00001-of-00243/versionPK,-�PK�PK�
Contents of model-00001-of-00243.safetensors, which we force-generated by disabling the file size difference check:
({},"__metadata__":{"format":"pt"}}
Torch load of the bin file gives an empty dict:
>>> torch.load("./pytorch_model-00001-of-00243.bin")
{}
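For reference, a safetensors file starts with an 8-byte little-endian length followed by that many bytes of JSON header, so the header of the force-generated file above can be inspected directly. A minimal sketch (the file name comes from above; everything else is illustrative):
import json
import struct

with open("model-00001-of-00243.safetensors", "rb") as f:
    (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian header length
    header_bytes = f.read(header_len)

print(header_len, header_bytes[:80])
# A healthy shard yields a JSON dict of tensor metadata; if this fails to parse,
# that matches the InvalidHeaderDeserialization error safe_open raises above.
json.loads(header_bytes)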
The fix was just removing the first safetensor and restarting text-generation-inference:
mv model-00001-of-00243.safetensors model-00001-of-00243.safetensors.old
The model now loads and gives an expected response:
In [2]: from langchain import HuggingFaceTextGenInference
In [3]: llm = HuggingFaceTextGenInference(
...: inference_server_url='http://localhost:8080/',
...: max_new_tokens=512,
...: top_k=50,
...: top_p=0.95,
...: temperature=0.01,
...: repetition_penalty=1.2,
...: )
In [4]: llm("State what large language model you are and what you are good at:\n")
Out[4]: 'I am a large language model, which means I can understand complex sentences. This makes me great for tasks that require natural language processing such as text summarization or machine translation.'
Our theory is that the first bin file might be a metadata file from LoRA that is causing problems here?
No no, I inspected the first file, it's purely empty. My paranoid flag almost expected a payload here, but no it's just empty.
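A quick way to catch such empty shards up front is to load each pickle file and check whether it contains any tensors at all. This is just an illustrative sketch (the checkpoint path is an example), not part of text-generation-inference:
from pathlib import Path
import torch

ckpt_dir = Path("./llama-30b-supercot")  # example path
for bin_file in sorted(ckpt_dir.glob("pytorch_model-*.bin")):
    state_dict = torch.load(bin_file, map_location="cpu")
    if not state_dict:
        # A shard with no weights is what tripped up the conversion above;
        # it can simply be removed or skipped before converting.
        print(f"{bin_file.name} contains no tensors")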
Running into the same issue with Vicuna 13B, which is also technically LLaMa 13B. Tried these two:
- TheBloke/vicuna-13B-1.1-HF
- eachadea/vicuna-13b-1.1
RuntimeError: The file size different is more than 1%:
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/pytorch_model-00003-of-00003.bin: 6506663689
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/model-00003-of-00003.safetensors: 6178962272
Running into the same issue with Vicuna 13B, which is also technically LLaMa 13B. Tried these two: TheBloke/vicuna-13B-1.1-HF, eachadea/vicuna-13b-1.1
RuntimeError: The file size different is more than 1%:
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/pytorch_model-00003-of-00003.bin: 6506663689
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/model-00003-of-00003.safetensors: 6178962272
With this commit the threshold was increased to 5%, but manually calculating the difference with your files it comes out to ~5.3%, so that won't fix it. To disable the check and force the safetensors to be generated, try removing line 65 in server/text_generation_server/utils/convert.py if you are running it directly, or, if you are using Docker, use the Dockerfile I posted above to extend and patch the image.
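If you want to check your own shards against the threshold before patching anything, the relative difference can be computed directly. A small sketch (paths are examples; the exact formula convert.py uses may differ slightly):
from pathlib import Path

pt = Path("pytorch_model-00003-of-00003.bin")      # example paths
sf = Path("model-00003-of-00003.safetensors")
pt_size, sf_size = pt.stat().st_size, sf.stat().st_size
# Relative difference between the original pickle shard and the converted file.
diff = abs(pt_size - sf_size) / pt_size
print(f"relative size difference: {diff:.1%}")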