
0.9.4 docker image using cuda 11.7 instead of cuda 11.8

Open pravingadakh opened this issue 1 year ago • 16 comments

System Info

We are trying to run the llama2-70B model and have noticed that the huggingface/text-generation-inference:0.9.1 docker image uses CUDA 11.8:

>>> import torch
>>> torch.version.cuda
'11.8'

However, with the 0.9.4 image it shows CUDA 11.7 (the same is the case with latest):

>>> import torch
>>> torch.version.cuda
'11.7'

We believe this is causing an issue with starting the inference server.

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

Run the Docker container with the 0.9.4 image and then check the torch CUDA version.
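
A minimal sketch of what that looks like (the registry prefix, model id, volume path, and shard count here are placeholders, not taken from this issue):

docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:0.9.4 \
    --model-id meta-llama/Llama-2-70b-hf --num-shard 8

# in another terminal, check which CUDA toolkit torch was built against
docker exec -it <container-name> python -c "import torch; print(torch.version.cuda)"
# 0.9.1 prints 11.8, while 0.9.4 prints 11.7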

Expected behavior

The expected behaviour is to use CUDA 11.8 rather than 11.7.

pravingadakh avatar Jul 31 '23 09:07 pravingadakh

I have the same issue with 1.0.0. This makes it impossible to use on H100 GPUs.

svenschultze avatar Jul 31 '23 13:07 svenschultze

So I upgraded the torch CUDA version to 11.8, and the Python interpreter correctly shows 11.8 now:

conda install --force pytorch==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

However, that did not help and I still got the following error (which I initially believed was caused by the CUDA version):

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
    server.serve(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
    return FlashLlama(

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 31, in __init__
    self.process_group, rank, world_size = initialize_torch_distributed()

  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/dist.py", line 54, in initialize_torch_distributed
    torch.cuda.set_per_process_memory_fraction(MEMORY_FRACTION, device)

  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/memory.py", line 118, in set_per_process_memory_fraction
    torch._C._cuda_setMemoryFraction(fraction, device)

RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Then I tried disabling custom kernels, and that seems to have done the trick, although I'm not sure why it helped; maybe the issue is with flash attention?

Wanted to add that I'm running llama2-70B on 8 A100 (40GB) GPUs.
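
Concretely, "disabling custom kernels" here just means passing the launcher flag that comes up later in this thread; when using the Docker image these are simply the arguments after the image name (the model id and shard count are my setup, so treat them as placeholders):

text-generation-launcher \
    --model-id meta-llama/Llama-2-70b-hf \
    --num-shard 8 \
    --disable-custom-kernels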

pravingadakh avatar Jul 31 '23 13:07 pravingadakh

Very odd, the version is indeed 11.8 in the Dockerfile for 0.9.4: https://github.com/huggingface/text-generation-inference/blob/v0.9.4/Dockerfile#L44

Narsil avatar Jul 31 '23 13:07 Narsil

huggingface/text-generation-inference:0.9.1

Try actually using 0.9.4?

Narsil avatar Jul 31 '23 13:07 Narsil

@Narsil I tried with the 0.9.4 image; torch.version.cuda shows 11.7 only. However, even after upgrading the CUDA version it did not help. Disabling custom kernels helped, but I would prefer not to do it. Can you help me identify what would cause this CUDA error? Is it somehow related to flash attention?

RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

pravingadakh avatar Jul 31 '23 14:07 pravingadakh

The error means that you're trying to load a cuda kernel that was compiled with a different version.

I'm going to try and confirm this.
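
If you want to check for a mismatch yourself, these are generic commands you can run inside the container (nothing TGI-specific, and nvcc may not be present in every image):

# CUDA toolkit version the installed torch build was compiled against
python -c "import torch; print(torch.version.cuda)"

# highest CUDA version the host driver supports (shown in the top banner)
nvidia-smi

# compiler used for locally built extensions, if nvcc is available
nvcc --version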

Narsil avatar Jul 31 '23 14:07 Narsil

Hmm, I'm confused. I indeed see:

>>> torch.version.cuda
'11.7'

However, the build script definitely says it's asking for 11.8... I'm going to stop for today; if you can, could you check whether newer images still have the issue (I'm kind of hoping for a transient issue with conda when we built the release):

sha-7766fee

Narsil avatar Jul 31 '23 17:07 Narsil

Even with sha-7766fee I am seeing the same issue.

pravingadakh avatar Jul 31 '23 17:07 pravingadakh

@Narsil Were you able to figure out the issue? Also, I have the llama2 model deployed on 8 A100 (40 GB) GPUs and had a couple of quick questions around that. Can this model make use of flash attention? Is there any downside to running it with custom kernels disabled? Thanks in advance.

pravingadakh avatar Aug 02 '23 18:08 pravingadakh

Just created a PR for it.

We're going to add the peft dependency and others which already depend on PyTorch. This should fix it; however, I'll also incorporate your change, since otherwise it's always going to be a cat and mouse game to see who's screwing up our CUDA version.

Narsil avatar Aug 03 '23 14:08 Narsil

Ok, it is merged. Could you try on latest (once it finishes uploading)?

https://github.com/huggingface/text-generation-inference/actions/runs/5755083304

Edit: sha-f91e9d2
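
To test that specific build, something along these lines should work (assuming the sha- tags are published to the same registry as the release images):

docker pull ghcr.io/huggingface/text-generation-inference:sha-f91e9d2
docker run --rm --entrypoint python \
    ghcr.io/huggingface/text-generation-inference:sha-f91e9d2 \
    -c "import torch; print(torch.version.cuda)"
# should now print 11.8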

Narsil avatar Aug 03 '23 19:08 Narsil

It did not fix it for me. I still got:

RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

I'm running llama-2-13b-chat-hf on a single A100 40GB GPU.

marioplumbarius avatar Aug 07 '23 20:08 marioplumbarius

We're mostly running on those... Do you mind opening a new issue and giving all the details you can provide?

Narsil avatar Aug 08 '23 07:08 Narsil

@Narsil With the sha-f91e9d2 image I am seeing the torch CUDA version as 11.8 now. However, I still need to add --disable-custom-kernels in order to deploy the llama model.

@marioluan You can try adding --disable-custom-kernels to the text-generation-launcher command.

pravingadakh avatar Aug 08 '23 08:08 pravingadakh

My bad, I was trying to deploy to a host running NVIDIA driver 470.182.03 and CUDA 11.4. CUDA 11.7 (and 11.8) are not compatible with that NVIDIA driver version. Unfortunately, I can't upgrade the driver. Is there any chance of providing support for CUDA 11.4 in addition to 11.7 and 11.8?

marioplumbarius avatar Aug 15 '23 20:08 marioplumbarius

Sorry, no. The 11.4 drivers actually have some stability issues regarding BF16/F16, so I'm not sure we want to support them.

You should, however, be able to modify the source in order to build for 11.4 yourself (just modify the Dockerfile and potentially pyproject.toml, as torch is built against cu11{7,8} for >2.0, if I'm not mistaken).

Narsil avatar Aug 16 '23 15:08 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 16 '24 01:04 github-actions[bot]