text-generation-inference
0.9.4 docker image using cuda 11.7 instead of cuda 11.8
System Info
We are trying to run the llama2-70B model and have noticed that the huggingface/text-generation-inference:0.9.1 docker image uses CUDA 11.8:
>>> import torch
>>> torch.version.cuda
'11.8'
However, with the 0.9.4 image it shows CUDA 11.7 (same is the case with latest):
>>> import torch
>>> torch.version.cuda
'11.7'
We believe this is causing issues with starting the inference server.
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
Run a docker container with the 0.9.4 image and then check the torch CUDA version.
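For example, a quick way to check is to override the image entrypoint (a sketch; it assumes python is on the PATH inside the container and that the image name resolves from your registry, e.g. with a ghcr.io/ prefix):
docker run --rm --entrypoint python \
  huggingface/text-generation-inference:0.9.4 \
  -c "import torch; print(torch.version.cuda)"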
Expected behavior
The expected behaviour is to use CUDA 11.8 rather than 11.7.
I have the same issue with 1.0.0. This makes it impossible to use on H100 GPUs.
So I upgraded the torch CUDA version to 11.8, and the Python interpreter now correctly shows 11.8:
conda install --force pytorch==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
However, that did not help and I still got the following error (which I initially believed was caused by the CUDA version):
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 185, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 31, in __init__
self.process_group, rank, world_size = initialize_torch_distributed()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/dist.py", line 54, in initialize_torch_distributed
torch.cuda.set_per_process_memory_fraction(MEMORY_FRACTION, device)
File "/opt/conda/lib/python3.9/site-packages/torch/cuda/memory.py", line 118, in set_per_process_memory_fraction
torch._C._cuda_setMemoryFraction(fraction, device)
RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Then I tried disabling custom kernels and that seems to have done the trick, although I'm not sure why it helped. Maybe the issue is with flash attention?
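For anyone else hitting this, the flag is passed to the launcher, along these lines (the model id and shard count here are only examples matching my setup):
text-generation-launcher --model-id meta-llama/Llama-2-70b-hf --num-shard 8 --disable-custom-kernels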
Wanted to add that I'm running llama2-70B on 8 A100 (40GB) GPUs.
Very odd, the version is indeed 11.8 in the Dockerfile for 0.9.4: https://github.com/huggingface/text-generation-inference/blob/v0.9.4/Dockerfile#L44
huggingface/text-generation-inference:0.9.1
Try actually using 0.9.4?
@Narsil I tried with the 0.9.4 image; torch.version.cuda still shows 11.7. However, even after upgrading the CUDA version it did not help. Disabling custom kernels helped, but I would prefer not to do that. Can you help me identify what would cause the above CUDA error? Is it somehow related to flash attention?
RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
The error means that you're trying to load a CUDA kernel that was compiled with a CUDA toolchain your driver doesn't support.
I'm going to try and confirm this.
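One quick way to see the mismatch is to compare the CUDA toolkit version torch reports inside the container with the driver version on the host, since the driver must be new enough for the toolchain the kernels were compiled with (a sketch, run inside the running container):
python -c "import torch; print(torch.version.cuda)"
nvidia-smi --query-gpu=driver_version --format=csv,noheader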
Hmm I'm confused. I indeed see :
>>> torch.version.cuda
'11.7'
However the build script definitely says it's asking for 11.8... I'm going to stop for today. If you can, could you check whether newer images still have the issue (I'm kind of hoping for a transient issue with conda when we built the release):
Even with sha-7766fee I am seeing the same issue.
@Narsil Were you able to figure out the issue? Also, I have the llama2 model deployed on 8 A100 (40 GB) GPUs and had a couple of quick questions around that. Can this model make use of flash attention? Is there any downside to running it with custom kernels disabled? Thanks in advance.
Just created a PR for it.
We're going to add the peft dependency and others which already depend on PyTorch.
This should fix it; however, I'll also incorporate your change, since otherwise it's always going to be a cat-and-mouse game to see who's screwing up our CUDA version.
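In practice that means pinning pytorch-cuda=11.8 in the image's conda install, as in the command shown earlier, and optionally failing the docker build if the wrong toolkit sneaks back in. A sketch of such a build-time check (not necessarily what the PR actually does):
RUN python -c "import torch; assert torch.version.cuda == '11.8', torch.version.cuda"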
Ok, it is merged, could you try on latest (once it finishes uploading)?
https://github.com/huggingface/text-generation-inference/actions/runs/5755083304
Edit: sha-f91e9d2
That did not fix it for me. I still got:
RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
I'm running llama-2-13b-chat-hf on 1xA100 40GB GPU.
We're mostly running on those... Do you mind opening a new issue and giving all the details you can provide?
@Narsil With the sha-f91e9d2 image I am seeing the torch CUDA version as 11.8 now. However, I still need to add --disable-custom-kernels in order to deploy the llama model.
@marioluan You can try adding --disable-custom-kernels to the text-generation-launcher command.
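For a docker deployment that would look roughly like this (a sketch; shm size, ports, and volume mounts depend on your setup):
docker run --gpus all --shm-size 1g -p 8080:80 \
  huggingface/text-generation-inference:sha-f91e9d2 \
  --model-id meta-llama/Llama-2-13b-chat-hf \
  --disable-custom-kernels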
My bad, I was trying to deploy to a host running nvidia driver 470.182.03 and CUDA 11.4. CUDA 11.7 (and 11.8) are not compatible with that nvidia driver version. Unfortunately I can't upgrade the driver. Is there any chance of providing support for CUDA 11.4 in addition to 11.7 and 11.8?
Sorry, no: the 11.4 drivers actually have some stability issues regarding BF16/F16, so I'm not sure we want to support them.
You should however be able to modify the source in order to build for 11.4 yourself: just modify the Dockerfile and potentially pyproject.toml, since torch is built against cu11{7,8} for >2.0 (if I'm not mistaken).
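Roughly, that would look like the following (a sketch only; the exact lines to change depend on the tag you build from):
git clone --branch v0.9.4 https://github.com/huggingface/text-generation-inference
cd text-generation-inference
# Edit the CUDA 11.8 references in the Dockerfile and the torch pin in
# pyproject.toml to match your target toolkit. Note that prebuilt torch >= 2.0
# wheels only cover cu117/cu118, so an 11.4 build may need an older torch or a
# source build of torch.
docker build -t text-generation-inference:cu114 .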