[BUG] workers do not launch on g5.12xlarges for the latest image 0.5.0.
I'm stuck in a repeated deployment loop when using the image anyscale/ray-llm:latest on a g5.12xlarge instance. The worker never seems to connect back to the head node, which leads me to believe the Docker image is failing during deployment, yet I didn't notice any error logs reported to the head node while it was happening. The result is a continuous cycle of workers being launched and shut down. Possibly due to the CUDA updates, but I'm not 100% sure.
anyscale/ray-llm:0.4.0 launches as expected with no configuration changes.
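For context, this is roughly how I was checking for worker registration from the head node. It's only a sketch: the cluster config filename is a placeholder and the log paths assume Ray's default session directory, so they may differ in other setups.

```bash
# Open a shell on the head node (cluster config filename is a placeholder).
ray attach cluster.yaml

# Autoscaler view: pending/failed worker nodes that never register show up here.
ray status

# Autoscaler/monitor logs under Ray's default session directory
# (assumed default location; adjust if your setup relocates logs).
tail -f /tmp/ray/session_latest/logs/monitor.log
grep -iE "error|fail" /tmp/ray/session_latest/logs/*.log | head
```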
Hi, please provide repro steps if possible, so that our team can take a look!
- Update the config to match the requirements of my AWS env:
  - SGs (security groups)
  - region
  - updated `gpu_worker_g5` to include `CPU` and `GPU` values
- Deploy via `ray up`.
- Use `ray attach`.
- Use `rayllm run --model models/continuous_batching/amazon--LightGPT.yaml`.
- Continuous loop on deploy (roughly the sequence sketched below).
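Putting the steps above together, this is roughly the sequence I run. The cluster config filename is a placeholder, and the `serve status` call is only there to watch the application cycling through DEPLOYING; on 0.4.0 the same sequence deploys as expected.

```bash
# Launch the cluster from my AWS-adjusted cluster config (filename is a placeholder).
ray up my-g5-cluster.yaml

# Open a shell on the head node.
ray attach my-g5-cluster.yaml

# Start the model; this is where the deploy loop begins on 0.5.0.
rayllm run --model models/continuous_batching/amazon--LightGPT.yaml

# In another shell on the head node: watch the Serve application state.
watch -n 10 serve status
```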
I don't believe the AMI has the drivers installed for CUDA 12. Could that be the issue?
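A quick way to check that on one of the g5 workers is sketched below. The in-container check is an assumption on my part: it presumes PyTorch is installed in the ray-llm image and that the NVIDIA container toolkit is set up on the host.

```bash
# On the worker host: nvidia-smi prints the driver version and the highest
# CUDA version that driver supports in its header line.
nvidia-smi

# Inside the 0.5.0 image: print the CUDA version PyTorch was built against
# and whether it can actually see the GPUs (assumes torch is in the image).
docker run --rm --gpus all anyscale/ray-llm:0.5.0 \
  python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```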
@sihanwang41 any update on investigating this issue?
FWIW, ray-llm is not deployable in its current state on images >= 0.5.0. This is not limited to g5.12xlarges.
+1 on this. I'm having to use 0.4.0, otherwise the deployment gets stuck in a DEPLOYING loop with 0.5.0. @JGSweets thanks for your comment, it got me up and running.