[BUG] workers do not launch on g5.12xlarges for the latest image 0.5.0.
I'm stuck in a repeated deployment loop when using the image anyscale/ray-llm:latest on a g5.12xlarge instance. The worker never seems to connect back to the head node, which leads me to believe the Docker image is failing during deployment, yet I didn't notice any error logs reported to the head node while it was happening. The result is a continuous cycle of workers being launched and shut down. Possibly due to the CUDA updates, but I'm not 100% sure.
anyscale/ray-llm:0.4.0 launches as expected with no configuration changes.
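For context, this is roughly how I was checking for worker registration from the head node. It's only a sketch: the cluster config filename is a placeholder and the log paths assume Ray's default session directory, so they may differ in other setups.

```bash
# Open a shell on the head node (cluster config filename is a placeholder).
ray attach cluster.yaml

# Autoscaler view: pending/failed worker nodes that never register show up here.
ray status

# Autoscaler/monitor logs under Ray's default session directory
# (assumed default location; adjust if your setup relocates logs).
tail -f /tmp/ray/session_latest/logs/monitor.log
grep -iE "error|fail" /tmp/ray/session_latest/logs/*.log | head
```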
Hi, please provide repro steps if possible, so that our team can take a look!
- Update the config to match the requirements of my AWS env:
  - SGs (security groups)
  - region
  - updated `gpu_worker_g5` to include `CPU` and `GPU` values
- Deploy via `ray up`.
- Use `ray attach`.
- Use `rayllm run --model models/continuous_batching/amazon--LightGPT.yaml`.
- Continuous loop on deploy (roughly the sequence sketched below).
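Putting the steps above together, this is roughly the sequence I run. The cluster config filename is a placeholder, and the `serve status` call is only there to watch the application cycling through DEPLOYING; on 0.4.0 the same sequence deploys as expected.

```bash
# Launch the cluster from my AWS-adjusted cluster config (filename is a placeholder).
ray up my-g5-cluster.yaml

# Open a shell on the head node.
ray attach my-g5-cluster.yaml

# Start the model; this is where the deploy loop begins on 0.5.0.
rayllm run --model models/continuous_batching/amazon--LightGPT.yaml

# In another shell on the head node: watch the Serve application state.
watch -n 10 serve status
```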
I don't believe the AMI has the drivers installed for CUDA 12. Could that be the issue?
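A quick way to check that on one of the g5 workers is sketched below. The in-container check is an assumption on my part: it presumes PyTorch is installed in the ray-llm image and that the NVIDIA container toolkit is set up on the host.

```bash
# On the worker host: nvidia-smi prints the driver version and the highest
# CUDA version that driver supports in its header line.
nvidia-smi

# Inside the 0.5.0 image: print the CUDA version PyTorch was built against
# and whether it can actually see the GPUs (assumes torch is in the image).
docker run --rm --gpus all anyscale/ray-llm:0.5.0 \
  python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```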
@sihanwang41 any update on investigating this issue?
FWIW, ray-llm is not deployable in its current state on images >= 0.5.0. This is not limited to g5.12xlarges.
+1 on this. I'm having to use 0.4.0, otherwise the deployment gets stuck in a DEPLOYING loop with 0.5.0. @JGSweets thanks for your comment, it got me up and running.