ray-llm
Possible to run on a single 8x A100 machine on-premise?
I would like to run Aviary on a single on-premise machine, but I am not able to get the models to load: it looks for actor/worker resource nodes that don't exist. Do you have an example config for a single on-premise machine?
Aviary requires a Ray Cluster to run. You can set up an on-premise Ray Cluster (https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html). Because Aviary uses Ray Custom Resources to ensure that each model is scheduled on an intended GPU type, you will need to set those in both the Ray cluster configuration and Aviary model yamls.
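As a rough sketch of what that looks like on the cluster side (the IPs, SSH user, and the `accelerator_type_a100` resource name are placeholders; pick a resource name that matches your GPU type), the on-prem cluster yaml can attach the custom resource in the Ray start commands:

```yaml
# Sketch of an on-prem Ray cluster config using the local node provider.
# IPs, ssh_user, and the custom resource name are placeholders.
cluster_name: aviary-onprem
provider:
  type: local
  head_ip: 10.0.0.1
  worker_ips: [10.0.0.2]
auth:
  ssh_user: ubuntu
# Attach the custom resource when Ray starts on each node, so Aviary
# can schedule each model onto the intended GPU type:
head_start_ray_commands:
  - ray stop
  - ray start --head --resources '{"accelerator_type_a100": 1}'
worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379 --resources '{"accelerator_type_a100": 1}'
```

The same resource name then goes into the model yamls, so Ray's scheduler can match models to nodes.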
You can edit the EC2 cluster config to target your on-prem nodes and desired node type instead.
Alternatively, if you just want to experiment, you can do the following:
- SSH into your GPU node,
- load the Docker image / install Aviary locally with `pip install -e ".[backend, frontend]"`,
- edit the `scaling_config` section in the model configuration and change `accelerator_type_[TYPE]` to `accelerator_type_a100`,
- start Ray with `ray start --resources "{\"accelerator_type_a100\": 1}"` (the actual number of GPUs will be detected automatically),
- start Aviary with `aviary run --model model_yaml_with_edited_scaling_config.yaml`
This will start a Ray cluster composed of just this single node.
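For illustration, the edited `scaling_config` in the model yaml would look roughly like this (the fields besides the custom resource are illustrative, not the exact Aviary schema; the key point is that `accelerator_type_a100` matches the resource passed to `ray start`):

```yaml
# Excerpt from the model yaml -- a sketch, not the full schema.
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  resources_per_worker:
    # Must match the custom resource name given to `ray start --resources`
    accelerator_type_a100: 0.01
```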
Perfect, thank you. Got it all working, along with the frontend in a Docker container. One problem I encountered was that both the frontend and backend default to port 8000, so the frontend needed to be started like this: `serve run --host 0.0.0.0 --port 7860 aviary.frontend.app:app`
@Yard1 what do you think about making the frontend run on port 7860 by default to be consistent with normal Gradio and not cause this problem?
I think that's a good idea!