
[Train] Ray Train should support AWS Trainium instances

Open gilvikra opened this issue 2 years ago • 4 comments

Description

I would like AWS Trainium instances, which require the "xla" torch backend, to be supported by Ray.

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/distributed_data_parallel.html#neur[…]rial

There is a strong push toward Trainium, and right now Ray does not seem to support it natively the way it supports CPUs and GPUs.

Use case

Use AWS Trainium chips for efficient, cost-effective distributed training on top of Ray.

gilvikra • Mar 21 '23 02:03

+1. Given the shortage of GPUs in the industry, it would be beneficial for us to have Ray tested and supported on AWS Trainium, to unblock LLM use cases.

swaroopch • Jul 11 '23 20:07

Follow-up issue: https://github.com/ray-project/ray/issues/38473. This improves the maintainability of https://github.com/ray-project/ray/pull/37998 by removing the need to continuously update a hard-coded dictionary of EC2 instance types to neuron core counts.
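To illustrate the maintainability point, here is a minimal sketch contrasting a hand-maintained table with counting cores from a device listing. It is not the code from either PR. The table entries, the function names, and the `nc_count` field in the JSON listing are assumptions made for illustration only.

```python
import json

# Illustrative static table: every new instance type needs a manual update.
STATIC_NEURON_CORES = {
    "trn1.2xlarge": 2,
    "trn1.32xlarge": 32,
}


def cores_from_table(instance_type: str) -> int:
    """Look up the core count; unknown instance types fall back to 0."""
    return STATIC_NEURON_CORES.get(instance_type, 0)


def cores_from_device_listing(listing_json: str) -> int:
    """Sum per-device core counts from a JSON device listing.

    Assumes the listing is a JSON array of objects, each with an
    "nc_count" field (a hypothetical shape for this sketch).
    """
    devices = json.loads(listing_json)
    return sum(int(device.get("nc_count", 0)) for device in devices)


if __name__ == "__main__":
    sample_listing = json.dumps([{"nc_count": 2}, {"nc_count": 2}])
    print(cores_from_table("trn1.32xlarge"))
    print(cores_from_table("some-new.type"))  # hypothetical type, not in the table
    print(cores_from_device_listing(sample_listing))
```

The second approach needs no code change when a new instance type appears, provided the listing is available on the node at startup.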

pdames • Aug 15 '23 18:08

@woshiyyya can you take a look; I'm adding triage as well in case we want to punt this to the next on-call rotation.

anyscalesam • Apr 02 '24 19:04

@anyscalesam OK. Will take a look at the CI issue of #39130.

woshiyyya • Apr 02 '24 21:04