Support for AWS Trainium
Feature request
Add first-class support for running VERL on AWS Trainium (Trn1/Trn1n) via the Neuron SDK (PyTorch NeuronX / NxD Training). Ideally, VERL should be able to run PPO/GRPO-style RL training loops (including agent-loop and multi-turn) on Trainium in much the same way it currently supports NVIDIA GPUs, AMD ROCm, and Ascend NPUs, reusing the existing Ray-based orchestration.
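To make the request concrete, here is a minimal sketch of how a Neuron backend could slot into a device-selection helper alongside the existing CUDA path. This is purely illustrative: the `is_neuron_available` helper and the `"neuron"` device name are my assumptions, not existing VERL APIs; the only real external dependency referenced is the `torch_neuronx` package shipped with the AWS Neuron SDK.

```python
def is_neuron_available() -> bool:
    """Detect AWS Neuron support by probing for the Neuron SDK's
    torch_neuronx package (assumption: presence implies usable devices)."""
    try:
        import torch_neuronx  # noqa: F401  # part of the AWS Neuron SDK
        return True
    except ImportError:
        return False


def get_device_name() -> str:
    """Resolve the accelerator backend: prefer CUDA, then Neuron, else CPU.
    A hypothetical sketch of where a "neuron" branch could live in VERL's
    device abstraction, next to the existing cuda/rocm/npu handling."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    if is_neuron_available():
        return "neuron"
    return "cpu"
```

On a Trn1 instance with the Neuron SDK installed, this would resolve to `"neuron"`; the same hook could then drive Ray worker placement via a custom resource tag.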
Motivation
NVIDIA GPUs are great, but they’re expensive and often hard to get at scale. AWS Trainium has matured considerably for LLM training and offers a more cost-effective path for large workloads on AWS. It would be awesome if VERL could run on Trainium out of the box, so we can reuse the existing PPO/GRPO/RLHF stack while taking advantage of cheaper, more readily available hardware.
Your contribution
I can help with brainstorming the design, testing on Trainium, and potentially contributing to the implementation.