Support for AWS Trainium
Feature request
Add first-class support for running VERL on AWS Trainium (Trn1/Trn1n) via the Neuron SDK (PyTorch NeuronX / NxD Training). Ideally, VERL should be able to run PPO/GRPO-style RL training loops (including agent-loop and multi-turn) on Trainium in much the same way it currently supports NVIDIA GPUs, AMD ROCm, and Ascend NPUs, reusing the existing Ray-based orchestration.
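To make the request concrete, here is a minimal sketch of how a Neuron backend could slot into a device-selection helper alongside the existing CUDA path. This is purely illustrative: the `is_neuron_available` helper and the `"neuron"` device name are my assumptions, not existing VERL APIs; the only real external dependency referenced is the `torch_neuronx` package shipped with the AWS Neuron SDK.

```python
def is_neuron_available() -> bool:
    """Detect AWS Neuron support by probing for the Neuron SDK's
    torch_neuronx package (assumption: presence implies usable devices)."""
    try:
        import torch_neuronx  # noqa: F401  # part of the AWS Neuron SDK
        return True
    except ImportError:
        return False


def get_device_name() -> str:
    """Resolve the accelerator backend: prefer CUDA, then Neuron, else CPU.
    A hypothetical sketch of where a "neuron" branch could live in VERL's
    device abstraction, next to the existing cuda/rocm/npu handling."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    if is_neuron_available():
        return "neuron"
    return "cpu"
```

On a Trn1 instance with the Neuron SDK installed, this would resolve to `"neuron"`; the same hook could then drive Ray worker placement via a custom resource tag.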
Motivation
NVIDIA GPUs are great, but they’re expensive and often hard to get at scale. AWS Trainium has matured considerably for LLM training and offers a more cost-effective path for large workloads on AWS. It would be awesome if VERL could run on Trainium out of the box, so we can reuse the existing PPO/GRPO/RLHF stack while taking advantage of cheaper, more readily available hardware.
Your contribution
I can help with brainstorming the design, testing on Trainium, and potentially contributing to the implementation.