
Performance comparison with the Pytorch ecosystem

monatis opened this issue 6 months ago · 3 comments

Hi, congrats on the great job you're doing in this repo. I set out with a similar goal of implementing a hackable, high-performance inference and post-training package in JAX and then found this repository. It seems that most of the ideas I had in mind were already implemented here. I wonder if you have any comparative benchmarks of inference and training performance against the Pytorch ecosystem as the baseline, e.g., VLLM for inference and Unsloth for post-training. I know that the true value proposition might be the performance on TPUs, but it would be great if we could achieve comparable performance with the highly optimized Pytorch ecosystem on GPUs as well.

monatis avatar Jun 29 '25 13:06 monatis

Hi @monatis, and apologies for the late reply!

Thanks for the kind words and for checking out the repo. I haven’t put together a formal benchmark report comparing against VLLM or Unsloth yet, but based on the profiling and benchmarking I’ve done so far, EasyDeL tends to compare closely with — and in some cases outperform — VLLM on long sequence lengths (e.g., >64K).

When it comes to training, JAX-based trainers generally shine in terms of speed and scalability. In my experience, EasyDeL demonstrates better scaling laws, faster throughput, and more efficient sharding strategies compared to Unsloth, especially when leveraging TPUs. That said, GPU performance is also competitive (with FSDP, TP, and SP sharding), and I'm planning to document that more thoroughly in future releases.
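For context, a sharding spec over device-mesh axes such as (dp, fsdp, tp, sp) is typically resolved against the available device count along these lines. This is a pure-Python sketch of the general idea; `resolve_axis_dims` and the `-1` "fill with remaining devices" convention here are illustrative assumptions, not EasyDeL's actual API:

```python
# Hypothetical helper: resolve a 4D sharding spec like (dp, fsdp, tp, sp)
# against the number of available accelerators. A single -1 entry is
# replaced so the product of all axis sizes equals the device count.
from math import prod


def resolve_axis_dims(axis_dims, num_devices):
    """Replace the single -1 entry so prod(axis_dims) == num_devices."""
    fixed = prod(d for d in axis_dims if d != -1)
    if num_devices % fixed != 0:
        raise ValueError(f"{num_devices} devices not divisible by {fixed}")
    return tuple(num_devices // fixed if d == -1 else d for d in axis_dims)


# Example: 8 GPUs, all remaining capacity assigned to the fsdp axis.
print(resolve_axis_dims((1, -1, 1, 1), 8))  # (1, 8, 1, 1)
```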

erfanzar avatar Jul 10 '25 22:07 erfanzar

Hi @erfanzar, thanks for your sincere response. At the startup I'm leading, we're currently making use of Unsloth + VLLM, but I'd like to replace it with a JAX-based framework based on past experiences with JAX in other projects. I've already started an experimental library in JAX, but it can take a lot of time and effort to make it production-ready, and I think EasyDeL is very promising in that regard. I'm not sure about your position on external contributions, but I'd like to contribute such benchmarks first and then more meaningful contributions such as new model architectures, other optimizations, etc.

If it's OK with you, I can design proper benchmarks and contribute reproducible scripts. This will also help me better understand the internals of EasyDeL before contributing lower-level implementations.
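As a starting point, such a benchmark script could be framework-agnostic: time a generation call and report decoded tokens per second. A minimal sketch, assuming a hypothetical `generate_fn(prompts, max_new_tokens)` callable standing in for whichever backend (EasyDeL, VLLM, etc.) is under test:

```python
# Minimal throughput harness sketch. The generate_fn signature is a
# placeholder assumption, not a real EasyDeL or VLLM API; a real script
# would wrap each backend behind this interface.
import time


def measure_throughput(generate_fn, prompts, max_new_tokens, warmup=1, iters=3):
    """Return decoded tokens per second, averaged over `iters` timed runs."""
    for _ in range(warmup):  # warm up JIT compilation / caches before timing
        generate_fn(prompts, max_new_tokens)
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(iters):
        outputs = generate_fn(prompts, max_new_tokens)
        total_tokens += sum(len(o) for o in outputs)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Dummy generator standing in for a real model call.
def dummy_generate(prompts, max_new_tokens):
    return [[0] * max_new_tokens for _ in prompts]


print(f"{measure_throughput(dummy_generate, ['a', 'b'], 128):.1f} tokens/s")
```

Sweeping `max_new_tokens` and prompt lengths (e.g., up to the >64K regime mentioned above) would make the cross-framework comparison directly reproducible.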

monatis avatar Jul 17 '25 13:07 monatis

@erfanzar Hi, any progress?

MoFHeka avatar Nov 04 '25 22:11 MoFHeka