recsys-examples
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
NVIDIA RecSys Examples
Overview
NVIDIA RecSys Examples is a collection of optimized recommender models and components.
The project includes:
- Examples for large-scale HSTU ranking and retrieval models through TorchRec and Megatron-Core integration
- HSTU (Hierarchical Sequential Transduction Unit) attention operator support
- Dynamic Embeddings with GPU acceleration
What's New
- [2025/12/10] 🎉v25.11 released!
- DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an entry in the dynamic embedding table. Controlling admission prevents very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
- [2025/11/11] 🎉v25.10 released!
- HSTU training example supports sequence parallelism.
- DynamicEmb supports LRU score checkpointing and gradient clipping.
- Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention, and extended HSTU training support to the SM89 GPU architecture.
- [2025/10/20] 🎉v25.09 released!
- Integrated prefetching and caching into the HSTU training example.
- DynamicEmb now supports distributed embedding dumping and memory scaling.
- Added kernel fusion in the HSTU block for inference, including KVCache fixes.
- HSTU attention now supports FP8 quantization.
More
- [2025/9/8] 🎉v25.08 released!
- Added cache support for dynamicemb, enabling seamless hot embedding migration between cache and storage.
- Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
- Enabled evaluation mode support for dynamicemb.
- [2025/8/1] 🎉v25.07 released!
- Released HSTU inference benchmark, including paged kvcache HSTU kernel, kvcache manager based on trt-llm, CUDA graph, and other optimizations.
- Added support for Tensor Parallelism in the HSTU layer.
- [2025/7/4] 🎉v25.06 released!
- Improved dynamicemb lookup module performance and added LFU eviction support.
- Added pipeline support for the HSTU example, recompute support for the HSTU layer, and custom CUDA ops for jagged tensor concatenation.
- [2025/5/29] 🎉v25.05 released!
- Enhancements to the dynamicemb functionality, including support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
- Fusion of operations like layernorm and dropout in the HSTU layer, resulting in about 1.2x end-to-end speedup.
- Fixed convergence issues on the Kuairand dataset.
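To make the embedding admission idea from the v25.11 notes concrete, here is a minimal pure-Python sketch of one possible policy: admit a feature ID only after it has been seen a minimum number of times. This is not the DynamicEmb API. The names `FrequencyAdmission`, `ToyDynamicTable`, and `threshold` are hypothetical and exist only for illustration, and the frequency rule is an assumed example of an admission strategy, not necessarily the one the library uses.

```python
from collections import Counter
from typing import Dict, List


class FrequencyAdmission:
    """Hypothetical policy: admit an ID once it has been seen `threshold` times."""

    def __init__(self, threshold: int = 2) -> None:
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        self.seen: Counter = Counter()

    def admit(self, feature_id: int) -> bool:
        # Count this occurrence, then check whether the ID is frequent enough.
        self.seen[feature_id] += 1
        return self.seen[feature_id] >= self.threshold


class ToyDynamicTable:
    """Toy stand-in for a dynamic embedding table (illustrative only)."""

    def __init__(self, dim: int, policy: FrequencyAdmission) -> None:
        self.dim = dim
        self.policy = policy
        self.rows: Dict[int, List[float]] = {}

    def lookup(self, feature_id: int) -> List[float]:
        row = self.rows.get(feature_id)
        if row is not None:
            return row
        if self.policy.admit(feature_id):
            # Admitted: allocate a stored, trainable row for this ID.
            row = [0.0] * self.dim
            self.rows[feature_id] = row
            return row
        # Rejected: return a default vector without storing anything.
        return [0.0] * self.dim


def run_demo() -> List[int]:
    table = ToyDynamicTable(dim=4, policy=FrequencyAdmission(threshold=2))
    for fid in [7, 7, 7, 42, 99, 7]:
        table.lookup(fid)
    # IDs 42 and 99 appeared only once, so they never received a row.
    return sorted(table.rows)


if __name__ == "__main__":
    print(run_demo())
```

In this toy version the exact counter still stores one integer per ID ever seen, which is far smaller than an embedding row plus optimizer state but not free. A production system would likely bound that cost in some other way.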
Get Started
Supported examples:
- HSTU recommender examples
Contribution Guidelines
Please see our contributing guidelines for details on how to contribute to this project.
Resources
Video
Blog
Community
Join our community channels to ask questions, provide feedback, and interact with other users and developers:
- GitHub Issues: For bug reports and feature requests
- NVIDIA Developer Forums
References
If you use RecSys Examples in your research, please cite:
@Manual{,
title = {RecSys Examples: A collection of recommender system implementations},
author = {NVIDIA Corporation},
year = {2024},
url = {https://github.com/NVIDIA/recsys-examples},
}
For more citation information and referenced papers, see CITATION.md.
License
This project is licensed under the Apache License - see the LICENSE file for details.