NVIDIA RecSys Examples

Overview

NVIDIA RecSys Examples is a collection of optimized recommender models and components.

The project includes:

- Examples for large-scale HSTU ranking and retrieval models through TorchRec and Megatron-Core integration
- HSTU (Hierarchical Sequential Transduction Unit) attention operator support
- Dynamic Embeddings with GPU acceleration

What's New

[2025/12/10] 🎉v25.11 released!
- DynamicEmb supports embedding admission, which decides whether a new feature ID is allowed to create or update an embedding entry in the dynamic embedding table. By controlling admission, the system can prevent very rare or noisy IDs from consuming parameters and optimizer state that bring little training benefit.
[2025/11/11] 🎉v25.10 released!
- HSTU training example supports sequence parallelism.
- DynamicEmb supports LRU score checkpointing and gradient clipping.
- Decoupled the scaling sequence length from the maximum sequence length limit in HSTU attention, and extended HSTU support to the SM89 GPU architecture for training.
[2025/10/20] 🎉v25.09 released!
- Integrated prefetching and caching into the HSTU training example.
- DynamicEmb now supports distributed embedding dumping and memory scaling.
- Added kernel fusion in the HSTU block for inference, including KVCache fixes.
- HSTU attention now supports FP8 quantization.

[2025/9/8] 🎉v25.08 released!
- Added cache support for dynamicemb, enabling seamless hot embedding migration between cache and storage.
- Released an end-to-end HSTU inference example, demonstrating precision aligned with training.
- Enabled evaluation mode support for dynamicemb.
[2025/8/1] 🎉v25.07 released!
- Released an HSTU inference benchmark, including a paged KV cache HSTU kernel, a KV cache manager based on TensorRT-LLM, CUDA graph support, and other optimizations.
- Added support for Tensor Parallelism in the HSTU layer.
[2025/7/4] 🎉v25.06 released!
- Improved dynamicemb lookup module performance and added LFU eviction support.
- Added pipeline support for the HSTU example, recompute support for the HSTU layer, and custom CUDA ops for jagged tensor concatenation.
[2025/5/29] 🎉v25.05 released!
- Enhanced dynamicemb with support for EmbeddingBagCollection, truncated normal initialization, and initial_accumulator_value for Adagrad.
- Fused operations such as layernorm and dropout in the HSTU layer, giving about a 1.2x end-to-end speedup.
- Fixed convergence issues on the KuaiRand dataset.

For more detailed release notes, please refer to our [releases](https://github.com/NVIDIA/recsys-examples/releases).
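
To make the embedding admission idea from the v25.11 note concrete, here is a minimal pure-Python sketch of one possible policy: admit a feature ID only after it has been seen a minimum number of times. The class name `FrequencyAdmission`, the `threshold` parameter, and the dictionary-backed table are all hypothetical stand-ins for illustration. This is not the DynamicEmb API, and the actual admission strategies in the library may differ.

```python
from collections import Counter


class FrequencyAdmission:
    """Hypothetical policy: admit an ID once it has been seen `threshold` times."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.counts = Counter()  # sightings of IDs that are not yet admitted
        self.table = {}          # admitted ID -> embedding (stand-in for the dynamic table)

    def lookup(self, feature_id: int, dim: int = 4):
        # Already admitted: return the stored embedding.
        if feature_id in self.table:
            return self.table[feature_id]
        # Not admitted yet: record the sighting and check the threshold.
        self.counts[feature_id] += 1
        if self.counts[feature_id] >= self.threshold:
            del self.counts[feature_id]
            self.table[feature_id] = [0.0] * dim  # create the entry on admission
            return self.table[feature_id]
        # Rejected for now: the caller would fall back to a default embedding.
        return None


policy = FrequencyAdmission(threshold=2)
results = [policy.lookup(i) is not None for i in [7, 7, 9, 7]]
print(results)
```

In this sketch, ID 7 is admitted on its second sighting, while ID 9 is seen only once and never allocates an entry, so it consumes no parameters or optimizer state.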

Get Started

Supported examples:

HSTU recommender examples

Contribution Guidelines

Please see our contributing guidelines for details on how to contribute to this project.

Resources

Video

Blog

NVIDIA recsys-examples: Efficient Practices for Large-Scale Training and Inference of Generative Recommender Systems (Part 1)

Community

Join our community channels to ask questions, provide feedback, and interact with other users and developers:

- GitHub Issues: For bug reports and feature requests
- NVIDIA Developer Forums

References

If you use RecSys Examples in your research, please cite:

@Manual{,
  title = {RecSys Examples: A collection of recommender system implementations},
  author = {NVIDIA Corporation},
  year = {2024},
  url = {https://github.com/NVIDIA/recsys-examples},
}

For more citation information and referenced papers, see CITATION.md.

License

This project is licensed under the Apache License - see the LICENSE file for details.
