
[RMP] Performant large embedding table support

Open EvenOldridge opened this issue 3 years ago • 2 comments

Model Parallel Support

  • [ ] ~Evaluation of HugeCTR, TorchRec, Distributed Embeddings, TFRA, PersiaML for inclusion in Merlin~
  • [ ] Distributed embedding table support in merlin-models (SOK Plugin, Distributed Embeddings)
  • [ ] Model Parallel Training
  • [ ] Third Gen Embeddings
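To make the model-parallel items above concrete, here is a minimal sketch of row-wise embedding sharding, the core idea behind SOK / TorchRec / Distributed Embeddings. All names, the modulo sharding scheme, and the single-process simulation are illustrative assumptions, not any library's actual API:

```python
import numpy as np

# Hypothetical sketch of row-wise (model-parallel) embedding sharding:
# each simulated "device" owns the rows whose id % NUM_SHARDS equals its rank,
# so no single device has to hold the full table in memory.

NUM_SHARDS = 4
VOCAB_SIZE = 1_000_000
DIM = 16

rng = np.random.default_rng(0)
# Each shard holds only its slice of the full table.
shards = [
    rng.normal(size=(VOCAB_SIZE // NUM_SHARDS + 1, DIM)).astype(np.float32)
    for _ in range(NUM_SHARDS)
]

def lookup(ids: np.ndarray) -> np.ndarray:
    """Gather rows from whichever shard owns each id."""
    out = np.empty((len(ids), DIM), dtype=np.float32)
    for i, idx in enumerate(ids):
        shard = idx % NUM_SHARDS   # owning "device"
        row = idx // NUM_SHARDS    # local row within that shard
        out[i] = shards[shard][row]
    return out

emb = lookup(np.array([0, 1, 999_999]))
print(emb.shape)  # (3, 16)
```

In a real deployment the per-shard gathers happen on separate GPUs with an all-to-all exchange; the sketch only shows the ownership math.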

~Feature engineering that reduces embedding size~

  • [ ] Mixed Dimension Embeddings
  • [x] Frequency Capping
  • [x] Frequency Hashing
  • [ ] Bloom Embeddings
  • [ ] TT-Rec
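As a rough illustration of the frequency capping/hashing items (the two already checked off), here is a sketch of the general technique; `MIN_COUNT`, `NUM_RARE_BUCKETS`, and the helper names are assumptions for illustration, not the Merlin/NVTabular implementation:

```python
from collections import Counter

# Illustrative sketch of frequency capping + hashing:
# ids seen fewer than MIN_COUNT times don't get a dedicated embedding row;
# they are hashed into a small set of shared "rare" buckets instead,
# which shrinks the embedding table from |unique ids| to
# |frequent ids| + NUM_RARE_BUCKETS rows.

MIN_COUNT = 2
NUM_RARE_BUCKETS = 10

train_ids = ["a", "b", "a", "c", "a", "b", "d"]
counts = Counter(train_ids)

# Frequent ids get dedicated indices; rare ids share hashed buckets.
frequent = sorted(i for i, c in counts.items() if c >= MIN_COUNT)
index = {raw: n for n, raw in enumerate(frequent)}
num_frequent = len(index)

def encode(raw_id: str) -> int:
    if raw_id in index:
        return index[raw_id]
    # Hash rare/unseen ids into the shared bucket range.
    return num_frequent + hash(raw_id) % NUM_RARE_BUCKETS

vocab_size = num_frequent + NUM_RARE_BUCKETS
print(vocab_size, encode("a"), encode("zzz") >= num_frequent)
```

The same `encode` path handles out-of-vocabulary ids at inference time, since unseen ids also fall into the shared buckets.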

~Reduced Precision Support~

  • [ ] Sparse Row-wise Optimizers (Facebook Research DLRM)
  • [ ] Reduced Precision Optimizers
  • [ ] Reduced Embedding Precision
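A minimal sketch of the reduced-embedding-precision idea, assuming the common store-in-half, compute-in-full pattern (not the Merlin API): the table lives in float16 at half the memory cost, and lookups upcast to float32 so downstream math keeps full-precision accumulation.

```python
import numpy as np

# Hedged sketch of reduced-precision embedding storage:
# float16 table halves the memory footprint vs. float32;
# the gather upcasts so gradients/activations stay in float32.

VOCAB, DIM = 100_000, 32
rng = np.random.default_rng(42)
table_fp16 = rng.normal(scale=0.05, size=(VOCAB, DIM)).astype(np.float16)

def lookup(ids):
    return table_fp16[ids].astype(np.float32)  # upcast at read time

vecs = lookup(np.array([0, 7, 42]))
print(table_fp16.nbytes // 2**20, "MiB stored;", vecs.dtype)
```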

Not storing user embeddings

  • [ ] Represent user as item embedding aggregations (YouTube DNN)
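The YouTube DNN idea above can be sketched in a few lines (names and mean pooling are illustrative assumptions; the paper also concatenates other features before the DNN): instead of storing a trained embedding per user, the user representation is computed on the fly from the item embeddings in their interaction history.

```python
import numpy as np

# Sketch: represent a user as the mean of the item embeddings from
# their recent interaction history, so no per-user table is stored.

DIM = 8
rng = np.random.default_rng(1)
item_table = rng.normal(size=(1000, DIM)).astype(np.float32)

def user_vector(history_item_ids):
    """Average the item embeddings in the user's watch/click history."""
    return item_table[history_item_ids].mean(axis=0)

u = user_vector([3, 17, 256])
print(u.shape)  # (8,)
```

This trades a |users| x DIM table for a cheap aggregation at request time, which is why it appears under "Not storing user embeddings."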

Inference Support

  • [ ] Hierarchical Parameter Server Support

EvenOldridge avatar May 05 '22 23:05 EvenOldridge

This looks good! Two questions:

  • How does this relate to https://github.com/NVIDIA-Merlin/models/pull/282 (currently slated to be completed at the end of https://github.com/NVIDIA-Merlin/Merlin/issues/271)? Asking because @marcromeyn said the input block refactor "would also enable kickstarting the work of integrating model-parallelism for large-embedding tables (for instance through the HugeCTR SOK.)" Wondering to what extent the input block changes depend on the rest of the Models API changes, and if we can pull the input block work forward somehow to unblock whichever parts of model parallel support depend on it.
  • Are there further methods for not storing user embeddings planned here? If the aggregating item embeddings is the main/only one, we might want to capture that in https://github.com/NVIDIA-Merlin/Merlin/issues/279 instead. This looks like a ton of useful stuff that we haven't really captured anywhere before, but that one piece we can probably tackle as part of the YouTube DNN work.

karlhigley avatar May 06 '22 00:05 karlhigley

As success criteria, we need benchmarks for each of the points above:

  • How does throughput change? (E.g. TF Keras vs. SOK vs. TFDE vs. reduced-precision optimizer vs. reduced-precision embedding)
  • What is the AUC/performance of the model? (E.g. TF Keras vs. SOK vs. TFDE vs. reduced-precision optimizer vs. reduced-precision embedding)

Customers ask us these questions, and we need to answer them if we provide the functionality. Only if we run the experiments can we ensure that the implementation is correct.
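A minimal throughput-benchmark harness for the first comparison might look like the sketch below (sizes and the plain numpy gather are placeholders; a real run would swap in the Keras / SOK / TFDE lookup under test and measure on-device):

```python
import time
import numpy as np

# Minimal throughput-measurement sketch: time repeated embedding lookups
# and report lookups per second. The numpy gather stands in for the
# implementation being benchmarked.

VOCAB, DIM, BATCH, STEPS = 1_000_000, 64, 4096, 50
rng = np.random.default_rng(0)
table = rng.normal(size=(VOCAB, DIM)).astype(np.float32)

start = time.perf_counter()
for _ in range(STEPS):
    ids = rng.integers(0, VOCAB, size=BATCH)
    _ = table[ids]                    # the lookup under test
elapsed = time.perf_counter() - start

throughput = STEPS * BATCH / elapsed  # lookups per second
print(f"{throughput:,.0f} lookups/s")
```

The AUC comparison would reuse the same harness but train each variant to convergence on a fixed dataset and evaluate with an identical metric pipeline.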

bschifferer avatar Oct 10 '22 07:10 bschifferer