[RMP] Evaluate training performance for two-tower and DLRM models
Problem
Customers have no basis from which to choose Merlin as their recommendation framework. To convince them, we need to demonstrate that our library achieves similar or superior performance, both in accuracy and in runtime.
Blockers
- [ ] Work related to systems is not captured here
Goal:
- Create a performance comparison chart for retrieval and ranking models, with the ability to swap out the datasets passed to Merlin Models
Scope:
- Validate Merlin Models functionality with performance evaluation for YouTube DNN and DLRM
- Support model evaluation metrics such as MAP, NDCG, Precision, etc. (example: MSFT Recommenders); see the metric sketch below
- Evaluate training/inference performance: time to train, serving latency, throughput, etc. (example: JoC PyT DLRM)
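As a reference point for the metrics above, here is a minimal, framework-agnostic NumPy sketch of Recall@K and NDCG@K. It only illustrates the calculations we would expect Merlin Models (or any baseline) to reproduce; the function names and array shapes are illustrative assumptions, not an existing API.

```python
import numpy as np

def recall_at_k(relevance: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Mean fraction of each user's relevant items retrieved in the top-k.

    relevance: binary matrix (n_users, n_items); scores: model predictions.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(relevance, top_k, axis=1).sum(axis=1)
    n_relevant = np.maximum(relevance.sum(axis=1), 1)  # avoid divide-by-zero
    return float(np.mean(hits / n_relevant))

def ndcg_at_k(relevance: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Mean normalized discounted cumulative gain at cutoff k."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    gains = np.take_along_axis(relevance, top_k, axis=1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # log-discount for ranks 1..k
    dcg = (gains * discounts).sum(axis=1)
    ideal = -np.sort(-relevance, axis=1)[:, :k]      # best possible ordering
    idcg = np.maximum((ideal * discounts).sum(axis=1), 1e-12)
    return float(np.mean(dcg / idcg))
```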
Constraints:
- Comparing across frameworks is challenging because they use different ranking metric calculations. #450 has been proposed to address this in the future. In the meantime, we will evaluate in TF only.
Proposed starting point
Retrieval models
- [ ] Evaluate training performance of a two-tower model: time to train, Recall@K (see the sketch after this list)
- [ ] Comparison to the two-tower model in TensorFlow Recommenders
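A rough sketch of how the two-tower measurement could look with `merlin.models.tf`, assuming the `TwoTowerModel` / `MLPBlock` / top-k metric API shown in the public Merlin Models examples (exact class names and signatures may differ between releases). The parquet paths, batch size, and epoch count are placeholders.

```python
import time
import merlin.models.tf as mm
from merlin.io import Dataset

# Placeholder paths; any Merlin/NVTabular-processed parquet dataset works here.
train = Dataset("train/*.parquet")
valid = Dataset("valid/*.parquet")

# Two-tower retrieval model built from the dataset schema.
model = mm.TwoTowerModel(train.schema, query_tower=mm.MLPBlock([128, 64]))
model.compile(
    optimizer="adam",
    metrics=[mm.RecallAt(10), mm.NDCGAt(10)],
)

# Time-to-train: wall-clock time for a fixed number of epochs.
start = time.perf_counter()
model.fit(train, batch_size=4096, epochs=3)
train_time = time.perf_counter() - start

eval_metrics = model.evaluate(valid, batch_size=4096, return_dict=True)
print(f"time to train: {train_time:.1f}s", eval_metrics)
```

The same script, with only the dataset paths swapped, should give us the per-dataset comparison points for the chart in the Goal section.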
Ranking models
- [ ] Evaluate training performance of a DLRM model: time to train, NDCG@K or MAP@K (see the sketch after this list)
- [ ] Comparison to the JoC DLRM model
- [ ] Comparison to the TensorFlow Recommenders DLRM model
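A similar sketch for the ranking side, assuming the `DLRMModel` / `BinaryClassificationTask` API from the Merlin Models examples; the `"click"` target column, block sizes, and paths are placeholders, and ranking-quality metrics (NDCG@K / MAP@K) could be swapped in for AUC once the cross-framework metric question (#450) is settled.

```python
import time
import tensorflow as tf
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train/*.parquet")   # placeholder paths
valid = Dataset("valid/*.parquet")

# DLRM ranking model; "click" is a placeholder binary target column.
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
# AUC is the metric typically reported for CTR-style DLRM baselines (e.g. JoC).
model.compile(optimizer="adam", metrics=[tf.keras.metrics.AUC()])

start = time.perf_counter()
model.fit(train, batch_size=16384, epochs=1)
print(f"time to train: {time.perf_counter() - start:.1f}s")
print(model.evaluate(valid, batch_size=16384, return_dict=True))
```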
Profiling
- [ ] Profile training of MM ranking and retrieval models with Nsight / the TensorFlow profiler (as sketched below)
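For the TensorFlow profiler path, a Keras TensorBoard callback with `profile_batch` should be enough to capture a trace during `model.fit`; for Nsight, the same training script can simply be launched under `nsys profile`. A minimal sketch (log directory, batch range, and script name are placeholders):

```python
import tensorflow as tf

# Capture a TF profiler trace for batches 10-20 of the first epoch;
# the trace can then be inspected in TensorBoard's Profile tab.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./tb_logs",        # placeholder log directory
    profile_batch=(10, 20),
)

# model.fit(train, batch_size=4096, epochs=1, callbacks=[tb_callback])

# Nsight Systems alternative: run the unmodified training script under nsys, e.g.
#   nsys profile -o two_tower_profile python train_two_tower.py
```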
### Virtual review
- [ ] Virtual review of the models lib by TME
We need to re-evaluate the priority of this development.
Closing as a duplicate of #553