[RMP] Evaluate training performance for two-tower and DLRM models
Problem
Customers have no basis from which to choose Merlin as their recommendation framework. To convince them, we need to demonstrate that our library achieves similar or superior performance, both in accuracy and in runtime.
Blockers
- [ ] Work related to systems is not captured here
Goal:
- Create a performance comparison chart for retrieval and ranking models, with the ability to swap out the datasets passed to Merlin Models
Scope:
- Validate Merlin Models functionality with performance evaluation for YouTube DNN and DLRM
- Support model evaluation metrics such as MAP, NDCG, Precision, etc. (example: MSFT Recommenders); see the metric sketch below
- Evaluate training/inference performance: time to train, serving latency, throughput, etc. (example: JoC PyT DLRM)
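As a reference point for the metrics above, here is a minimal, framework-agnostic NumPy sketch of Recall@K and NDCG@K. It only illustrates the calculations we would expect Merlin Models (or any baseline) to reproduce; the function names and array shapes are illustrative assumptions, not an existing API.

```python
import numpy as np

def recall_at_k(relevance: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Mean fraction of each user's relevant items retrieved in the top-k.

    relevance: binary matrix (n_users, n_items); scores: model predictions.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(relevance, top_k, axis=1).sum(axis=1)
    n_relevant = np.maximum(relevance.sum(axis=1), 1)  # avoid divide-by-zero
    return float(np.mean(hits / n_relevant))

def ndcg_at_k(relevance: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Mean normalized discounted cumulative gain at cutoff k."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    gains = np.take_along_axis(relevance, top_k, axis=1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # log-discount for ranks 1..k
    dcg = (gains * discounts).sum(axis=1)
    ideal = -np.sort(-relevance, axis=1)[:, :k]      # best possible ordering
    idcg = np.maximum((ideal * discounts).sum(axis=1), 1e-12)
    return float(np.mean(dcg / idcg))
```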
Constraints:
- Comparing across frameworks is challenging because they use different ranking metric calculations. #450 has been proposed to address this in the future. In the meantime, we will evaluate in TF only.
Proposed starting point
Retrieval models
- [ ] Evaluate training performance of a two-tower model: time to train, Recall@K (see the sketch after this list)
- [ ] Comparison to the two-tower model in TensorFlow Recommenders
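A rough sketch of how the two-tower measurement could look with `merlin.models.tf`, assuming the `TwoTowerModel` / `MLPBlock` / top-k metric API shown in the public Merlin Models examples (exact class names and signatures may differ between releases). The parquet paths, batch size, and epoch count are placeholders.

```python
import time
import merlin.models.tf as mm
from merlin.io import Dataset

# Placeholder paths; any Merlin/NVTabular-processed parquet dataset works here.
train = Dataset("train/*.parquet")
valid = Dataset("valid/*.parquet")

# Two-tower retrieval model built from the dataset schema.
model = mm.TwoTowerModel(train.schema, query_tower=mm.MLPBlock([128, 64]))
model.compile(
    optimizer="adam",
    metrics=[mm.RecallAt(10), mm.NDCGAt(10)],
)

# Time-to-train: wall-clock time for a fixed number of epochs.
start = time.perf_counter()
model.fit(train, batch_size=4096, epochs=3)
train_time = time.perf_counter() - start

eval_metrics = model.evaluate(valid, batch_size=4096, return_dict=True)
print(f"time to train: {train_time:.1f}s", eval_metrics)
```

The same script, with only the dataset paths swapped, should give us the per-dataset comparison points for the chart in the Goal section.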
Ranking models
- [ ] Evaluate training performance of a DLRM model: time to train, NDCG@K or MAP@K (see the sketch after this list)
- [ ] Comparison to the JoC DLRM model
- [ ] Comparison to the TensorFlow Recommenders DLRM model
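A similar sketch for the ranking side, assuming the `DLRMModel` / `BinaryClassificationTask` API from the Merlin Models examples; the `"click"` target column, block sizes, and paths are placeholders, and ranking-quality metrics (NDCG@K / MAP@K) could be swapped in for AUC once the cross-framework metric question (#450) is settled.

```python
import time
import tensorflow as tf
import merlin.models.tf as mm
from merlin.io import Dataset

train = Dataset("train/*.parquet")   # placeholder paths
valid = Dataset("valid/*.parquet")

# DLRM ranking model; "click" is a placeholder binary target column.
model = mm.DLRMModel(
    train.schema,
    embedding_dim=64,
    bottom_block=mm.MLPBlock([128, 64]),
    top_block=mm.MLPBlock([128, 64, 32]),
    prediction_tasks=mm.BinaryClassificationTask("click"),
)
# AUC is the metric typically reported for CTR-style DLRM baselines (e.g. JoC).
model.compile(optimizer="adam", metrics=[tf.keras.metrics.AUC()])

start = time.perf_counter()
model.fit(train, batch_size=16384, epochs=1)
print(f"time to train: {time.perf_counter() - start:.1f}s")
print(model.evaluate(valid, batch_size=16384, return_dict=True))
```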
Profiling
- [ ] Profile training of MM ranking and retrieval models with Nsight / the TensorFlow profiler (as sketched below)
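For the TensorFlow profiler path, a Keras TensorBoard callback with `profile_batch` should be enough to capture a trace during `model.fit`; for Nsight, the same training script can simply be launched under `nsys profile`. A minimal sketch (log directory, batch range, and script name are placeholders):

```python
import tensorflow as tf

# Capture a TF profiler trace for batches 10-20 of the first epoch;
# the trace can then be inspected in TensorBoard's Profile tab.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./tb_logs",        # placeholder log directory
    profile_batch=(10, 20),
)

# model.fit(train, batch_size=4096, epochs=1, callbacks=[tb_callback])

# Nsight Systems alternative: run the unmodified training script under nsys, e.g.
#   nsys profile -o two_tower_profile python train_two_tower.py
```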
### Virtual review
- [ ] Virtual review of the models lib by TME
We need to re-evaluate the priority of this development.
Closing as a duplicate of #553