DeepRec supports multiple evaluators
Background
At present, DeepRec cannot evaluate very large models on a single node: multiple parameter servers (ps) are required to load such models, and multiple workers are needed for distributed evaluation. Adding this capability would extend DeepRec to more scenarios.
Implementation ideas
Unlike training, evaluating a model does not involve modifying the network structure to improve accuracy; instead, the goal is to increase evaluation throughput and reduce evaluation latency. DeepRec already supports distributed training, and evaluation is actually simpler than training because no updates to ps are involved. In the code, DeepRec first decides, based on its parameters, whether to initialize the cluster and how to initialize it.
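To make the cluster-initialization step concrete, the sketch below builds a TF_CONFIG-style cluster spec that lists several evaluator tasks alongside the ps and worker tasks. The function name, port layout, and the idea of registering evaluators as a first-class task type are illustrative assumptions, not DeepRec's actual API.

```python
import json

def make_cluster_spec(num_ps, num_workers, num_evaluators, base_port=2222):
    """Build a TF_CONFIG-style cluster spec (hypothetical layout) in which
    multiple evaluator tasks are listed next to ps and worker tasks."""
    port = base_port

    def hosts(n):
        # Allocate n consecutive localhost ports for one task type.
        nonlocal port
        hs = ["localhost:%d" % (port + i) for i in range(n)]
        port += n
        return hs

    return {
        "ps": hosts(num_ps),
        "worker": hosts(num_workers),
        # Mode 1 extends the single-evaluator case to several evaluators
        # simply by listing more addresses under the evaluator role.
        "evaluator": hosts(num_evaluators),
    }

spec = make_cluster_spec(num_ps=2, num_workers=2, num_evaluators=3)
print(json.dumps(spec, indent=2))
```

Each process would then read its own role and task index from this spec to decide whether it serves variables (ps) or runs evaluation (evaluator).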
Two modes of distributed multi-evaluator evaluation need to be implemented:
1. Mode 1 contains ps, worker, and evaluator nodes. DeepRec already implements the single-evaluator case in this mode; we need to extend it to multiple evaluators. One idea is to add the additional evaluators directly to the initialization list of the distributed cluster in DeepRec; another is to use the tf.distribute.Strategy interface.
2. Mode 2 has only ps and evaluator nodes. The difference from Mode 1 is that no training takes place: an already-trained offline model is loaded into ps and evaluated directly.
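In both modes, each evaluator should score a disjoint shard of the evaluation data and the partial metrics should then be merged. The round-robin sharding policy and the metric-merging helper below are an assumed illustration of that idea, not DeepRec's actual mechanism.

```python
def shard_for_evaluator(dataset_size, num_evaluators, evaluator_index):
    """Round-robin shard: evaluator i takes records i, i+n, i+2n, ...
    so the shards are disjoint and together cover the whole dataset."""
    return list(range(evaluator_index, dataset_size, num_evaluators))

def merge_accuracy(partials):
    """Merge per-evaluator (num_correct, num_examples) pairs into a
    single accuracy; counts are summed before dividing, so the merge
    is exact regardless of shard sizes."""
    correct = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return correct / total if total else 0.0

# Example: 10 records split across 3 evaluators.
shards = [shard_for_evaluator(10, 3, i) for i in range(3)]
print(shards)
# Example: merging partial results (3/5 and 4/5 correct).
print(merge_accuracy([(3, 5), (4, 5)]))
```

Summing raw counts rather than averaging per-shard accuracies keeps the merged metric correct even when shards end up with different sizes, which matters because the last shard of a round-robin split can be smaller.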