ElasticDL features for large scale recommendation
Good job on improving TensorFlow on Kubernetes to make developing large-scale training systems easy. :-D
After reading some tutorials, we found that ElasticDL designs a new PS architecture and distributed framework, and we would like the ElasticDL team to clarify a few more design considerations.
A large-scale recommendation system requires several features from the training system:
- efficiently handle large-scale embeddings while distributed training is enabled; this requires the parameter servers and the DL framework to support sparse SGD updates. (What are ElasticDL's features for large-scale embedding training?)
- be compatible with the DL framework API for handling large-scale embeddings, so that most models in the model zoo work well.
How about ElasticDL?
ElasticDL can handle very large models using its general-purpose parameter server written in Go, which is based on the design we presented at Google Developer Day 2019, with many performance improvements.
@QiJune I think @backyes 's question is a very inspiring hint -- we should add a benchmark showing the capability of ElasticDL in supporting large models.
@backyes Thank you for your interest!
ElasticDL supports large embedding tables and also supports sparse SGD updates.
An embedding table is sharded across several PS instances. In the forward pass, workers pull embedding vectors from the PS. In the backward pass, workers push embedding gradients (in the IndexedSlices data structure) to the PS, where the sparse gradients are applied to the embedding table.
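The sharded lookup and sparse update described above can be sketched roughly as follows. This is a minimal illustration in plain NumPy; the shard count, the modulo routing, and the `pull`/`push_gradients` names are assumptions for the sketch, not ElasticDL's actual implementation:

```python
import numpy as np

NUM_PS = 2          # number of PS instances (assumed for this sketch)
EMBED_DIM = 4       # embedding vector dimension
LEARNING_RATE = 0.1

# Each PS shard stores its slice of the embedding table as {id: vector}.
shards = [dict() for _ in range(NUM_PS)]

def shard_of(item_id):
    """Route an embedding id to a PS shard (hash partitioning, assumed scheme)."""
    return item_id % NUM_PS

def pull(item_ids):
    """Forward pass: workers pull embedding vectors from the owning shards."""
    vectors = []
    for i in item_ids:
        table = shards[shard_of(i)]
        if i not in table:          # lazy initialization on first access
            table[i] = np.zeros(EMBED_DIM)
        vectors.append(table[i])
    return np.stack(vectors)

def push_gradients(indices, grads):
    """Backward pass: apply IndexedSlices-style sparse gradients (sparse SGD).
    Only the rows listed in `indices` are touched; the rest of the
    table is never materialized or updated."""
    for i, g in zip(indices, grads):
        shards[shard_of(i)][i] -= LEARNING_RATE * g

# A worker looks up two ids, then pushes gradients for just those rows.
ids = [3, 8]
vecs = pull(ids)
push_gradients(ids, np.ones((2, EMBED_DIM)))
```

The point of the IndexedSlices representation is that a minibatch touches only a handful of rows out of a potentially billion-row table, so both network traffic and the optimizer step stay proportional to the batch, not the table.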
For more design details, please refer to parameter server and high performance PS.
For more implementation details, please refer to the Go PS code base and the RPC interface.
ElasticDL is also well compatible with the TensorFlow API.
Users program their models with tf.keras.layers.Embedding
directly; ElasticDL supports the native TensorFlow Keras API.
ElasticDL substitutes the embedding layer with elasticdl.layers.embedding before training. This is transparent to users.
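The transparent substitution can be illustrated with a toy sketch. The `Embedding`/`ElasticEmbedding` classes and the `replace_embedding_layers` helper below are hypothetical stand-ins for illustration only; the real rewrite operates on Keras models, not plain lists:

```python
class Embedding:
    """Stand-in for tf.keras.layers.Embedding (hypothetical, for illustration)."""
    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim

class ElasticEmbedding:
    """Stand-in for elasticdl.layers.embedding: same configuration, but
    lookups and updates go through the parameter server instead of a
    worker-local variable."""
    def __init__(self, input_dim, output_dim):
        self.input_dim = input_dim
        self.output_dim = output_dim

def replace_embedding_layers(layers):
    """Swap every Embedding layer for its PS-backed counterpart, keeping
    the layer configuration intact. Users never see this rewrite."""
    return [
        ElasticEmbedding(l.input_dim, l.output_dim) if isinstance(l, Embedding) else l
        for l in layers
    ]

# A user-defined model: a huge embedding table followed by other layers.
model_layers = [Embedding(input_dim=10**9, output_dim=64), "dense", "sigmoid"]
model_layers = replace_embedding_layers(model_layers)
```

Because only the embedding layer is swapped and its configuration is preserved, user code written against the native Keras API keeps working unchanged.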
@wangkuiyi Thank you for the advice. Yes, we could run an experiment with a recommendation model that has large embedding tables.