Implement general Parameter Server
A parameter server is a framework for asynchronously sharing parameters among machine learning workers to achieve higher scalability. Hivemall currently has a standalone server implementation, named the MIX server, that asynchronously averages parameters among workers; it is for internal use only. To make the MIX server more general, we are planning to implement parameter server functionalities (e.g., cluster manager support, optimizers to calculate deltas from gradients to update parameters, RPC protocols that third-party libraries can use, and so on) on top of this implementation.
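To illustrate the averaging behavior described above, here is a minimal sketch of a server that keeps a running average per parameter key, which workers update concurrently. The class and method names here are hypothetical, not Hivemall's actual MIX server API:

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of MIX-style asynchronous parameter averaging.
// All names are hypothetical, not Hivemall's actual API.
public class AveragingServer {
    // Running average and update count per parameter key.
    private static final class Slot {
        double avg;
        long n;
    }

    private final ConcurrentHashMap<Object, Slot> slots = new ConcurrentHashMap<>();

    // A worker pushes its local value for a key; the server folds it
    // into the running average: avg' = avg + (value - avg) / (n + 1).
    public void push(Object key, double value) {
        Slot slot = slots.computeIfAbsent(key, k -> new Slot());
        synchronized (slot) {
            slot.n++;
            slot.avg += (value - slot.avg) / slot.n;
        }
    }

    // A worker pulls the current averaged value (0.0 if the key is unseen).
    public double pull(Object key) {
        Slot slot = slots.get(key);
        if (slot == null) {
            return 0.0;
        }
        synchronized (slot) {
            return slot.avg;
        }
    }
}
```

Because each worker pushes whenever it finishes a local step, workers never block on each other; the per-key lock only serializes updates to the same parameter.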
We have started some work as a first step:
- Support cluster managers. The MIX server implementation currently supports a standalone mode, and we can start a cluster of MIX servers through the start-up script. For easier operability, it would be good to deploy MIX servers via cluster managers, e.g., Apache Hadoop YARN and Apache Mesos. We are working on a YARN integration in #236, #246, and the topic branch.
- Incorporate optimizer functionalities into the MIX server. Hivemall has optimizer functionality in the core package, so we'll separate it out in #285 and then import it in core and mixserv.
- Define RPC protocols for general use. There is some existing work (e.g., #147), but this interoperability issue is still open.
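To clarify the optimizer item above, the optimizer's job on the server side is to turn an incoming gradient into a parameter delta. A minimal sketch follows; the interface and class names are hypothetical, not the actual Hivemall core API:

```java
// Sketch of the optimizer role described above: computing a parameter
// delta from a gradient. Names are hypothetical, not Hivemall's API.
public interface DeltaOptimizer {
    // Returns the delta to ADD to the current weight.
    double computeDelta(double gradient);
}

// Plain SGD: delta = -eta * gradient.
class Sgd implements DeltaOptimizer {
    private final double eta;

    Sgd(double eta) {
        this.eta = eta;
    }

    @Override
    public double computeDelta(double gradient) {
        return -eta * gradient;
    }
}

// AdaGrad: per-parameter learning rate scaled by accumulated
// squared gradients, delta = -eta * g / (sqrt(sum g^2) + eps).
class AdaGrad implements DeltaOptimizer {
    private final double eta;
    private final double eps;
    private double sumSqGrad; // accumulated squared gradients

    AdaGrad(double eta, double eps) {
        this.eta = eta;
        this.eps = eps;
    }

    @Override
    public double computeDelta(double gradient) {
        sumSqGrad += gradient * gradient;
        return -eta * gradient / (Math.sqrt(sumSqGrad) + eps);
    }
}
```

Separating this interface from the MIX server is what would let the server apply any optimizer to pushed gradients instead of only averaging parameters.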
This ticket tracks related activities for parameter servers; please feel free to leave comments and advice here.
There is a ticket, SPARK-6932, for parameter servers in the Spark JIRA. Although it is already closed, it contains many valuable discussions and materials.
There are some existing OSS parameter servers:
- dmlc parameter server (MXNet uses this implementation)
- petuum
- TensorFlow internal parameter server
Other OSS implementations?