oneCCL

Make it possible to start the KVS store as an independent service instead of keeping it with one of the ranks

Open umamaheswararao opened this issue 5 years ago • 0 comments

In Spark-like deployments, the driver is a single point of failure but the workers are not. Keeping the KVS store with one of the workers turns that worker process into a single point of failure as well.

If the KVS can be started as a standalone process, integration into Spark-like deployments becomes easy. The driver can start the KVS store and pass its IP_Port to all workers. Rabit has a similar architecture: the tracker (like the KVS store here) starts with the driver, and all workers connect to the tracker.
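To make the proposal concrete, here is a minimal Python sketch of the flow described above. It is not oneCCL's API or its actual KVS implementation: `start_kvs`, `KVSClient`, and the JSON-per-line protocol are hypothetical names and choices made only for illustration, and the store runs in a background thread of the driver rather than as a separate process.

```python
import json
import socket
import socketserver
import threading


class KVSHandler(socketserver.StreamRequestHandler):
    """Serves one worker connection; one JSON request per line."""

    def handle(self):
        store, cond = self.server.store, self.server.cond
        for line in self.rfile:
            req = json.loads(line)
            if req["op"] == "set":
                with cond:
                    store[req["key"]] = req["value"]
                    cond.notify_all()  # wake workers blocked in "get"
                reply = {"ok": True}
            else:  # "get": block until some worker has published the key
                with cond:
                    cond.wait_for(lambda: req["key"] in store)
                    reply = {"ok": True, "value": store[req["key"]]}
            self.wfile.write((json.dumps(reply) + "\n").encode())


class KVSServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, addr):
        super().__init__(addr, KVSHandler)
        self.store = {}
        self.cond = threading.Condition()


def start_kvs(host="127.0.0.1", port=0):
    """Driver side (hypothetical helper): start the store, return (server, "ip_port")."""
    server = KVSServer((host, port))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    ip, bound_port = server.server_address
    return server, f"{ip}_{bound_port}"


class KVSClient:
    """Worker side (hypothetical helper): connect using the string handed out by the driver."""

    def __init__(self, ip_port):
        ip, port = ip_port.rsplit("_", 1)
        self.sock = socket.create_connection((ip, int(port)))
        self.file = self.sock.makefile("rw")

    def _call(self, req):
        self.file.write(json.dumps(req) + "\n")
        self.file.flush()
        return json.loads(self.file.readline())

    def set(self, key, value):
        self._call({"op": "set", "key": key, "value": value})

    def get(self, key):
        return self._call({"op": "get", "key": key})["value"]

    def close(self):
        self.file.close()
        self.sock.close()
```

In a Spark-like deployment, the driver would call something like `start_kvs()` once and ship the returned `ip_port` string to every worker, so no worker has to host the store itself.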

umamaheswararao · Feb 21 '20 22:02