elasticdl

Rewrite the master in Go after separating the parameter server from the master

Open wangkuiyi opened this issue 5 years ago • 0 comments

Currently, the master is written in Python because it also works as the parameter server, which calls TensorFlow's Python API to aggregate gradients and update the global model.

We plan to separate the parameter server from the master program after the MVP (minimum viable product) phase.

After the separation, we will no longer need to write the master program in Python and can use Go instead. As a gRPC server, the master is a typical concurrent program, and Go's goroutines and channels make concurrent programs much easier to write.

For example, during training, the master scans the index of the input RecordIO files and fills the TODO task queue. This is a classic producer-consumer model. Because the queue has a limited size, the master cannot enqueue all tasks up front without overflowing it, especially when training runs for many epochs. Instead, a separate thread should fill the queue while the gRPC call GetTask retrieves tasks from it. Python's multithreading is problematic here; Go handles this pattern well with goroutines and channels.

For this example, the following producer can generate tasks for as many epochs as we want, given a queue of limited size.

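// produceRecordIOTasks sends the tasks of every epoch to queue, then closes it.
// A send blocks while the queue is full, so the producer cannot overflow it.
func produceRecordIOTasks(idx recordio.Index, epoch int, queue chan<- Task) {
    for i := 0; i < epoch; i++ {
        // idx.Tasks() is a placeholder for scanning the index;
        // the real recordio API may differ.
        for _, task := range idx.Tasks() {
            queue <- task
        }
    }
    close(queue) // signals consumers that no more tasks will arrive
}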

The master can start a goroutine that executes the producer:

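func masterMain(recordIOFiles []string, epoch int) {
    idx := recordio.OpenMultipleFiles(recordIOFiles)
    // Unbuffered: each send waits until a GetTask call receives it.
    // Use make(chan Task, n) to let the producer run up to n tasks ahead.
    queue := make(chan Task)
    go produceRecordIOTasks(idx, epoch, queue)
    // start the gRPC service and wait
}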
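To make the consumer side concrete, here is a minimal, self-contained sketch of the whole producer-consumer loop. It is an illustration under assumptions, not the actual master. The `Task` fields, the in-memory task slice that stands in for the RecordIO index, and the `getTask` and `runWorkers` helpers are all hypothetical. Plain goroutines play the role of workers calling the GetTask gRPC method.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical stand-in for the master's task type:
// a range of records in one RecordIO file.
type Task struct {
	File       string
	Start, End int
}

// produceTasks sends every task once per epoch, then closes the queue.
// A send blocks while the queue is full, so memory use stays bounded
// no matter how many epochs are requested.
func produceTasks(tasks []Task, epochs int, queue chan<- Task) {
	for i := 0; i < epochs; i++ {
		for _, t := range tasks {
			queue <- t
		}
	}
	close(queue)
}

// getTask plays the role of the GetTask gRPC handler. ok is false once
// the producer has closed the queue and every task has been handed out.
func getTask(queue <-chan Task) (task Task, ok bool) {
	task, ok = <-queue
	return task, ok
}

// runWorkers starts n goroutines that drain the queue concurrently and
// returns the total number of tasks they received.
func runWorkers(n int, queue <-chan Task) int {
	var (
		mu    sync.Mutex
		total int
		wg    sync.WaitGroup
	)
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, ok := getTask(queue); !ok {
					return
				}
				mu.Lock()
				total++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	tasks := []Task{{"a.recordio", 0, 100}, {"a.recordio", 100, 200}, {"b.recordio", 0, 50}}
	queue := make(chan Task, 2) // capacity is smaller than the number of tasks
	go produceTasks(tasks, 4, queue)
	fmt.Println("tasks dispatched:", runWorkers(3, queue))
}
```

With 3 tasks and 4 epochs, the program reports 12 dispatched tasks even though the queue holds at most 2 at a time. The channel provides both the bounded buffer and the synchronization, so no explicit lock around the queue is needed.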

wangkuiyi avatar Jun 29 '19 15:06 wangkuiyi