
Multi-GPU utilization

Open DeadpanZiao opened this issue 4 years ago • 4 comments

I am looking for a way to correctly apply multi-GPU training and inference.

I am currently using multiple GPUs to run inference on a large volume, with each GPU processing a separate piece. The labels generated by the different GPUs turn out to be independent, and it is still unclear to me how to combine them into a single large segmentation.

Really appreciate any help or insights.

DeadpanZiao avatar Oct 15 '21 02:10 DeadpanZiao

To utilize multiple GPUs during inference, generating independent segmentation for partially overlapping subvolumes is the typical thing to do. These need to be reconciled and assembled into a global segmentation. One generates ID equivalences by looking at the subvolume overlap area, computes the connected components of the resulting graph (e.g. using a union-find data structure), and writes the individual subvolumes into some volumetric storage system (e.g. using TensorStore) while relabelling the segments according to the CCs.
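A minimal sketch of the equivalence/union-find step described above, assuming two subvolumes whose segmentations are compared over their shared overlap region (the `UnionFind` class and `equivalences_from_overlap` helper are illustrative, not part of the ffn repo):

```python
import numpy as np

class UnionFind:
    """Minimal union-find for segment ID equivalences."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def equivalences_from_overlap(seg_a, seg_b, uf):
    """Record ID pairs that co-occur at the same voxel in the overlap.

    seg_a, seg_b: same-shape label arrays covering the overlap region
    of two subvolumes; 0 is background and IDs are assumed globally unique.
    """
    mask = (seg_a != 0) & (seg_b != 0)
    for ida, idb in set(zip(seg_a[mask], seg_b[mask])):
        uf.union(int(ida), int(idb))

# Toy example: the overlap shows that subvolume A's segment 1 and
# subvolume B's segment 7 are the same object.
overlap_a = np.array([[1, 1, 0], [0, 1, 0]])
overlap_b = np.array([[7, 7, 0], [0, 0, 2]])
uf = UnionFind()
equivalences_from_overlap(overlap_a, overlap_b, uf)
relabel = lambda seg: np.vectorize(lambda v: uf.find(int(v)) if v else 0)(seg)
print(relabel(overlap_b))  # segment 7 is remapped to segment 1's root
```

In a real pipeline you would accumulate equivalences from every pair of overlapping subvolumes into the same union-find structure, then stream each subvolume through `relabel` while writing it into chunked volumetric storage such as TensorStore.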

For training, the code is currently configured to use asynchronous SGD. One can start a process as a 'parameter server', and then some number of independent workers (one GPU each; can be on different machines) which connect to it and train together as a flock.

mjanusz avatar Oct 15 '21 10:10 mjanusz

Really appreciate the reply!

I have generated some overlapping labels. I noticed there are some resegmentation functions in the repo, but I am not sure whether they implement the approach you describe, and I could not find any scripts to run them. It would be even better if you could provide a script for the reconciliation step.

Thanks again for the explanation.

DeadpanZiao avatar Oct 19 '21 06:10 DeadpanZiao

We unfortunately don't have this functionality in the main ffn repo, but I'm aware of at least one third party solution (https://github.com/Hanyu-Li/klab_utils/tree/master/klab_utils/ffn/reconciliation) which you might be able to use. IIUC, the process is to run remap.py (unique IDs per subvolume), find_graph.py (equivalences from overlapping subvolumes) and agglomerate_cv.py (update IDs according to the graph built in the previous step).
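The actual interfaces of those third-party scripts may differ, but the idea behind the first step (remap) can be sketched as follows: offset each subvolume's local segment IDs so that no two subvolumes share an ID, which is what makes the later equivalence graph unambiguous (`remap_to_global` is a hypothetical helper for illustration):

```python
import numpy as np

def remap_to_global(subvolumes):
    """Offset each subvolume's local segment IDs so they are globally
    unique (background 0 stays 0), mimicking the 'remap' step.

    Returns the remapped volumes plus, per subvolume, the local->global
    mapping needed later when building the equivalence graph.
    """
    remapped, mappings, next_id = [], [], 1
    for seg in subvolumes:
        local_ids = np.unique(seg[seg != 0])
        mapping = {int(l): next_id + i for i, l in enumerate(local_ids)}
        next_id += len(local_ids)
        out = np.zeros_like(seg)
        for local, glob in mapping.items():
            out[seg == local] = glob
        remapped.append(out)
        mappings.append(mapping)
    return remapped, mappings

a = np.array([[1, 1], [0, 2]])
b = np.array([[1, 0], [3, 3]])  # reuses ID 1 for a different segment
(ra, rb), (ma, mb) = remap_to_global([a, b])
print(rb)  # b's IDs no longer collide with a's
```

With globally unique IDs in place, the find_graph step reduces to recording co-occurring ID pairs in the overlap regions, and the final step rewrites each subvolume according to the connected components of that graph.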

mjanusz avatar Oct 19 '21 14:10 mjanusz

Really appreciate it. I will spend some time running the code and trying to apply it to our labels. By the way, I am wondering how you generate large label volumes. Large-scale labeling seems to be heavily limited by memory. We have 8 Tesla V100 GPUs, but as far as I can tell, it would take months to label the whole volume. Is this a problem for you as well?

Kind regards.

DeadpanZiao avatar Oct 26 '21 01:10 DeadpanZiao