
Multiprocessing error during validation in LocalTorch compute context

Open · atc3 opened this issue on Sep 25, 2024 · 3 comments

Describe the bug

When running cosem_example.ipynb on a local workstation with GPUs, the validation step during training throws the following error:

...
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
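
For context, here is a minimal sketch of what I believe is the underlying failure mode (not DaCapo code, just an illustration): CUDA gets initialized in the parent training process, and the prediction workers are then started with the default fork start method, so any CUDA call in the child fails with this exact error.

import torch
import torch.multiprocessing as mp

def worker():
    # Touching CUDA in a forked child fails, because the parent's CUDA
    # context cannot be inherited across fork().
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")   # parent initializes CUDA
    ctx = mp.get_context("fork")    # the default start method on Linux
    p = ctx.Process(target=worker)
    p.start()
    p.join()                        # child exits with the same RuntimeError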

If I directly call validate_run outside of train_run, I get the same error:

from dacapo import validate_run

validate_run("cosem_distance_run_4nm", 2000)
Creating FileConfigStore:
	path: /home/[email protected]/dacapo/configs
Creating local weights store in directory /home/[email protected]/dacapo
Retrieving weights for run cosem_distance_run_4nm, iteration 2000
Validating run cosem_distance_run_4nm at iteration 2000...
Creating FileStatsStore:
	path    : /home/[email protected]/dacapo/stats
Validating run cosem_distance_run_4nm on dataset jrc_hela-2_recon-1/labels/groundtruth/crop6/[mito]_gt_jrc_hela-2_recon-1/labels/groundtruth/crop6/mito_s1_uint8_None_4nm
validation inputs already copied!
Predicting with input size (2304, 2304, 2304), output size (848, 848, 848)
Total input ROI: [11272:13728, 872:3328, 11352:13808] (2456, 2456, 2456), output ROI: [12000:13000, 1600:2600, 12080:13080] (1000, 1000, 1000)
Running blockwise prediction with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Running blockwise with worker_file:  /home/[email protected]/dacapo-ml/dacapo/blockwise/predict_worker.py
Using compute context: LocalTorch(distribute_workers=False, _device=None, oom_limit=4.2)
ERROR:daisy.worker:worker (hostname=10.101.50.108:port=35859:task_id=predict_worker2024-09-25_16-08-03:worker_id=2) received exception: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Happy to provide a full stack trace if it helps.

I tried to fix this by explicitly setting the torch multiprocessing start method to 'spawn', but that surfaced a different error, so I decided not to go further down that hole. I then got around the error by enabling distribute_workers in the LocalTorch compute context, which somehow fixes the issue (see the sketch below).
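
Roughly what I tried (the DaCapo config keys in the comments are my guess based on the repr printed in the log above, so treat this as a sketch, not exact config):

import torch.multiprocessing as mp

# Attempt 1: force the spawn start method before training/validation.
# This got past the fork error but surfaced a different error.
mp.set_start_method("spawn", force=True)

# Attempt 2 (the workaround that worked for me): enable distribute_workers
# on the LocalTorch compute context, matching the repr in the log above,
# e.g. in the compute context section of my DaCapo config (key names are
# a guess):
#
#   compute_context:
#     type: LocalTorch
#     config:
#       distribute_workers: true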

To Reproduce

Just run cosem_example.ipynb on any local workstation with a GPU

Versions:

  • OS: Ubuntu 22.04
  • CUDA Version: 12.2
  • GPUs: 3 x NVIDIA RTX A5000, 24 GB memory each
