
Parallel map function can hang during interruption or externally killed workers

Open Purg opened this issue 8 years ago • 7 comments

When Ctrl-C'ing a parallel-map in progress, a deadlock can occur.

It has also been seen that if the workers are doing web requests, they can lock up, possibly because the request waits indefinitely. When the threads or processes are then killed externally, the function deadlocks and can't clean itself up properly.
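
To illustrate the failure mode, here is a minimal sketch of a parallel map whose collector never waits indefinitely on its workers. This is not SMQTK's `parallel_map`; the function name `bounded_parallel_map` and its `workers` and `timeout` parameters are hypothetical, chosen only for this example. Every wait on the result queue is bounded, so a worker that is stuck or has died produces an error in the caller rather than a hang.

```python
import queue
import threading


def bounded_parallel_map(func, items, workers=2, timeout=5.0):
    """Map func over items in daemon threads, returning results in input order.

    `timeout` is the longest the caller waits for the next result before
    giving up. All names here are illustrative, not part of SMQTK.
    """
    items = list(items)
    tasks = queue.Queue()
    results = queue.Queue()
    for index, item in enumerate(items):
        tasks.put((index, item))

    def worker():
        while True:
            try:
                index, item = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            try:
                results.put((index, func(item), None))
            except Exception as exc:
                # Report the failure to the collector instead of dying silently.
                results.put((index, None, exc))

    for _ in range(workers):
        # Daemon threads do not keep the interpreter alive if one gets stuck.
        threading.Thread(target=worker, daemon=True).start()

    output = [None] * len(items)
    for _ in items:
        try:
            # Bounded wait: a stuck or dead worker surfaces as an error here.
            index, value, error = results.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(
                "no result within %.1f s; a worker may be stuck" % timeout
            )
        if error is not None:
            raise error
        output[index] = value
    return output
```

For example, `bounded_parallel_map(lambda x: x * x, range(5))` returns the squares in input order, while a worker function that blocks longer than `timeout` raises `TimeoutError` in the caller. A stuck worker thread is abandoned rather than cleaned up, so this only addresses the hang in the collecting side, not the underlying stuck request.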

Purg avatar Mar 03 '16 15:03 Purg

Since this can happen in the middle of GPU work, it can be left in a state where the GPU doesn't get to free its memory.

FWIW, this seems to be the best course of action: stopping X, calling nvidia-smi --gpu-reset, and starting X again.

danlamanna avatar Nov 16 '16 20:11 danlamanna

Haven't seen that yet... I'm assuming it happened to you?

Purg avatar Nov 16 '16 21:11 Purg

Yes.

danlamanna avatar Nov 16 '16 21:11 danlamanna

Welp, more reason to fix this thing again...

Purg avatar Nov 16 '16 21:11 Purg

I'm seeing something similar here @danlamanna and @Purg when trying the SMQTK quickstart and docker. I have 50 images and it just hangs building the network... sometimes it gets to batch 2, sometimes it stays in batch 1:

I0422 04:53:40.881229    18 net.cpp:752] Ignoring source layer loss
  DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Network data shape: (10, 3, 227, 227)
  DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer
  DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -> {'data': (10, 3, 227, 227)}
  DEBUG - 2018-04-22 04:53:40,951 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Loading image mean
  DEBUG - 2018-04-22 04:53:40,952 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Image mean file not a numpy array, assuming protobuf binary.
  DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- mean
  DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- transpose
  DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- channel swap
   INFO - 2018-04-22 04:53:41,329 - __main__.run_file_list - Computing descriptors
  DEBUG - 2018-04-22 04:53:41,330 - smqtk.compute_functions.compute_many_descriptors - Using single async call
  DEBUG - 2018-04-22 04:53:41,331 - smqtk.compute_functions.compute_many_descriptors - Computing descriptors
  DEBUG - 2018-04-22 04:53:41,331 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Checking content types; aggregating data/descriptor elements.
  DEBUG - 2018-04-22 04:53:41,332 - smqtk.utils.parallel[check-file-type].parallel_map - Using all cores (2)
  DEBUG - 2018-04-22 04:53:42,613 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.report_progress - Loops per second 29.597158 (avg 29.597158) (31 this interval / 31 total)
  DEBUG - 2018-04-22 04:53:43,505 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Given 49 unique data elements
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - 0 descriptors already computed
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Converting deque to tuple for segmentation
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Processing 6 batches of size 8
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Processing tail group of size 1
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Starting batch: 1 of 6
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._process_batch - Updating network data layer shape (8 images)
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._process_batch - Loading image pixel arrays
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.utils.parallel.parallel_map - Using all cores (2)

Any ideas?

chrismattmann avatar Apr 22 '18 04:04 chrismattmann

BTW I'm using the SMQTK and Image Space quickstart dockers... the ones that ref one another.

chrismattmann avatar Apr 22 '18 04:04 chrismattmann

FWIW I was able to get this working, but only by repeatedly stopping and starting the smqtk-services docker... over and over... Occasionally it works all the way through my 6 batches of ~50 images, but 90% of the time it just hangs.

chrismattmann avatar Apr 23 '18 03:04 chrismattmann