chainer-segnet

The program hangs on both CPU and GPU

Open pengzhao-intel opened this issue 7 years ago • 0 comments

Hi SegNet maintainer,

Thanks for this great Chainer implementation of SegNet. We'd like to reproduce the results, but the program hangs and we also see some strange behavior.

Is there anything we missed?

Please take a look; details are below.

Software:

  • Python: 3.6.0
  • Chainer: 1.21.0
  • NumPy: 1.12.0
  • OpenCV: 3.2.0

Hardware:

  • CPU: Xeon E5-2699
  • GPU: TESLA M40

Training method:

  • Data collection according to README.md: bash experiments/download.sh, then python lib/calc_mean.py
  • Use experiments/train.sh from the repo, with epoch = 5 and --gpus -1
  • sbatch experiments/train.sh

Hang details:

On CPU, the hang is located in chainer/training/updater.py, ParallelUpdater.update_core(). [image]

On GPU, it is blocked in iterators/multiprocess_iterator.py, MultiprocessIterator._get(). [image]

After interrupting the process manually, we get the following software stack: [image]
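For anyone trying to reproduce this, the stack of a hung run can also be captured without killing it. The following is a minimal sketch using only Python's standard `faulthandler` module; the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical and are not part of chainer-segnet or Chainer. It assumes the helper is called near the top of the training script, before training starts.

```python
import faulthandler
import os
import signal
import sys


def enable_hang_diagnostics(timeout_s=600, stream=None):
    """Arrange for Python tracebacks of all threads to be written to `stream`.

    Tracebacks are dumped every `timeout_s` seconds, and also on demand
    when the process receives SIGUSR1 (POSIX only).
    Returns the PID of the current process so it can be signalled later.
    """
    # `stream` must be a real file object (faulthandler needs a file descriptor).
    stream = sys.stderr if stream is None else stream

    # Periodic watchdog dump; the process keeps running after each dump.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)

    # On-demand dump, e.g. `kill -USR1 <pid>` from another shell.
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)

    return os.getpid()


def disable_hang_diagnostics():
    """Undo what enable_hang_diagnostics() set up."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)


# Hypothetical usage at the top of a training script:
#     pid = enable_hang_diagnostics(timeout_s=300)
#     print("send SIGUSR1 to", pid, "to dump stacks")
```

This only shows where the Python code is waiting; it does not by itself explain why the updater or the iterator is blocked.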

Multiple forks: While the script was running, we saw many copies of Python being launched (about 40), almost all of them in the sleep state. This stage is also very slow. Is this the expected launch behavior?

Also, I see that Chainer 2.0 was released last week; we will try the new version soon.

Thanks,

pengzhao-intel, Mar 01 '17 01:03