chainer-segnet
The program hangs on both CPU and GPU
Hi SegNet maintainer,
This is a great Chainer implementation of SegNet. We would like to reproduce the results, but the program hangs, and we also see some strange behavior.
Are we missing anything?
Please take a look; the details are below.
Software:
- Python: 3.6.0
- Chainer: 1.21.0
- NumPy: 1.12.0
- OpenCV: 3.2.0
Hardware:
- CPU: Xeon E5-2699
- GPU: Tesla M40
Training method:
- Collect the data according to README.md:
  bash experiments/download.sh
  python lib/calc_mean.py
- Use experiments/train.sh from the repo with epoch = 5 and --gpus -1
- Submit the job: sbatch experiments/train.sh
The hang occurs at the following locations:
On CPU, we traced it to chainer/training/updater.py: ParallelUpdater: update_core()
On GPU, the program is blocked in iterators/multiprocess_iterator.py: MultiprocessIterator: _get()
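For anyone trying to reproduce this, the blocked frames can also be captured without killing the job. The following is a minimal sketch that uses only Python's standard-library faulthandler module. The helper name install_stack_dump is hypothetical and not part of the repo. It assumes a POSIX system where SIGUSR1 is available, and it would need to be called near the top of the training script.

```python
import faulthandler
import signal
import sys


def install_stack_dump(signum=signal.SIGUSR1, stream=None):
    """Dump the Python tracebacks of all threads when `signum` arrives.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to the process's stderr.
    """
    if stream is None:
        stream = sys.stderr
    # faulthandler writes the traceback from a C-level signal handler,
    # so it still works when the interpreter is blocked in a system call.
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum


# Hypothetical usage inside the training script:
#     install_stack_dump()
# and then, from another shell, with the PID of the hung process:
#     kill -USR1 <pid>
```

Each worker process would need the handler installed separately if its stack is of interest, since the dump only covers the process that receives the signal.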
When we interrupt the process manually, we get the software stack below:
Multiple forks: while the script is running, we see many copies of Python being launched (about 40), and almost all of them are in the sleep state. This stage is also very slow. Is this the expected way of launching?
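One possible explanation, stated as an assumption and not verified against the Chainer source: if the data iterator forks one worker per logical CPU when no worker count is given, a machine with about 40 logical CPUs would show about 40 mostly idle Python processes. The sketch below only illustrates that rule with the standard library; default_worker_count is a hypothetical helper, not a Chainer function.

```python
import multiprocessing


def default_worker_count(n_processes=None):
    """Hypothetical helper mirroring a 'one worker per logical CPU' default."""
    # Assumption: when no explicit count is given, one worker process
    # is forked for every logical CPU reported by the OS.
    return n_processes or multiprocessing.cpu_count()


if __name__ == "__main__":
    print("logical CPUs:", multiprocessing.cpu_count())
    print("workers with default settings:", default_worker_count())
    print("workers when capped at 4:", default_worker_count(4))
```

If the printed CPU count on the training node is close to 40, the number of processes we observe would be consistent with this assumption.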
Also, I see that Chainer 2.0 was released last week; we will try the new version soon.
Thanks,