AlphaPose
Training Process Stuck
I started training my model with AlphaPose, following the example in README.md:
```
./scripts/train.sh ./configs/coco/resnet/256x192_res50_lr1e-3_1x.yaml exp_fastpose
```
Everything seems perfect until:
```
Create new model
=> init weights
loading annotations into memory...
Done (t=5.33s)
creating index...
index created!
############# Starting Epoch 0 | LR: 0.001 #############
  0%|          | 0/9364 [00:00<?, ?it/
```
After some debugging, I'm fairly sure something goes wrong in multiprocessing or torch.nn.DataParallel(). I forgot to save the error message, so I'm trying to reproduce it and will provide more details soon.
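To narrow it down, one generic check is a standalone nn.DataParallel forward pass outside AlphaPose; if this minimal sketch also hangs, the problem is in multi-GPU replication itself rather than in the training pipeline. The model and shapes below are placeholders, not the actual FastPose network.

```python
# Minimal sketch: run a standalone nn.DataParallel forward pass, independent of
# the AlphaPose dataloader. The model below is a placeholder, not FastPose.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 17, kernel_size=3, padding=1),  # 17 channels as a heatmap stand-in
).cuda()
model = nn.DataParallel(model)  # replicates across all visible GPUs

inps = torch.randn(8, 3, 256, 192).cuda()  # same H x W as the 256x192 config
with torch.no_grad():
    out = model(inps)
print("DataParallel forward finished:", out.shape)
```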
I checked the existing issues and didn't find the same problem reported.
My env: python=3.6, torch=1.1.0, cuda=10.2
Looking forward to any reply or further discussion about the details.
Many thanks!
More details below:
```
Traceback (most recent call last):
  File "./scripts/train.py", line 344, in <module>
    main()
  File "./scripts/train.py", line 293, in main
    loss, miou = train(opt, train_loader, m, criterion, optimizer, writer)
  File "./scripts/train.py", line 52, in train
    output = m(inps)
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
```
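The traceback above only shows the main thread blocked in `thread.join()` inside `parallel_apply`, i.e. it is waiting on one of the per-GPU worker threads. To see where those workers are actually stuck while the process hangs, one generic option (plain Python, nothing AlphaPose-specific) is to register a `faulthandler` signal handler near the top of `./scripts/train.py`:

```python
# Sketch: dump every thread's stack on demand while the process appears stuck.
# Add near the top of ./scripts/train.py, then send SIGUSR1 to the hanging
# process (kill -USR1 <pid>) to print all thread tracebacks to stderr.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```

If `py-spy` is installed, `py-spy dump --pid <pid>` gives the same kind of per-thread stack dump without modifying the code.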
I'm not sure about the reason. Could you try running `python ./scripts/train.py --exp-id ${EXPID} --cfg ${CONFIG} --sync`? Or could you try reducing the thread count with `--nThreads 0`? The default nThreads is set to 60.
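Assuming `--nThreads` maps to the `DataLoader`'s `num_workers` (as it usually does in this kind of training script), setting it to 0 keeps all data loading in the main process and takes worker-process deadlocks out of the picture. A generic way to test the same idea in isolation, with a stand-in dataset rather than the real COCO dataset, is:

```python
# Sketch: iterate a stand-in dataset with num_workers=0 (main process only)
# vs. a high worker count, to see which configuration stalls. In practice the
# dataset the config builds would be used instead of this random TensorDataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 256, 192))

for workers in (0, 8):
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=workers)
    for i, (batch,) in enumerate(loader):
        pass  # just pull every batch through the loader
    print(f"num_workers={workers}: iterated {i + 1} batches")
```

If the `num_workers=0` run iterates fine but the high-worker run stalls, the hang is in the dataloader workers rather than in `DataParallel`.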