
Training Process Stuck

Open · nicewiz opened this issue 2 years ago · 2 comments

I started training my model with AlphaPose, following the example in README.md:

```
./scripts/train.sh ./configs/coco/resnet/256x192_res50_lr1e-3_1x.yaml exp_fastpose
```

Everything seems fine until this point, where the progress bar hangs at 0%:

```
Create new model
=> init weights
loading annotations into memory...
Done (t=5.33s)
creating index...
index created!
############# Starting Epoch 0 | LR: 0.001 #############
  0%|          | 0/9364 [00:00<?, ?it/
```

After some debugging, I'm fairly sure something went wrong in multiprocessing or torch.nn.DataParallel(). I forgot to save the error message, so I'm trying to reproduce it and will provide more details soon.

I checked the existing issues and didn't find the same problem reported.

My env: python=3.6, torch=1.1.0, cuda=10.2

Looking forward to any reply or further discussion.

Many thanks!

nicewiz · Mar 26 '22 07:03

More details below:

```
Traceback (most recent call last):
  File "./scripts/train.py", line 344, in <module>
    main()
  File "./scripts/train.py", line 293, in main
    loss, miou = train(opt, train_loader, m, criterion, optimizer, writer)
  File "./scripts/train.py", line 52, in train
    output = m(inps)
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/threading.py", line 1056, in join
    self._wait_for_tstate_lock()
  File "/home/chan/anaconda3/envs/alphapose/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
```
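The traceback shows the main thread was interrupted while blocked in thread.join() inside DataParallel's parallel_apply, i.e. the process was still waiting on a GPU replica thread when Ctrl-C was pressed. To capture where every thread is stuck the next time the hang occurs, the standard-library faulthandler module can dump all thread stacks periodically. A minimal sketch; the 30-second interval is an arbitrary choice, and placing it near the top of ./scripts/train.py is an assumption, not part of AlphaPose:

```python
# Hypothetical diagnostic addition, not part of AlphaPose: place near the
# top of ./scripts/train.py, before training starts.
import sys
import faulthandler

# Dump the stack of every thread to stderr every 30 seconds (repeat=True),
# so a deadlocked DataParallel or DataLoader thread shows up in the log.
faulthandler.dump_traceback_later(30, repeat=True, file=sys.stderr)
```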

nicewiz · Mar 26 '22 07:03

I'm not sure about the cause. Could you try running python ./scripts/train.py --exp-id ${EXPID} --cfg ${CONFIG} --sync? Or could you try reducing the number of data-loading threads with --nThreads 0? The default nThreads is set to 60.
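For context on the --nThreads suggestion: a hang at 0% of the first epoch is a classic symptom of a stuck data-loading worker, and the flag name suggests it controls the worker count. A minimal plain-PyTorch sketch of the same idea, assuming --nThreads is forwarded to the DataLoader's num_workers (the dataset below is a placeholder, not AlphaPose's loader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the COCO training set.
dataset = TensorDataset(torch.randn(64, 3, 256, 192))

# num_workers=0 loads batches in the main process: slower, but it rules out
# multiprocessing deadlocks as the cause of a hang at the start of an epoch.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for (batch,) in loader:
    pass  # the training step would run here
```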

HaoyiZhu · Jun 21 '22 08:06