
Training usually doesn't start

Open max-reuter-2 opened this issue 6 years ago • 8 comments

I'm running this command: model=wide-resnet widen_factor=4 depth=40 dropout=0.3 ./scripts/debug_cifar.sh

Most of the time (80%+), the program will reach the point where it prints this:

Network has 40 convolutions
Will save at logs/wide-resnet_1639021580
tput: No value for $TERM and no -T specified

...then it will do nothing. The other 20% of the time, it will begin training and printing out each epoch and its progress.

After a bit of debugging, I found that the stall occurs at engine:train in train.lua.

How can I fix this?

max-reuter-2 avatar Nov 30 '17 17:11 max-reuter-2

hm, that's odd, can you remove tee and check the output?

szagoruyko avatar Dec 12 '17 09:12 szagoruyko

What do you mean by tee?

max-reuter-2 avatar Dec 18 '17 16:12 max-reuter-2

https://github.com/szagoruyko/wide-residual-networks/blob/master/scripts/train_cifar.sh#L15 https://en.wikipedia.org/wiki/Tee_(command)

szagoruyko avatar Dec 18 '17 19:12 szagoruyko

If what you mean is to change this line in train_cifar.sh:

th train.lua | tee $save/log.txt

to this:

th train.lua

then it is still stalling.
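For anyone reproducing this, the two variants can be compared side by side with a rough sketch like the one below. It does not call Torch: `fake_train` is a hypothetical stand-in for `th train.lua`, and `save` is a temporary directory standing in for the `$save` log directory used by train_cifar.sh.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `th train.lua`; the real command needs Torch installed.
fake_train() {
  echo "Network has 40 convolutions"
  echo "epoch 1"
}

# Hypothetical log directory, playing the role of $save in train_cifar.sh.
save=$(mktemp -d)

# Variant A: output is shown on the terminal and also copied to log.txt by tee.
fake_train | tee "$save/log.txt"

# Variant B: tee removed, so output goes only to the terminal and no log is written.
fake_train
```

If the run stalls in both variants, as reported here, the pipe through tee is unlikely to be the cause.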

max-reuter-2 avatar Dec 18 '17 19:12 max-reuter-2

hm, I'd assume that would be threads then, but these issues should have been fixed years ago. can you update threads and torchnet?

szagoruyko avatar Dec 18 '17 19:12 szagoruyko

I updated threads and torchnet, but I'm still getting the issue.

max-reuter-2 avatar Dec 18 '17 20:12 max-reuter-2

@soumith maybe you've seen issues like that with latest lua torch?

szagoruyko avatar Dec 18 '17 22:12 szagoruyko

lua-torch hasn't updated its packages since July 2017: https://github.com/torch/distro/commits/master

I'm not sure what changed.

soumith avatar Dec 18 '17 22:12 soumith