
Fails to resume training CPU-->GPU

Open barrh opened this issue 5 years ago • 2 comments

When I try to resume training (with compress_classifier.py), load_checkpoint() fails with a KeyError while loading the compression scheduler. This happens only when resuming training on GPU from a checkpoint that was previously saved on CPU. It's quite plausible that a similar bug affects resuming the optimizer, and loading the compression scheduler just fails first; a problem with loading the optimizer would probably only surface during actual training.

code: (master is d59888c9d0e6539edc35c86c6e7c3c869acb90f1)

$ python3 examples/classifier_compression/compress_classifier.py --arch=resnet20_cifar ../data.cifar10/ --compress=/tmp/lr.yaml --resume=/tmp/2019.03.13-003504/checkpoint.pth.tar

Log file for this run: /tmp/2019.03.13-003614/2019.03.13-003614.log
=> creating resnet20_cifar model for CIFAR10


Logging to TensorBoard - remember to execute the server:

tensorboard --logdir='./logs'

=> loading checkpoint /tmp/2019.03.13-003504/checkpoint.pth.tar best top@1: 9.740

Log file for this run: /tmp/2019.03.13-003614/2019.03.13-003614.log
Traceback (most recent call last):
  File "/distiller/apputils/checkpoint.py", line 106, in load_checkpoint
    compression_scheduler.load_state_dict(checkpoint['compression_sched'], normalize_dataparallel_keys)
  File "/distiller/distiller/scheduler.py", line 233, in load_state_dict
    masker.mask = loaded_masks[name]
KeyError: 'module.conv1.weight'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/classifier_compression/compress_classifier.py", line 752, in <module>
    main()
  File "examples/classifier_compression/compress_classifier.py", line 163, in main
    model, compression_scheduler, start_epoch = apputils.load_checkpoint(model, chkpt_file=args.resume)
  File "distiller/distiller/apputils/checkpoint.py", line 111, in load_checkpoint
    compression_scheduler.load_state_dict(checkpoint['compression_sched'], normalize_dataparallel_keys)
  File "/distiller/distiller/scheduler.py", line 233, in load_state_dict
    masker.mask = loaded_masks[name]
KeyError: 'module.conv1.weight'

barrh avatar Mar 12 '19 22:03 barrh

This happens only when trying to resume training on GPU, when the checkpoint was previously saved for CPU.

Yes, we support training on the GPU and then loading on the CPU, but not vice versa. This is because we don't think it's very practical to train on the CPU anything of significant size, and so we didn't invest time in getting this direction to work.

nzmora avatar Mar 12 '19 23:03 nzmora
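The KeyError itself comes from a naming mismatch, not from device placement as such: wrapping a model in nn.DataParallel (as is done for GPU training) registers the original model as a submodule named "module", so every state-dict key gains a "module." prefix that a checkpoint saved from a plain CPU model does not have. A minimal sketch of the mismatch and of one way to normalize it (add_module_prefix is an illustrative helper, not Distiller's API):

```python
import torch
import torch.nn as nn

def add_module_prefix(state_dict):
    """Prepend the 'module.' prefix that nn.DataParallel adds to parameter
    names, so a dict saved from a plain (CPU) model matches the keys of a
    DataParallel-wrapped (GPU) model."""
    return {'module.' + k: v for k, v in state_dict.items()}

# Toy model standing in for resnet20_cifar.
model = nn.Sequential(nn.Conv2d(3, 16, 3))
plain_keys = list(model.state_dict().keys())      # no 'module.' prefix

wrapped = nn.DataParallel(model)
wrapped_keys = list(wrapped.state_dict().keys())  # every key gains 'module.'

# The mismatch behind KeyError: 'module.conv1.weight'.
assert wrapped_keys == ['module.' + k for k in plain_keys]

# Normalizing the plain dict's keys makes it loadable into the wrapper.
wrapped.load_state_dict(add_module_prefix(model.state_dict()))
```

Loading GPU checkpoints on CPU works in the supported direction because Distiller strips the prefix there (the normalize_dataparallel_keys argument visible in the traceback); the reverse mapping is what's missing.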

The prioritization argument makes total sense. However, the error should be clearer to the user, probably something like NotImplementedError('CPU-to-GPU checkpoint imports are not supported').

Also, consider the following scenario: resuming from a checkpoint (weights only) and then applying quantization on CPU (which doesn't sound far-fetched). This bug would render the results useless for a GPU-equipped target system. Perhaps we should go as far as preventing non-evaluation CPU jobs.

barrh avatar Mar 13 '19 00:03 barrh
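The guard proposed above could translate the bare KeyError into an actionable message. A hedged sketch, assuming the mask lookup happens at a single point; load_masks_for_param and its signature are illustrative names, not Distiller's actual API:

```python
def load_masks_for_param(loaded_masks, param_name):
    """Look up a pruning mask by parameter name; if the miss looks like a
    CPU-saved checkpoint (keys lack the DataParallel 'module.' prefix)
    being resumed on GPU, raise a clear NotImplementedError instead of a
    bare KeyError."""
    try:
        return loaded_masks[param_name]
    except KeyError:
        stripped = param_name[len('module.'):]
        if param_name.startswith('module.') and stripped in loaded_masks:
            raise NotImplementedError(
                "CPU-to-GPU checkpoint imports are not supported: mask for "
                "'{}' was saved without the 'module.' prefix".format(param_name))
        raise
```

This keeps the original KeyError for genuinely missing parameters, and only rewrites the exception when the unprefixed key is present, i.e. when the failure is clearly the unsupported CPU-to-GPU direction.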