mxnet gluon example high CPU usage
Describe the bug
python3 /usr/local/byteps/launcher/launch.py python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py
Training runs on the GPU normally. Oddly, as soon as evaluate is invoked, CPU usage skyrockets to ~87% (the worker has 48 CPUs) and there is no GPU usage. The evaluate function also takes a long time to run. This problem occurs every time.
We traced this to the validation DataLoader:
https://github.com/bytedance/byteps/blob/master/example/mxnet/train_gluon_mnist_byteps.py#L69
val_iter = gluon.data.DataLoader(val_set, args.batch_size, False, num_workers=0)
When I set num_workers=args.j (default 2), everything returns to normal.
val_iter = gluon.data.DataLoader(val_set, args.batch_size, False, num_workers=args.j)
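To compare the two settings with numbers rather than by watching top, one option is to time a single pass over the loader. The sketch below is only an illustration: profile_pass is a hypothetical helper (not part of BytePS or MXNet) and it uses only the Python standard library, so it works on any iterable.

```python
import time


def profile_pass(batches):
    """Iterate once over `batches`; return (count, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process, summed over its threads
    count = 0
    for _ in batches:
        count += 1
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return count, wall, cpu


if __name__ == "__main__":
    # Stand-in iterable; in the example script one would pass val_iter instead.
    count, wall, cpu = profile_pass(range(1000))
    print(f"{count} batches, wall={wall:.4f}s, cpu={cpu:.4f}s")
```

Calling profile_pass(val_iter) once with num_workers=0 and once with num_workers=args.j should show the difference in wall time. A cpu/wall ratio well above 1 would suggest that many threads in the main process are busy. Note that time.process_time() does not count CPU time spent in child processes, so with num_workers > 0 the CPU figure covers only the main process.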
To Reproduce Steps to reproduce the behavior:
- for each worker, run
python3 /usr/local/byteps/launcher/launch.py python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py
- Observe high CPU usage and no GPU usage once evaluate is invoked
Expected behavior: no such high CPU usage, and evaluate should run quickly.
Environment (please complete the following information):
run in docker
Thank you for the fix!
This problem looks weird. Does it also happen if the train loader uses num_workers=0?
Thank you for your prompt reply.
When I set num_workers=0 for the train loader, the same thing happens: high CPU usage and no GPU usage, and training is also too slow. This is probably a bug in MXNet.
@ymjiang I have reported this bug in https://github.com/apache/incubator-mxnet/issues/17381