mxnet gluon example high CPU usage

Open jasperzhong opened this issue 5 years ago • 3 comments

Describe the bug

python3 /usr/local/byteps/launcher/launch.py python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py

Training runs on the GPU normally. Strangely, as soon as evaluate is invoked, CPU usage jumps to ~87% (the worker has 48 CPUs) and GPU usage drops to zero. The evaluate function also takes a long time to run. This happens every time.
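One way to put numbers on this symptom is to compare wall-clock time with the CPU time consumed by the process while evaluate runs. The sketch below uses only the Python standard library; the helper name `measure` is hypothetical and is not part of the BytePS example. Note that `time.process_time()` counts CPU time across all threads of the current process but not of child processes, so it would not include DataLoader worker processes when num_workers > 0.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, cpu_seconds, cpu_per_wall).

    `measure` is a hypothetical helper for illustration only.
    cpu_per_wall well above 1.0 suggests many CPU threads were busy
    in this process while fn was running.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else float("nan")
    return result, wall, cpu, ratio


if __name__ == "__main__":
    # Stand-in workload; in the training script, pass evaluate and its
    # arguments here instead.
    _, wall, cpu, ratio = measure(sum, range(1_000_000))
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s cpu/wall={ratio:.2f}")
```

Wrapping the evaluate call this way, once with num_workers=0 and once with num_workers=args.j, would show whether the slowdown coincides with a high CPU-to-wall ratio in the main process.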

We found that the cause is the validation DataLoader.

https://github.com/bytedance/byteps/blob/master/example/mxnet/train_gluon_mnist_byteps.py#L69

val_iter = gluon.data.DataLoader(val_set, args.batch_size, False, num_workers=0)

When I set num_workers=args.j (default 2), evaluation behaves normally.

val_iter = gluon.data.DataLoader(val_set, args.batch_size, False, num_workers=args.j)
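The workaround above can be sketched so that both loaders take their worker count from the same command-line option. This is a minimal sketch, not the example script itself: the flag name `--j` is inferred from `args.j`, the batch size default of 64 is an assumption, and `parse_args`, `loader_kwargs`, and `make_loaders` are hypothetical helper names. MXNet is imported lazily so the argument handling can be checked without it installed.

```python
import argparse


def parse_args(argv=None):
    # Flag names and the batch-size default are assumptions for illustration.
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--j", type=int, default=2,
                        help="number of worker processes for data loading")
    return parser.parse_args(argv)


def loader_kwargs(args, shuffle):
    """Keyword arguments shared by the train and validation loaders."""
    return {
        "batch_size": args.batch_size,
        "shuffle": shuffle,
        "num_workers": args.j,  # avoid the num_workers=0 path described above
    }


def make_loaders(train_set, val_set, args):
    # Imported here so the helpers above work without MXNet installed.
    from mxnet import gluon

    train_iter = gluon.data.DataLoader(train_set, **loader_kwargs(args, True))
    val_iter = gluon.data.DataLoader(val_set, **loader_kwargs(args, False))
    return train_iter, val_iter
```

Building both loaders from one helper keeps the validation loader from silently falling back to num_workers=0 when the training loader is changed.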

To Reproduce

On each worker, run:

python3 /usr/local/byteps/launcher/launch.py python3 /usr/local/byteps/example/mxnet/train_gluon_mnist_byteps.py

When evaluate is invoked, CPU usage becomes very high and the GPU is idle.

Expected behavior

CPU usage should not spike like this, and evaluation should be fast.

Environment

Running in Docker.

Jan 17 '20 11:01 jasperzhong

Thank you for the fix!

This problem looks weird. Does it also happen if the train loader uses num_workers=0?

Jan 17 '20 11:01 ymjiang

Thank you for your prompt reply.

When I set num_workers=0 for the train loader, the same thing happens: high CPU usage and no GPU usage, and training is also very slow. This is probably a bug in MXNet.

Jan 17 '20 16:01 jasperzhong

@ymjiang I have reported this bug in https://github.com/apache/incubator-mxnet/issues/17381

Jan 19 '20 12:01 jasperzhong
