Weird non-deterministic behavior on PyTorch ImageNet example
Repro
- Apply https://github.com/pytorch/examples/pull/381 on this repo
- cd to the ImageNet folder
- Run python main.py --arch resnet18 --seed 0 --gpu 0 /path/to/imagenet/ on a multi-GPU machine, once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1.
Environment
- PyTorch master
- CUDA 9.0
- Driver 384.81
- Ubuntu 16.04
Expected behavior
The two runs have the same output.
Actual behavior
The two runs have the same output when you run them one after the other (e.g. GPU 0 first, then Ctrl-C, then GPU 1). But when you run them at the same time, you get different output.
Suspicion
My suspicion is that this is a driver bug. I don't see how PyTorch could bypass CUDA_VISIBLE_DEVICES-based GPU segregation, but posting here for visibility anyway.
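One way to sanity-check the segregation assumption is to confirm what each process should actually see. The sketch below is illustrative, not part of the repro: parse_visible_devices is a hypothetical helper that just models the standard comma-separated semantics of CUDA_VISIBLE_DEVICES.

```python
import os

def parse_visible_devices(env_value):
    """Map a CUDA_VISIBLE_DEVICES string to the list of physical GPU
    ordinals the process is allowed to see (hypothetical helper;
    standard comma-separated semantics assumed)."""
    if env_value is None:
        return None  # no masking: all devices visible
    ids = [s.strip() for s in env_value.split(",") if s.strip()]
    return [int(s) for s in ids]

# With CUDA_VISIBLE_DEVICES=1, device 0 inside the process maps to
# physical GPU 1, so the two concurrent runs should never share a GPU.
print(parse_visible_devices("1"))    # [1]
print(parse_visible_devices("0,1"))  # [0, 1]
print(parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES")))
```

If the masking works as modeled here, the two runs operate on disjoint physical devices, which is why cross-run interference would be surprising.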
cc @shubho @SsnL @soumith @ailzhang
cc @ngimel
I remember that at some point cudnn.deterministic=True and cudnn.benchmark=True did not guarantee deterministic behavior between runs; there was a discussion in one of the PyTorch issues. Has this been fixed? (I'm on my phone right now, so it's hard for me to search the issues and look at the code.)
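For reference, the usual settings for asking cuDNN for deterministic kernels look roughly like this (a sketch, not a guarantee of bitwise reproducibility; note that benchmark should stay False, since benchmark=True selects algorithms by timing them, and the winner can differ between runs or under concurrent load):

```python
import torch

# Seed the PyTorch RNG so weight init and shuffling start identically.
torch.manual_seed(0)

# Request deterministic cuDNN algorithms. benchmark must remain False:
# with benchmark=True, cuDNN autotunes by timing candidate kernels, and
# the fastest one can vary from run to run, breaking determinism.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```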