
Weird non-deterministic behavior on PyTorch imagenet

jma127 opened this issue 6 years ago

Repro

  • Apply https://github.com/pytorch/examples/pull/381 on this repo
  • cd to the ImageNet folder
  • Run python main.py --arch resnet18 --seed 0 --gpu 0 /path/to/imagenet/ on a multi-GPU machine, once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1.
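For context on why `--gpu 0` is correct in both runs (general CUDA behavior, not specific to this repro): `CUDA_VISIBLE_DEVICES` remaps which physical devices a process can see, so device index 0 inside each process refers to a different physical GPU.

```python
import os

# CUDA_VISIBLE_DEVICES must be set before the process initializes CUDA.
# With CUDA_VISIBLE_DEVICES=1, physical GPU 1 appears as device 0 inside
# the process, so `--gpu 0` targets a different physical card in each run.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 1
```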

Environment

  • PyTorch master
  • CUDA 9.0
  • Driver 384.81
  • Ubuntu 16.04

Expected behavior

The two runs have the same output.

Actual behavior

The two runs have the same output when you run them one after the other (e.g. GPU 0 first, then Ctrl-C, then GPU 1). But when you run them at the same time, you get different output.

Suspicion

This is a driver bug. I don't know how PyTorch could bypass CUDA_VISIBLE_DEVICES-based GPU segregation, but I'm posting here for visibility anyway.

cc @shubho @SsnL @soumith @ailzhang

jma127 avatar Jul 07 '18 10:07 jma127

cc @ngimel


ssnl avatar Jul 07 '18 13:07 ssnl

I remember that at some point cudnn.deterministic=True combined with cudnn.benchmark=True did not guarantee deterministic behavior between runs; there was a discussion in one of the PyTorch issues. Has this been fixed? (I'm on my phone now, so it's hard for me to search the issues and look at the code.)
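A minimal sketch of the flag combination being discussed (assuming a recent PyTorch build; with benchmark=True, cuDNN times candidate convolution algorithms at runtime and the winner can vary between runs, so the usual recommendation for reproducibility is to disable it):

```python
import torch

# For reproducible runs: fix the seed, force deterministic cuDNN algorithms,
# and disable runtime algorithm benchmarking.
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Same seed -> same random tensors across repeated runs.
torch.manual_seed(0)
a = torch.randn(4)
torch.manual_seed(0)
b = torch.randn(4)
assert torch.equal(a, b)
```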

ngimel avatar Jul 08 '18 00:07 ngimel