
Weird non-deterministic behavior on PyTorch imagenet

jma127 opened this issue 6 years ago

Repro

  • Apply https://github.com/pytorch/examples/pull/381 on this repo
  • cd to the ImageNet folder
  • Run python main.py --arch resnet18 --seed 0 --gpu 0 /path/to/imagenet/ on a multi-GPU machine, once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1.
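For context on why `--gpu 0` is correct in both runs (general CUDA behavior, not specific to this repro): `CUDA_VISIBLE_DEVICES` remaps which physical devices a process can see, so device index 0 inside each process refers to a different physical GPU.

```python
import os

# CUDA_VISIBLE_DEVICES must be set before the process initializes CUDA.
# With CUDA_VISIBLE_DEVICES=1, physical GPU 1 appears as device 0 inside
# the process, so `--gpu 0` targets a different physical card in each run.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 1
```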

Environment

  • PyTorch master
  • CUDA 9.0
  • Driver 384.81
  • Ubuntu 16.04

Expected behavior

The two runs have the same output.

Actual behavior

The two runs have the same output when you run them one after the other (e.g. GPU 0 first, then Ctrl-C, then GPU 1). But when you run them at the same time, you get different output.

Suspicion

This is a driver bug. I don't know how PyTorch could bypass CUDA_VISIBLE_DEVICES-based GPU segregation, but I'm posting here for visibility anyway.

cc @shubho @SsnL @soumith @ailzhang

jma127 avatar Jul 07 '18 10:07 jma127

cc @ngimel


ssnl avatar Jul 07 '18 13:07 ssnl

I remember that at some point cudnn.deterministic=True combined with cudnn.benchmark=True did not guarantee deterministic behavior between runs; there was a discussion in one of the PyTorch issues. Has this been fixed? (I'm on my phone now, so it's hard for me to search the issues and look at the code.)
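A minimal sketch of the flag combination being discussed (assuming a recent PyTorch build; with benchmark=True, cuDNN times candidate convolution algorithms at runtime and the winner can vary between runs, so the usual recommendation for reproducibility is to disable it):

```python
import torch

# For reproducible runs: fix the seed, force deterministic cuDNN algorithms,
# and disable runtime algorithm benchmarking.
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Same seed -> same random tensors across repeated runs.
torch.manual_seed(0)
a = torch.randn(4)
torch.manual_seed(0)
b = torch.randn(4)
assert torch.equal(a, b)
```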

ngimel avatar Jul 08 '18 00:07 ngimel