[Bug] Synchronization issue on GPU
I'm using the v0.0.4 version from this branch: https://github.com/libffcv/ffcv/tree/v0.0.4
There's a (possibly major) bug where two models will not receive the same inputs from the FFCV dataloader unless torch.cuda.synchronize() is explicitly called. Below is a simple code snippet to reproduce the issue:
import torch
from torchvision.models import resnet18
from tqdm import tqdm
from copy import deepcopy

dataloader = create_ffcv_dataloader()  # Your own custom dataloader factory
model1 = resnet18(pretrained=False).cuda()
model2 = deepcopy(model1)

with torch.no_grad():
    for it, (imgs, *_) in enumerate(tqdm(dataloader)):
        model1(imgs)
        model2(imgs)
        # Uncommenting the following line makes the assertion at the bottom pass;
        # leaving it commented triggers an AssertionError
        # torch.cuda.synchronize()
        if it == 20:
            break

assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)
BatchNorm tracks running stats, which can be used to check whether two identical models received the same inputs on the forward pass. Without torch.cuda.synchronize(), the above code triggers an assertion error, since the two models received different inputs at some point. With torch.cuda.synchronize(), no assertion error is triggered.
Also, I have noticed that this behavior does not necessarily happen with larger models, where the forward pass takes a longer time.
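For context on why the BatchNorm check works: a BatchNorm layer in training mode updates its running statistics from every batch it sees, so two copies of the same model end up with identical running means only if they were fed identical inputs. A minimal, self-contained illustration (a hypothetical example, separate from the repro above):

import torch

torch.manual_seed(0)
bn1 = torch.nn.BatchNorm2d(3)  # fresh BN layers start with identical running stats
bn2 = torch.nn.BatchNorm2d(3)

x = torch.randn(8, 3, 4, 4)
bn1(x)
bn2(x)  # same input -> running means stay identical
assert bn1.running_mean.allclose(bn2.running_mean)

bn1(torch.randn(8, 3, 4, 4))
bn2(torch.randn(8, 3, 4, 4))  # different inputs -> running means diverge
assert not bn1.running_mean.allclose(bn2.running_mean)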
@andrewilyas this might be a bug similar to the one we had on our SAGA solver
I've updated the code:
from copy import deepcopy

import numpy as np
import torch
from ffcv import Loader
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import SimpleRGBImageDecoder
from ffcv.loader import OrderOption
from ffcv.transforms import RandomHorizontalFlip, ToTensor, ToDevice, ToTorchImage, NormalizeImage, Squeeze
from torchvision.models import resnet18
from tqdm import tqdm


def main():
    beton_path = 'cifar10_val.beton'  # Your FFCV .beton file here
    image_pipeline = [SimpleRGBImageDecoder(), RandomHorizontalFlip(), ToTensor(),
                      ToDevice(torch.device(0), non_blocking=True), ToTorchImage(),
                      NormalizeImage(mean=np.array([0.5, 0.5, 0.5]) * 255, std=np.array([1., 1., 1.]) * 255,
                                     type=np.float32)]
    label_pipeline = [IntDecoder(), ToTensor(), Squeeze(), ToDevice(torch.device(0), non_blocking=True)]
    loader = Loader(beton_path, batch_size=128, num_workers=4, order=OrderOption.SEQUENTIAL, os_cache=True,
                    drop_last=False, pipelines={'image': image_pipeline, 'label': label_pipeline})

    model1 = resnet18(pretrained=False).cuda()
    model2 = deepcopy(model1)

    with torch.no_grad():
        for it, (imgs, *_) in enumerate(tqdm(loader)):
            model1(imgs)
            model2(imgs)
            # torch.cuda.synchronize()
            # breakpoint()
            if it == 20:
                break

    assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)


if __name__ == "__main__":
    main()
Other things to note:
- Adding torch.cuda.synchronize() anywhere within the loop will NOT trigger the assertion error on the last line.
- Adding a breakpoint within the loop and stepping through each iteration slowly will also NOT trigger the assertion error on the last line.
- Adding an assertion on the outputs of the two models, for example out1 = model1(imgs); out2 = model2(imgs); assert out1.allclose(out2), will NOT fail, nor will the assertion on the last line (see the sketch after this list).
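For reference, the output-comparison variant from the last bullet would look roughly like this (a sketch; note that allclose returns a Python bool, so calling it forces the queued GPU work to complete, which may be why nothing fails once it is added):

with torch.no_grad():
    for it, (imgs, *_) in enumerate(tqdm(loader)):
        out1 = model1(imgs)
        out2 = model2(imgs)
        # Comparing the outputs directly never fails here, and with this check in
        # place the running-mean assertion after the loop passes as well.
        assert out1.allclose(out2)
        if it == 20:
            break

assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)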
I think that was actually fixed in 0.0.4 (at least the failing test was). Could this have something to do with the cuDNN simultaneous batch norm bug? Is there non-BN-based evidence of the images being different?
@andrewilyas I am not aware of the cuDNN simultaneous batch norm bug - could you link me to a reference? If it is a BN-related bug, what I don't understand is why adding synchronization points anywhere in the loop fixes the issue (the assertion passes), and why the issue is not reproducible with a PyTorch native dataloader.
I actually discovered this bug while testing code for something else, where I was comparing feature representations between two identical models with identical inputs (as a sanity check). In that scenario, the metric should return ones on the diagonal, but it did not until I explicitly added a call to torch.cuda.synchronize().
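One possible way to gather non-BN-based evidence without adding a synchronization inside the loop (a sketch only; the two checksum points merely bracket the forward passes, so this is a heuristic) would be to checksum each batch before and after the forward passes, keep the differences on the GPU, and only inspect them after the loop:

diffs = []
with torch.no_grad():
    for it, (imgs, *_) in enumerate(tqdm(loader)):
        before = imgs.double().sum()  # checksum queued before the forward passes
        model1(imgs)
        model2(imgs)
        after = imgs.double().sum()   # checksum queued after the forward passes
        diffs.append(after - before)  # stays on the GPU, no sync here
        if it == 20:
            break

# Only now transfer to the host (this synchronizes, after the loop is done).
diffs = torch.stack(diffs).cpu()
print(diffs)  # non-zero entries mean the batch contents changed mid-iteration
assert (diffs == 0).all(), "some batches were overwritten between the two checksums"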
@numpee did you experience this with 0.0.4? I'm going to try to reproduce it on my workstation on v1.0.0 now.
Hi @numpee, and thank you for this invaluable test case. I can 100% confirm that there is an issue and that FFCV is at fault. I'll have to investigate.
Here is the strictly minimal version of the test that makes it fail 100% of the time on my 3090:
import torch
from ffcv import Loader
from ffcv.fields.rgb_image import SimpleRGBImageDecoder
from ffcv.transforms import ToTensor, ToDevice, ToTorchImage, Convert
from torchvision.models import resnet18
from tqdm import tqdm


def main():
    beton_path = 'cifar_train.beton'  # Your FFCV .beton file here
    image_pipeline = [SimpleRGBImageDecoder(), ToTensor(),
                      ToDevice(torch.device(0), non_blocking=False), ToTorchImage(),
                      Convert(torch.float32),
                      ]
    loader = Loader(beton_path, batch_size=512, num_workers=1,
                    pipelines={'image': image_pipeline, 'label': None})

    model1 = resnet18(pretrained=False).cuda()
    model2 = resnet18(pretrained=False).cuda()
    model2.load_state_dict(model1.state_dict())

    while True:
        with torch.no_grad():
            for it, (imgs,) in enumerate(tqdm(loader)):
                # imgs = imgs.clone()
                model1(imgs)
                model2(imgs)
                # torch.cuda.synchronize()
                # breakpoint()
                if it == 2:  # 1 works sometimes but not 100% of the time on my GPU
                    break
        assert model1.bn1.running_mean.allclose(model2.bn1.running_mean)


if __name__ == "__main__":
    main()
So what I suspect is happening is that FFCV is overwriting the current imgs with a future batch. The crazy thing is that we already encountered that bug and the code should be fine now. I guess the next move is to run a profiler and look at how the operations actually unfold on the GPU itself.
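If that suspicion is right, a stopgap along the lines of the commented-out imgs = imgs.clone() above would be to copy each batch out of the loader-owned buffer before launching the forward passes. Whether this reliably avoids the race is not established here (the clone itself is queued on the same stream), so treat it as a mitigation to experiment with, not a fix:

with torch.no_grad():
    for it, (imgs,) in enumerate(tqdm(loader)):
        imgs = imgs.clone()  # detach the batch from the buffer the loader may recycle
        model1(imgs)
        model2(imgs)
        if it == 2:
            break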
@GuillaumeLeclerc, thanks for looking into this. It does seem to be a very weird bug. Please let me know if there is any way I can help.
Well... could this be related to a problem you have had with torch.nn.SyncBatchNorm on DDP?
I have combined DDP with torch.nn.SyncBatchNorm to avoid a batch-norm bug.
As below, just wrap the model:

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

If this doesn't address the issue, I am sorry.
@ByungKwanLee the tests above were run on a single GPU, so I don't think SyncBN applies in this case.
Is this still an issue?
@numpee Can you confirm that this is fixed on the v1.0.0 branch?
@andrewilyas @GuillaumeLeclerc Just tested the code - this seems to be fixed in the fix-198 branch. It doesn't seem to have been merged into the v1.0.0 branch yet.
Thanks for the fix! I'm surprised the fix was just a single line of code.
Sorry for the delay on this, I'll merge this into 1.0.0 now.