
Running DALI pipeline reader on CPU and GPU gives inconsistent training results

Open mayujie opened this issue 2 years ago • 6 comments

Hello, Dear @JanuszL @klecki. I'm running a segmentation model using DALI readers in CPU mode and GPU mode. It turns out that using the readers in CPU mode gives inconsistent results compared with using the readers in GPU mode.

Set up:

  • reader CPU: batch size 16, num_threads=2, original_ratio=True, no augmentations, enetV1B1; all the DALI ops are the same
  • reader GPU: batch size 16, num_threads=2, original_ratio=True, no augmentations, enetV1B1; all the DALI ops are the same

Question: The jobs run with the CPU reader get much worse loss than those with the GPU reader. For now, the GPU result is the correct one. I would like to ask: are DALI ops mostly optimized for GPU, while not all ops are optimized for CPU, and could that cause these inconsistent results?

Detailed TensorBoard curves are shown below:

  • seg_ratio_cpu/exp/train (the pipeline reader returns a float32 numpy array)
  • seg_ratio_cpu_tensorfix/exp/train (the pipeline reader returns a float32 tf tensor converted from a numpy array)
  • seg_ratio_gpu/exp/train (the pipeline is fed into dali_tf.DALIDataset)

train_losses/total_loss: [image]

train_metrics/MeanIoU: [image]

val_loss: [image]

valid_metrics/MeanIoU: [image]

mayujie avatar Jun 08 '22 12:06 mayujie

Training seems consistent for the first few batches, but later the curves diverge. [image]

mayujie avatar Jun 08 '22 12:06 mayujie

CPU and GPU operators should always produce mathematically equivalent results. If you can share the code for the DALI pipeline you are using (both CPU and GPU), I could take a look in case there's a bug.

jantonguirao avatar Jun 13 '22 07:06 jantonguirao

@mayujie,

If you cannot share your code, I would disable randomness or set the same seed for the CPU and GPU pipelines, and then compare the results to see whether the discrepancy comes from the DALI pipeline or from something else.
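
For illustration, a minimal sketch of such a same-seed comparison, assuming a simple decode-and-resize pipeline (the file list, sizes, and seed values below are placeholders, not taken from this issue):

    import numpy as np
    from nvidia.dali import pipeline_def, fn

    IMAGES = ["sample0.jpg", "sample1.jpg"]  # hypothetical file list

    @pipeline_def(batch_size=16, num_threads=2, device_id=0, seed=42)
    def seg_pipeline(device):
        jpegs, _ = fn.readers.file(files=IMAGES, seed=1)
        # "mixed" decodes on the GPU; "cpu" keeps everything on the host
        images = fn.decoders.image(
            jpegs, device="mixed" if device == "gpu" else "cpu")
        return fn.resize(images, resize_x=512, resize_y=512)

    cpu_pipe = seg_pipeline(device="cpu")
    gpu_pipe = seg_pipeline(device="gpu")
    cpu_pipe.build()
    gpu_pipe.build()

    (cpu_out,) = cpu_pipe.run()
    (gpu_out,) = gpu_pipe.run()
    gpu_host = gpu_out.as_cpu()  # copy the GPU batch back to host memory
    for i in range(16):
        a = np.asarray(cpu_out[i], dtype=np.float32)
        b = np.asarray(gpu_host[i], dtype=np.float32)
        # tiny per-pixel differences can come from different decode/resize
        # backends; large or growing differences point at a diverging op
        print(i, np.abs(a - b).max())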

JanuszL avatar Jun 13 '22 07:06 JanuszL

Thanks for the reply, @JanuszL @jantonguirao. Sorry, I cannot share the code. But here is what I noticed:

My custom pipeline inherits from pipeline.Pipeline, and I was using the default seed of -1:

    super().__init__(batch_size, num_threads, device_id=dali_devices.device_id, seed=seed)

So the CPU and GPU pipelines are random. I will set the same seed and run the two jobs again.

Besides, I saw my file reader op was using seed=1 by default; should this be set to the same seed as the pipeline's __init__ as well?

        image_reader = ops.FileReader(
            seed=1,  # fixed seed: the shuffle order is reproducible across runs
            files=self.images_list,
            shuffle_after_epoch=shuffle_after_epoch,
        )
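
For reference, in the newer fn API the same reader could be sketched as follows (IMAGES_LIST is a hypothetical stand-in for self.images_list):

    from nvidia.dali import pipeline_def, fn

    IMAGES_LIST = ["a.jpg", "b.jpg"]  # stand-in for self.images_list

    @pipeline_def(batch_size=16, num_threads=2, device_id=0)
    def reader_pipeline():
        jpegs, labels = fn.readers.file(
            files=IMAGES_LIST,
            shuffle_after_epoch=True,
            seed=1,  # fixed seed, so the shuffle order is reproducible
        )
        return jpegs, labels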

mayujie avatar Jun 13 '22 09:06 mayujie

You can set a random seed for both the pipeline and an individual operator, and those seeds don't need to match. When you set it for the whole pipeline, that seed is used to generate seeds for each operator. When you set it for one operator, you override the generated seed. Overriding individual operator seeds is useful when you want to compare two different pipelines but want a particular operator to produce predictable results. In your case, setting the seed on FileReader is a good thing because it allows you to compare your CPU and GPU pipelines.
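
To make the two seeding levels concrete, a minimal sketch (the operators and values are arbitrary, chosen just for illustration):

    from nvidia.dali import pipeline_def, fn

    @pipeline_def(batch_size=8, num_threads=2, device_id=0, seed=42)
    def random_pipeline():
        # The pipeline seed (42) generates a per-operator seed for this op,
        # so it is reproducible across runs without its own seed argument.
        u = fn.random.uniform(range=[0.0, 1.0])
        # An explicit operator seed overrides the generated one, so this op
        # behaves the same even in pipelines built with different seeds.
        c = fn.random.coin_flip(probability=0.5, seed=1234)
        return u, c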

What I suggest is that you take the two pipelines and compare the outputs, adding one element to the pipeline at a time, so that you can pinpoint the source of the error.
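
A sketch of that incremental bisection, assuming hypothetical stages (decode, then resize, then normalization) on top of a fixed-seed reader:

    import numpy as np
    from nvidia.dali import pipeline_def, fn, types

    @pipeline_def(batch_size=16, num_threads=2, device_id=0, seed=42)
    def staged_pipeline(device, n_stages):
        jpegs, _ = fn.readers.file(files=["a.jpg", "b.jpg"], seed=1)
        out = fn.decoders.image(
            jpegs, device="mixed" if device == "gpu" else "cpu")
        if n_stages >= 2:
            out = fn.resize(out, resize_x=512, resize_y=512)
        if n_stages >= 3:
            out = fn.crop_mirror_normalize(out, dtype=types.FLOAT)
        return out

    # Grow the pipeline one stage at a time; the first stage where CPU and
    # GPU outputs stop matching is the suspect.
    for n in (1, 2, 3):
        cpu_pipe = staged_pipeline(device="cpu", n_stages=n)
        gpu_pipe = staged_pipeline(device="gpu", n_stages=n)
        cpu_pipe.build()
        gpu_pipe.build()
        a = np.asarray(cpu_pipe.run()[0][0], dtype=np.float32)
        b = np.asarray(gpu_pipe.run()[0].as_cpu()[0], dtype=np.float32)
        print(f"stages={n}  max abs diff on sample 0: {np.abs(a - b).max()}")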

jantonguirao avatar Jun 13 '22 09:06 jantonguirao

Sure, thank you very much. I will set both pipelines with the same seed first and check the results.

mayujie avatar Jun 13 '22 10:06 mayujie

Closing this issue now. If you still need help, please reopen.

jantonguirao avatar Feb 06 '23 10:02 jantonguirao