DALI
Running a DALI pipeline reader on CPU vs. GPU gives inconsistent training results
Hello, dear @JanuszL @klecki. I'm running a segmentation model using DALI readers in CPU mode and in GPU mode. It turns out that using the CPU reader gives inconsistent results compared with using the readers on GPU.
Set up:
- reader CPU: batch size 16, num_threads=2, original_ratio=True, no augments, enetV1B1, all the dali ops are the same
- reader GPU: batch size 16, num_threads=2, original_ratio=True, no augments, enetV1B1, all the dali ops are the same
Question: the jobs run with the CPU reader get a much worse loss than those with the GPU reader. For now, the GPU result is correct. I would like to ask whether DALI ops are mostly optimized for GPU while not all ops are optimized for CPU, which could cause these inconsistent results?
Detailed TensorBoard runs as shown below:
- seg_ratio_cpu/exp/train (the pipeline reader returns a float32 numpy array)
- seg_ratio_cpu_tensorfix/exp/train (the pipeline reader returns a float32 tf tensor converted from the numpy array)
- seg_ratio_gpu/exp/train (the pipeline is fed into dali_tf.DALIDataset)

Charts compared: train_losses/total_loss, train_metrics/MeanIoU, val_loss, valid_metrics/MeanIoU.
Training seems consistent for the first few batches, but later diverges.
CPU and GPU operators should always produce mathematically equivalent results. If you can share the code for the DALI pipeline you are using (both CPU and GPU), I could take a look in case there's a bug.
@mayujie,
If you cannot share your code, I would disable randomness, or set the same seed for the CPU and GPU pipelines, and compare the results to see whether the discrepancy comes from the DALI pipeline or from something else.
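To illustrate the suggestion with a minimal NumPy stand-in (since the actual pipeline code isn't available): two generators driven by the same seed produce identical random streams, so once the seeds are fixed, any remaining CPU/GPU difference must come from outside the random-number generation. The `random_stream` helper below is hypothetical, not a DALI API.

```python
import numpy as np

# Hypothetical stand-in for a pipeline's random augmentation stream;
# with DALI you would instead pass the same `seed=` to both Pipeline
# constructors (CPU and GPU) and compare their outputs.
def random_stream(seed, n_batches=5):
    rng = np.random.default_rng(seed)
    return [rng.random(4) for _ in range(n_batches)]

cpu_like = random_stream(seed=42)
gpu_like = random_stream(seed=42)

# Same seed -> identical streams; if training still diverges with
# fixed seeds, the discrepancy is not in the random parameters.
matches = all(np.array_equal(a, b) for a, b in zip(cpu_like, gpu_like))
```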
Thanks for the reply, @JanuszL @jantonguirao. Sorry, I cannot share the code. But what I noticed:
- my custom pipeline inherits from pipeline.Pipeline
- I was using the default seed equal to -1:
super().__init__(batch_size, num_threads, device_id=dali_devices.device_id, seed=seed)
So the CPU and GPU pipelines are random. I will set the same seed and run the two jobs again.
Besides, I saw that my file reader op was using seed=1 by default. Should this be set to the same seed as the one passed to the pipeline init?
image_reader = ops.FileReader(
seed=1,
files=self.images_list,
shuffle_after_epoch=shuffle_after_epoch,
)
You can set a random seed for both the pipeline and an individual operator. Those seeds don't need to match. When you set it for the whole pipeline, this seed is used to generate seeds for each operator. When you set it for one operator, you are overriding the seed that was generated. Overriding individual operator seeds is useful when you want to compare two different pipelines but want a particular operator to have predictable results. In your case, setting the seed on FileReader is a good thing, because it can allow you to compare your CPU and GPU pipelines.
What I suggest is that you take the two pipelines and compare the outputs, adding one element to the pipeline at a time, so that you can see where your source of error is.
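That incremental comparison can be sketched with a small helper (hypothetical, not part of DALI): pull the same number of batches from each pipeline variant as NumPy arrays and report the first batch where they diverge beyond a tolerance.

```python
import numpy as np

def first_divergence(batches_a, batches_b, atol=1e-5):
    """Return the index of the first pair of batches that differ in
    shape or by more than `atol`, or None if all pairs match."""
    for i, (a, b) in enumerate(zip(batches_a, batches_b)):
        if a.shape != b.shape or not np.allclose(a, b, atol=atol):
            return i
    return None
```

With DALI, the batches would come from `pipe.run()`; GPU outputs would need to be copied to host first (e.g. via `.as_cpu()`) and converted to NumPy arrays before comparing, assuming the outputs are dense tensor lists.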
Sure, thank you very much. I will set the two pipelines with the same seed first and check the results.
Closing this issue now. If you still need help, please reopen.