
Image semantic segmentation example with tf.data object on multi gpu training

Open PurvangL opened this issue 3 years ago • 45 comments

Hello, I am training an image semantic segmentation network on multiple GPUs (4 GPUs). Currently my data pipeline uses tf.data.experimental_distribute_dataset from TensorFlow together with the mirrored strategy. I want to use the DALI library for faster performance. I have checked out some examples, but none of them helped me. Below is the code that I have added, but it is not working. Any suggestion would be helpful.

Directory Structure:

|-- Root
|  |-- train
|  |  |-- images (png images)
|  |  |-- labels   (png label segmentation mask)
|  |-- valid
|  |  |-- images
|  |  |-- labels
|  |-- test
|  |  |-- images
|  |  |-- labels

I have added code below in my example.

@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    pngs = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "images"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)
    labels = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "labels"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)

    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)

    # images = fn.crop_mirror_normalize(
    #     images, dtype=types.FLOAT, std=[255.], output_layout="CHW")

    # How can I apply crop without normalization to the labels?
   
    return images, labels


shapes = (
    (args.batch_size, args.crop_height, args.crop_width),
    (args.batch_size))

dtypes = (
    tf.float32,
    tf.int32)


def dataset_fn(input_context):
    with tf.device("/gpu:{}".format(input_context.input_pipeline_id)):
        device_id = input_context.input_pipeline_id
        return dali_tf.DALIDataset(
            pipeline=data_pipeline(
                device_id=device_id, shard_id=device_id),
            batch_size=args.batch_size,
            output_shapes=shapes,
            output_dtypes=dtypes,
            device_id=device_id)


input_options = tf.distribute.InputOptions(
    experimental_place_dataset_on_device=True,
    experimental_fetch_to_device=False,
    experimental_replication_mode=tf.distribute.InputReplicationMode.PER_REPLICA)



strategy = tf.distribute.MirroredStrategy(
    [device.name[-5:] for device in tf.config.experimental.list_physical_devices('GPU')][:args.num_gpus])

train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
val_batches = strategy.distribute_datasets_from_function(dataset_fn, input_options)

Please let me know if any additional information is needed. Thank you

PurvangL avatar Aug 18 '22 16:08 PurvangL

Hey @PurvangL, could you elaborate on how you want to crop those images/labels? Cropping without normalizing can be achieved with the fn.crop operator. It has the same set of crop-related arguments as fn.crop_mirror_normalize. I guess you want to provide the crop argument as [args.crop_height, args.crop_width]. If you want to randomize the position of the cropping, you can use the fn.random.uniform operator and pass its outputs to the crop_pos_x and crop_pos_y arguments of fn.crop_mirror_normalize and fn.crop.
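
For illustration, a minimal sketch of how that could look inside your pipeline, after the decoders (assuming the args.crop_height/args.crop_width values from your script); feeding the same random positions to both crops keeps the image and its mask aligned:

    crop_pos_x = fn.random.uniform(range=(0.0, 1.0))
    crop_pos_y = fn.random.uniform(range=(0.0, 1.0))

    # same crop window for the image and its segmentation mask
    images = fn.crop(images, crop=[args.crop_height, args.crop_width],
                     crop_pos_x=crop_pos_x, crop_pos_y=crop_pos_y)
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width],
                     crop_pos_x=crop_pos_x, crop_pos_y=crop_pos_y)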

banasraf avatar Aug 22 '22 08:08 banasraf

Hi @banasraf Thank you for your reply. I have modified the function as below to apply cropping to images and labels.

@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    pngs = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "images"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)
    labels = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "labels"), random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)

    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)

    # images = fn.crop_mirror_normalize(
    #     images, dtype=types.FLOAT, std=[255.], output_layout="CHW")
    images = fn.crop(images, crop=[args.crop_height, args.crop_width])
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width])
    # return images, labels.gpu()
    return images, labels

I am getting the following error when I run the program.

Traceback (most recent call last):
  File "distributed_training/distributed_training_dali.py", line 344, in <module>
    train_dataset = strategy.distribute_datasets_from_function(dataset_fn, input_options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1160, in distribute_datasets_from_function
    dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 597, in _distribute_datasets_from_function
    options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 168, in get_distributed_datasets_from_function
    input_contexts, dataset_fn, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 1566, in __init__
    input_contexts, self._input_workers, dataset_fn))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/input_lib.py", line 2301, in _create_datasets_from_function_with_input_context
    dataset = dataset_fn(ctx)
  File "distributed_training/distributed_training_dali.py", line 328, in dataset_fn
    device_id=device_id)
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/plugin/tf.py", line 768, in __init__
    if _has_external_source(pipeline):
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/external_source.py", line 787, in _has_external_source
    pipeline._build_graph()
  File "/usr/local/lib/python3.6/dist-packages/nvidia/dali/pipeline.py", line 615, in _build_graph
    raise TypeError(f"Illegal pipeline output type. The output {i} contains a nested "
TypeError: Illegal pipeline output type. The output 0 contains a nested `DataNode`. Missing list/tuple expansion (*) is the likely cause.

Could you please let me know how I can fix this? Thank you

PurvangL avatar Aug 22 '22 19:08 PurvangL

@banasraf Is there any additional information that I need to provide? Please let me know. Thank you

PurvangL avatar Aug 24 '22 17:08 PurvangL

@PurvangL Okay, I managed to find the issue. The caffe2 reader returns files and labels by default, so the output of the operator is a pair (files, labels). This means that in your case images is a pair of DataNodes. I guess that in your case you don't have any numeric labels for your images, so you can pass label_type=4 to the reader (see the documentation of the operator). This way the operator will return only the images.
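
For reference, a sketch of how the reader call could look with that change (keeping the rest of your arguments as they are):

    pngs = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "images"),
        label_type=4,  # NO_LABEL: the reader returns only the encoded images
        random_shuffle=True, shard_id=shard_id, num_shards=args.num_gpus)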

banasraf avatar Aug 25 '22 08:08 banasraf

@PurvangL Also, please make sure that the order of labels and images is the same between those two LMDB files. The two readers have the same random seed in your case, so they should generate the same order with random_shuffle, but only assuming the same source order in the datasets.
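
One way to make this robust is to set the seed explicitly on both readers. This is only a sketch with a hypothetical SEED constant (label_type=4 included per the previous comment):

    SEED = 1234  # hypothetical; any fixed value shared by both readers

    pngs = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "images"), label_type=4,
        random_shuffle=True, seed=SEED, shard_id=shard_id, num_shards=args.num_gpus)
    labels = fn.readers.caffe2(
        path=os.path.join(args.dataset, "train", "labels"), label_type=4,
        random_shuffle=True, seed=SEED, shard_id=shard_id, num_shards=args.num_gpus)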

banasraf avatar Aug 25 '22 11:08 banasraf

@banasraf Thank you for your answer. This is the simplest example I am trying to get working. After adding your suggestions, I still have issues running it without any error. Below is the link to the example I am trying to run.

The tf_dali.py file contains all the information, including the dependencies to install and the command to run.

https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing

Please let me know if any additional information is needed.

Regards, Purvang

PurvangL avatar Aug 25 '22 17:08 PurvangL

Hi @PurvangL,

I think the file reader operator is the one you are looking for:

@pipeline_def(batch_size=args.batch_size)
def data_pipeline(shard_id):
    images_path = os.path.join(args.dataset, "train", "images")
    labels_path = os.path.join(args.dataset, "train", "labels")
    images_files = os.listdir(images_path)
    labels_files = os.listdir(labels_path)
    images_files = [os.path.join(images_path, f) for f in images_files]
    labels_files = [os.path.join(labels_path, f) for f in labels_files]
    pngs, _ = fn.readers.file(
        files=images_files,
        random_shuffle=True,
        shard_id=shard_id,
        num_shards=args.num_gpus,
        seed=SEED)

    labels, _ = fn.readers.file(
        files=labels_files,
        random_shuffle=True,
        shard_id=shard_id,
        num_shards=args.num_gpus,
        seed=SEED)

    images = fn.decoders.image(pngs, device='mixed', output_type=types.RGB)
    labels = fn.decoders.image(labels, device='mixed', output_type=types.RGB)

    images = fn.crop(images, crop=[args.crop_height, args.crop_width])
    labels = fn.crop(labels, crop=[args.crop_height, args.crop_width])

    return images, labels

Just make sure that the order of files inside images_files and labels_files matches.
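
One simple way to enforce that, assuming the image and label files share the same file names in both directories, is to sort both listings before joining the paths:

    images_files = sorted(os.listdir(images_path))
    labels_files = sorted(os.listdir(labels_path))
    images_files = [os.path.join(images_path, f) for f in images_files]
    labels_files = [os.path.join(labels_path, f) for f in labels_files]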

JanuszL avatar Aug 31 '22 06:08 JanuszL

Thank you @JanuszL. I was able to start training with these changes, but there are two issues I am having difficulty with.

  1. Training time: after adding the DALI pipeline and training on multiple GPUs, each iteration and the overall training time increased about 5x compared to the tf.data.Dataset input pipeline. What could be wrong here?
  2. Batch size: with the tf.data.Dataset input pipeline, I was able to feed a global batch size of 162 for 2 GPUs. Now I can only use a batch size of 6, and if I increase it, it throws OOM. Is the batch size of 6 local, i.e. per GPU, making the global batch size 12?

Thank you Purvang

PurvangL avatar Sep 01 '22 15:09 PurvangL

Hi @PurvangL,

  1. I would check the epoch length. Maybe there is some issue with setting num_shards and shard_id in DALI, so that each GPU goes over the whole data set instead of only the shard assigned to it.
  2. The batch size you set in DALI is per GPU. It is expected that moving data processing to the GPU will limit the memory available for the model, hence you may want to reduce the batch size; however, the overall throughput may still increase in some cases.

If you suspect something is wrong with the overall performance, you may check GPU utilization first and then capture a profile.

JanuszL avatar Sep 01 '22 16:09 JanuszL

@JanuszL Thanks for your reply. Using DALI, there are 59 iterations to complete each epoch when the batch size is 50 (assuming it is per GPU, based on your last message), so the global batch size is 400 and 8 GPUs are used. Using tf.data.Dataset with a batch size of 400, which is the global batch size (per-GPU batch size: 50), only 7 iterations were needed to complete one epoch. How can I correct this?

PurvangL avatar Sep 01 '22 18:09 PurvangL

@PurvangL,

Please make sure that shard_id and num_shards are set accordingly for each GPU. If you can share your script or some minimal repro that just iterates over the data set, I can check it.

JanuszL avatar Sep 01 '22 19:09 JanuszL

Sure @JanuszL

https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing

This link has both the script and the dataset. Please let me know if anything is missing. Thank you

PurvangL avatar Sep 01 '22 19:09 PurvangL

@PurvangL,

I think you need to adjust steps_per_epoch based on the batch size per GPU and the number of GPUs.
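
For example, a sketch of that adjustment (num_train_samples, args.epochs, and model are placeholders for the corresponding values in your script; the batch_size passed to each DALI pipeline is per GPU):

global_batch_size = args.batch_size * args.num_gpus  # DALI batch_size is per GPU
steps_per_epoch = num_train_samples // global_batch_size

model.fit(train_dataset,
          epochs=args.epochs,
          steps_per_epoch=steps_per_epoch)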

JanuszL avatar Sep 02 '22 11:09 JanuszL

Thank you for your reply @JanuszL. With this change, I was able to get the correct number of iterations per epoch. I am using 2 x A100 40GB NVIDIA GPUs, and with the DALI pipeline I am only able to use a per-GPU batch size of 4 (global batch size 8), whereas with tf.data I can use a global batch size of 421, and if the dataset were bigger, I could go up to 2600 at an image resolution of (128, 128).

My question is: why is there such a big difference in the maximum batch size when using DALI?

In the Google Drive link below I have uploaded my test results for the CamVid dataset comparing the DALI and tf.data pipelines. DALI still seems a bit slower, and I see a difference in val_mean_iou as well. What could be the reason?

https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing

Please let me know if any additional information needed.

Thank you

PurvangL avatar Sep 02 '22 16:09 PurvangL

Maybe TensorFlow competes with DALI for the GPU memory. Can you try limiting TF to use only as much memory as it needs?

JanuszL avatar Sep 02 '22 17:09 JanuszL

@JanuszL Thank you for your reply.

After trial and error, I have limited each GPU's memory for TensorFlow to 4GB out of 40GB when using the DALI data loading pipeline. The maximum batch size I am able to run with the tf.data pipeline is 650 (global batch size) on 4 x A100 40GB GPUs, and the maximum batch size with the DALI pipeline is 64 (global batch size) on the same hardware.

Also, when I check GPU memory usage using "nvidia-smi", the DALI pipeline uses 6867 MiB out of 40GB on each GPU, while the tf.data pipeline uses 39739 MiB out of 40GB. Below I have attached my test results.

[image: test results]

Is there anything that I am missing, or do you have any suggestions for improving training time?

Thank you

PurvangL avatar Sep 06 '22 20:09 PurvangL

Hi @PurvangL,

I added:

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

at the very beginning of the training script. I can run batch_size 128 with 12GB of GPU memory. Can you try this out on your side?

JanuszL avatar Sep 07 '22 22:09 JanuszL

Thank you @JanuszL. First of all, regarding your previous reply about TensorFlow competing for GPU memory,

tf.config.experimental.set_memory_growth(gpu, True)

was already present, and I changed it to limit the GPU memory for each GPU using

tf.config.set_logical_device_configuration(gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=4000)])

I was also able to fit a global batch size of 300 using DALI (150 per GPU, 2 GPUs in total) training on the Cityscapes dataset, and the training time for 5 epochs is 2m6.087s. On the other hand, I was able to use a global batch size of 700 with tf.data, and the training time is 1m31.264s. I am trying to understand why training with DALI data loading is slower. What improvements do I need to make to train faster?

Thank you

PurvangL avatar Sep 08 '22 02:09 PurvangL

Hi @PurvangL,

Based on https://github.com/NVIDIA/DALI/issues/4179#issuecomment-1238613846 I thought that you could not use a batch size greater than 16. The thing you may want to check is this blog post, so you can learn whether your training is really limited by data processing. Also, your data sets consist of PNG images and DALI doesn't provide GPU acceleration for them (we use nvJPEG and nvJPEG2000 to GPU-accelerate JPEG and JPEG2000 decoding, but there is no GPU-accelerated PNG decoder). So maybe in your case the GPU is already loaded with work and even moving the crop operation there doesn't provide much value. Also, did you use the same number of CPU threads in DALI and tf.data (so that we know we are comparing apples to apples here)?
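
As a rough check of how much time the data processing alone takes, you could also time the DALI pipeline by itself, without the model. This is only a sketch, reusing the data_pipeline definition from earlier in this thread; the thread count is just an example:

import time

pipe = data_pipeline(shard_id=0, device_id=0, num_threads=8)
pipe.build()

iters = 100
start = time.time()
for _ in range(iters):
    pipe.run()
elapsed = time.time() - start
print("data pipeline alone: {:.1f} images/s".format(iters * args.batch_size / elapsed))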

JanuszL avatar Sep 08 '22 08:09 JanuszL

@JanuszL Thank you for the answer. Point noted regarding the image format.

https://drive.google.com/drive/folders/1iYk1wIRupy34z6bixHdErMX8859lF-UD?usp=sharing

This gdrive location contains both scripts, one with DALI and other one with tf.data.

My goal is to see whether using DALI can reduce training time or not.

Note: I have duplicated the images to test with a higher batch size, as my main goal is to reduce the training time. I have not updated the Google Drive dataset with the duplicated images; it still contains ~421 training images.

Please let me know if anything is needed from my end.

PurvangL avatar Sep 08 '22 18:09 PurvangL

Hi @PurvangL,

Can you capture the profile according to this guide and see what the timeline looks like?

JanuszL avatar Sep 09 '22 07:09 JanuszL

Hi @JanuszL I have rerun both scripts with profiling and added the profiling data to the Google Drive link mentioned above. Could you please check it out and share your analysis, along with the recommended next steps?

Thank you

PurvangL avatar Sep 09 '22 17:09 PurvangL

@PurvangL,

The main difference I see is that TF uses all available CPU cores while DALI sticks to the default value in your case. You can use num_threads in the DALIDataset to assign more threads to DALI and speed up the PNG decoding (which is CPU bound). The rest of the processing is almost negligible, and there is little speedup from moving it to the GPU (in your case it is only the crop).
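
A sketch of how that could look in the dataset_fn from earlier in the thread (the value 8 is only an example; the best setting depends on your CPU core count):

def dataset_fn(input_context):
    with tf.device("/gpu:{}".format(input_context.input_pipeline_id)):
        device_id = input_context.input_pipeline_id
        return dali_tf.DALIDataset(
            pipeline=data_pipeline(device_id=device_id, shard_id=device_id),
            batch_size=args.batch_size,
            output_shapes=shapes,
            output_dtypes=dtypes,
            num_threads=8,  # give DALI more CPU threads for the CPU-bound PNG decoding
            device_id=device_id)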

JanuszL avatar Sep 09 '22 18:09 JanuszL

@JanuszL Indeed, that solved the problem. So in summary, I was trying to compare:

  1. a 4xA100 (80GB each) PCIe machine
  2. a 4xA100 (80GB each) SXM4 machine

in terms of training time and inference time. Using the MLPerf submissions by NVIDIA, the SXM machine was definitely faster, but when I used other algorithms not submitted to MLPerf, the result was the other way around. Analyzing the profiling results, the SXM machine was taking more time for kernel launches (TF profiling) and memcpy operations (ncu profiling) (any reason why?), which I felt could be the reason. Integrating DALI into the algorithm did help and made training on the SXM machine faster. Are there any other tricks or features you recommend to use the SXM machine to its full potential?

related issue:

https://forums.developer.nvidia.com/t/file-not-found-cuvectoraddmulti-exe/227440/2

Thank you

PurvangL avatar Sep 13 '22 23:09 PurvangL

Hi @PurvangL,

Can you check if the CPU and GPU clocks are set the same way on both machines? For example, this GPU talk can provide some guidance.

JanuszL avatar Sep 14 '22 00:09 JanuszL

@JanuszL Thank you for the reply, I will. Meanwhile, I am facing a new problem after integrating DALI: I am getting poor results (both in training time and mean_iou) when increasing the batch size or the number of GPUs for training. Below are my test results. I am linearly scaling the learning rate with the number of GPUs.

[image: test results]

What could be the reason?

Thanks

PurvangL avatar Sep 21 '22 19:09 PurvangL

Hi @PurvangL,

Can you confirm that the number of iterations decreases with the increase of the batch size or the number of GPUs? Can you also do a dry run to make sure that all dataset samples are returned during one epoch?

JanuszL avatar Sep 22 '22 05:09 JanuszL

Hi @JanuszL Yes, I confirm that with an increasing batch size and number of GPUs, the number of iterations does reduce linearly. I have two questions here. When I use num_threads = cpu_count():

  1. On the same machine, increasing the batch size along with the number of GPUs takes more time for the same dataset and the same number of epochs. I confirm the number of iterations reduces linearly with the increase in batch size and GPUs.
  2. A machine with 2x the GPU memory and double the batch size takes more time.

Below I have attached my results.

[image: test results]

When I use num_threads = 64, the results do change, but 8 GPUs are still a concern.

[image: test results with num_threads = 64]

Do I need to find the optimal num_threads?

Thank you

PurvangL avatar Sep 23 '22 23:09 PurvangL

Hi @PurvangL,

Can you check what the iteration time is? How does it change with the number of GPUs (I would expect it to remain similar)? Can you also tell me what is in the Total_Epochs column? I would expect the number of epochs (the number of times you train over the whole data set) to remain the same regardless of the number of GPUs used and the batch size.

JanuszL avatar Sep 26 '22 08:09 JanuszL

Hi @JanuszL. Thank you for your reply. I am currently facing an issue where training takes longer when increasing the number of GPUs and the batch size on the 8xA100 80G server. Except for this, all servers (PCIe, SXM) are getting results as expected.

[image: test results]

Iteration time:
2 GPUs: 7s
4 GPUs: 5s
8 GPUs: 5s

Shouldn't the iteration time for 8 GPUs be less than for 4 GPUs?

PurvangL avatar Sep 27 '22 16:09 PurvangL