
Expected GPU utilization pattern for video decoding.

Open · ivandariojr opened this issue 1 year ago · 1 comment

Describe the question.

While optimizing a training pipeline, I observed a GPU utilization pattern suggesting that the DALI pipeline and the training code run sequentially rather than in parallel, even though the DALI documentation led me to expect them to overlap.

In the plot below you can see that the model is training when the GPU CUDA utilization (blue) spikes, but while the model is training, the GPU decoding utilization (green) drops to zero. Overall GPU decoding utilization also looks very low. Is there some way to keep the decoder running continuously so the downstream model doesn't stall between batches?

[Image: GPU utilization over time — CUDA utilization in blue, decoder utilization in green]

This is a very simple GPU video processing pipeline in DALI that decodes, resizes, and then pads videos. I am using this pipeline to train a downstream model. Here are some of the parameters used to configure the pipeline:

dataloader:
  num_devices: 8
  last_batch_policy: nvidia.dali.plugin.base_iterator.LastBatchPolicy.PARTIAL
  video_pipeline_partial:
    _target_: nvidia.dali.pipeline_def
    _partial_: true
    batch_size: 272
    num_threads: 8
    py_num_workers: 16
    exec_dynamic: true
  video_resize_fn:
    _target_: nvidia.dali.fn.readers.video_resize
    _partial_: true
    device: "gpu"
    sequence_length: 17
    max_size: 256
    resize_longer: 256
    file_list_include_preceding_frame: true
    prefetch_queue_depth: 4
    pad_sequences: false
    pad_last_batch: false
    random_shuffle: true
    stick_to_shard: true
    minibatch_size: 128
  pad_fn:
    _target_: nvidia.dali.fn.pad
    _partial_: true
    fill_value: 0
    axes: [1,2]
    shape:
      - 256
      - 256
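
For context, these entries follow Hydra's `_target_`/`_partial_` instantiate convention, so each block becomes a partially-applied callable. Roughly (a hypothetical plain-Python equivalent, with values copied from the config above):

import functools
import nvidia.dali as dali
import nvidia.dali.fn as fn

# Roughly what Hydra's _partial_ instantiation yields (illustrative only):
video_pipeline_partial = functools.partial(
    dali.pipeline_def, batch_size=272, num_threads=8,
    py_num_workers=16, exec_dynamic=True,
)
video_resize_fn = functools.partial(
    fn.readers.video_resize, device="gpu", sequence_length=17,
    max_size=256, resize_longer=256, file_list_include_preceding_frame=True,
    prefetch_queue_depth=4, pad_sequences=False, pad_last_batch=False,
    random_shuffle=True, stick_to_shard=True, minibatch_size=128,
)
pad_fn = functools.partial(fn.pad, fill_value=0, axes=[1, 2], shape=[256, 256])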

And here is the Python code where these callables are used:

# Assumed imports; `Dataset` is taken to be torch.utils.data.Dataset
# (or any class exposing a `.data_files` attribute, as used below).
from collections.abc import Callable
from pathlib import Path
from typing import Any

import torch
from nvidia.dali.plugin.base_iterator import LastBatchPolicy
from nvidia.dali.plugin.pytorch import DALIGenericIterator
from torch.utils.data import Dataset


class DaliVideoDataloader(DALIGenericIterator):
    """A DALI dataloader for video data."""

    def __init__(
        self,
        dataset: Dataset,
        video_pipeline_partial: Callable,
        video_resize_fn: Callable,
        pad_fn: Callable,
        device_id: int,
        num_devices: int,
        # DaliGenericIterator args
        size: int = -1,
        auto_reset: bool = False,
        last_batch_padded: bool = False,
        last_batch_policy: LastBatchPolicy = LastBatchPolicy.FILL,
        prepare_first_batch: bool = True,
    ) -> None:

        def video_pipeline(filenames: list[str | Path]) -> Any:
            video = video_resize_fn(filenames=filenames, name="Reader", num_shards=num_devices, shard_id=device_id)
            padded = pad_fn(video)
            return padded

        pipeline = video_pipeline_partial(fn=video_pipeline, device_id=device_id)
        pipe = pipeline(dataset.data_files)
        super().__init__(
            pipelines=[pipe],
            size=size,
            reader_name="Reader",
            auto_reset=auto_reset,
            last_batch_padded=last_batch_padded,
            last_batch_policy=last_batch_policy,
            prepare_first_batch=prepare_first_batch,
            output_map=["video"],
        )

    def __next__(self) -> torch.Tensor:  # pyright: ignore
        """Returns the next video tensor."""
        out = super().__next__()
        return out[0]["video"]
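
For completeness, one instance of this class is constructed per GPU from the partials above, roughly like this (a hypothetical sketch; the `dataset` object is assumed to expose a `data_files` attribute, as used in `__init__`):

# Hypothetical per-rank construction (device_id would come from the launcher):
loader = DaliVideoDataloader(
    dataset=dataset,
    video_pipeline_partial=video_pipeline_partial,
    video_resize_fn=video_resize_fn,
    pad_fn=pad_fn,
    device_id=0,                                  # this rank's GPU
    num_devices=8,                                # matches dataloader.num_devices
    last_batch_policy=LastBatchPolicy.PARTIAL,    # from the config above
)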

If this is expected behavior, that is fine, but I want to make sure there isn't a flag or misconfiguration causing this performance pattern.

Thanks for your help!

Check for duplicates

  • [x] I have searched the open bugs/issues and have found no duplicates for this bug report

ivandariojr · Dec 28 '24

Hi @ivandariojr,

Thank you for reaching out. The utilization plots you showed are a good place to start a more thorough analysis. I recommend capturing a profile with Nsight Systems to learn more details. It may be that a piece of CPU code in the training loop stalls the GPU work, and that DALI is not the bottleneck but simply provides data at the pace the training can consume it.
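
For example, wrapping the data fetch and the training step in NVTX ranges makes the overlap (or lack of it) easy to see on the Nsight Systems timeline; the run can then be captured with something like `nsys profile --trace=cuda,nvtx -o report python train.py`. This is only a sketch — `loader`, `model`, and `optimizer` are placeholders for your objects:

import torch

it = iter(loader)                             # placeholder: the DALI iterator
for step in range(200):                       # a short capture window is enough
    torch.cuda.nvtx.range_push("data_fetch")  # time spent waiting on DALI
    video = next(it)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("train_step")  # forward/backward/optimizer
    loss = model(video).mean()                # placeholder model and loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

If the timeline shows long `data_fetch` ranges with the decoder idle during `train_step`, DALI is the bottleneck; if the gaps sit inside `train_step` on the CPU rows, the training code is what stalls the GPU.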

JanuszL · Dec 30 '24