Compatibility with PyTorch DataLoader
Hi,
I tried to combine decord's GPU decoding with PyTorch's DataLoader. The code snippet looks like the following:
import os
import decord
import torch

class VideoDataSet(torch.utils.data.Dataset):
    ...
    def __getitem__(self, index):
        decord.bridge.set_bridge('torch')
        vr = decord.VideoReader(self.video_list[index], ctx=decord.gpu(int(os.getenv('LOCAL_RANK'))))
        frames = vr.get_batch(indices)
        return frames

if __name__ == '__main__':
    torch.multiprocessing.set_start_method("spawn")  # avoid CUDA initialization error
    dataset = VideoDataSet("kinetics/K700/video_raw", "kinetics/K700/k700_train_video.txt")
    loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, num_workers=16)
The following messages show up repeatedly in the training log, but no input data is ever consumed.
Reducing num_workers to 1 or even 0 doesn't change the situation.
[15:34:05] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:07] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:12] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:12] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
decord's GPU decoding, when directly used without Dataset or DataLoader, works well in the same environment.
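For reference, a minimal sketch of the direct usage that works (no Dataset or DataLoader involved; the path is just a placeholder):

import decord
decord.bridge.set_bridge('torch')  # frames come back as torch tensors

vr = decord.VideoReader('some_clip.mp4', ctx=decord.gpu(0))  # placeholder path
frames = vr.get_batch(list(range(0, len(vr), 8)))            # every 8th frame, decoded on the GPU
print(frames.shape)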
UPDATE:
This seems to be a memory-related issue.
When using a small model and setting batch_size=1 and num_workers=1, training runs well.
But increasing batch_size or num_workers easily leads to this error:
CUDA error 2 at line 176 in file /decord/src/video/nvcodec/cuda_decoder_impl.cc: out of memory
I don't think you will benefit from GPU decoding if you use num_workers>0 for data loading, because the DataLoader uses multiple CPU worker processes to speed up loading, which significantly increases GPU memory usage (each worker process occupies some CUDA init memory).
I'm not entirely clear on how PyTorch multiprocessing handles tensors between CPU and GPU, but very likely it will copy the decoded tensor from the GPU, batch it, then copy it to the GPU again, so you'd better use the CPU for video decoding in this case.
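A minimal sketch of that suggestion, changing only the decoding context in the __getitem__ above (everything else stays the same):

        # decode on the CPU inside each worker process; the training loop
        # moves the batched tensor to the GPU afterwards
        vr = decord.VideoReader(self.video_list[index], ctx=decord.cpu(0))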
You'd better try num_workers=0 to avoid multiprocessing DataLoader.
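For the GPU-decoding path, that would look roughly like this, so that only the main process creates a CUDA context:

    loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, num_workers=0)  # decoding happens in the main process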
Sorry, but I'm confused: why would the data be copied to the GPU again after being batched?
@DelightRun Are you using the GPU for training or inference? If so, the data needs to be copied to the GPU again from the batched input.
@zhreshold After reading PyTorch's code, the answer is no. PyTorch uses CUDA IPC to share tensors between processes by default, so a GPU-CPU memory copy is not necessary as long as the data stays on the same GPU.
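A small sketch of that mechanism outside of the DataLoader, assuming a single GPU; note that the producer has to stay alive until the consumer is done with the tensor:

import torch
import torch.multiprocessing as mp

def producer(q, done):
    t = torch.arange(4, device='cuda')  # tensor created directly on the GPU
    q.put(t)     # handed to the parent via CUDA IPC, no host round trip
    done.wait()  # keep the original tensor alive while the consumer uses it

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q, done = mp.Queue(), mp.Event()
    p = mp.Process(target=producer, args=(q, done))
    p.start()
    t = q.get()
    print(t.device, t)  # still cuda:0, no copy through host memory
    done.set()
    p.join()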
The DataLoader in PyTorch prefetches a lot of videos: prefetch_factor x num_workers x batch_size (prefetch_factor = 2 by default). These videos are stored on the GPU, which consumes a lot of memory, so you may want to reduce those hyper-parameters. Also, because of the prefetching, data loading in PyTorch can be fast enough that decoding is not the bottleneck. In my experiment, GPU decoding was actually slower than CPU decoding, since in this case the GPU has to handle a lot of extra work that could have been done asynchronously on the CPU.
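As a rough back-of-the-envelope check with the settings from the snippet at the top (and PyTorch's default prefetch_factor of 2):

prefetch_factor, num_workers, batch_size = 2, 16, 16
clips_in_flight = prefetch_factor * num_workers * batch_size
print(clips_in_flight)  # up to 512 decoded clips can be resident on the GPU at once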