Compatibility with PyTorch DataLoader
Hi,
I tried to combine decord's GPU decoding with PyTorch's DataLoader. The code snippet looks like the following:
import os
import decord
import torch

class VideoDataSet(torch.utils.data.Dataset):
    ...
    def __getitem__(self, index):
        decord.bridge.set_bridge('torch')
        vr = decord.VideoReader(self.video_list[index], ctx=decord.gpu(int(os.getenv('LOCAL_RANK'))))
        frames = vr.get_batch(indices)
        return frames

if __name__ == '__main__':
    torch.multiprocessing.set_start_method("spawn")  # avoid CUDA initialization error
    dataset = VideoDataSet("kinetics/K700/video_raw", "kinetics/K700/k700_train_video.txt")
    loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, num_workers=16)
The following messages show up repeatedly in the training log, but no input data is ever consumed.
Reducing num_workers to 1 or even 0 doesn't change the situation.
[15:34:05] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:07] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:12] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:12] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:13] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:36: Using device: Tesla V100-SXM2-32GB
[15:34:14] /decord/src/video/nvcodec/cuda_threaded_decoder.cc:56: Kernel module version 418.116, so using our own stream.
decord's GPU decoding, when directly used without Dataset or DataLoader, works well in the same environment.
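For reference, a minimal sketch of the direct usage that works (no Dataset or DataLoader involved; the path is just a placeholder):

import decord
decord.bridge.set_bridge('torch')  # frames come back as torch tensors

vr = decord.VideoReader('some_clip.mp4', ctx=decord.gpu(0))  # placeholder path
frames = vr.get_batch(list(range(0, len(vr), 8)))            # every 8th frame, decoded on the GPU
print(frames.shape)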
UPDATE:
This seems to be a memory-related issue.
When using a small model and setting batch_size=1 and num_workers=1, training runs well.
But increasing batch_size or num_workers easily leads to this error:
CUDA error 2 at line 176 in file /decord/src/video/nvcodec/cuda_decoder_impl.cc: out of memory
I don't think you will benefit from GPU decoding if you use num_workers>0 for data loading, because the DataLoader uses multiple CPU worker processes to speed up loading, which significantly increases GPU memory usage (each worker process occupies some CUDA init memory).
I'm not entirely clear on how PyTorch multiprocessing handles tensors between CPU and GPU, but very likely it will copy the decoded tensor from the GPU, batch it, then copy it to the GPU again, so you'd better use the CPU for video decoding in this case.
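A minimal sketch of that suggestion, changing only the decoding context in the __getitem__ above (everything else stays the same):

        # decode on the CPU inside each worker process; the training loop
        # moves the batched tensor to the GPU afterwards
        vr = decord.VideoReader(self.video_list[index], ctx=decord.cpu(0))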
You'd better try num_workers=0 to avoid multiprocessing DataLoader.
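For the GPU-decoding path, that would look roughly like this, so that only the main process creates a CUDA context:

    loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, num_workers=0)  # decoding happens in the main process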
Sorry, but I'm confused: why would the data be copied to the GPU again after being batched?
@DelightRun Are you using the GPU for training or inference? If so, the data needs to be copied to the GPU again from the batched input.
@zhreshold After reading PyTorch's code, the answer is no. PyTorch uses CUDA IPC to share tensors between processes by default, so a GPU-CPU memory copy is not necessary as long as the data stays on the same GPU.
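A small sketch of that mechanism outside of the DataLoader, assuming a single GPU; note that the producer has to stay alive until the consumer is done with the tensor:

import torch
import torch.multiprocessing as mp

def producer(q, done):
    t = torch.arange(4, device='cuda')  # tensor created directly on the GPU
    q.put(t)     # handed to the parent via CUDA IPC, no host round trip
    done.wait()  # keep the original tensor alive while the consumer uses it

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q, done = mp.Queue(), mp.Event()
    p = mp.Process(target=producer, args=(q, done))
    p.start()
    t = q.get()
    print(t.device, t)  # still cuda:0, no copy through host memory
    done.set()
    p.join()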
The DataLoader in PyTorch prefetches a lot of videos: prefetch_factor x num_workers x batch_size (prefetch_factor = 2 by default). These videos are stored on the GPU, which consumes a lot of memory, so you may want to reduce those hyper-parameters. Also, because of the prefetching, data loading in PyTorch can be fast enough that decoding is not the bottleneck. In my experiment, GPU decoding was actually slower than CPU decoding, since in this case the GPU has to handle a lot of extra work that could have been done asynchronously on the CPU.
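As a rough back-of-the-envelope check with the settings from the snippet at the top (and PyTorch's default prefetch_factor of 2):

prefetch_factor, num_workers, batch_size = 2, 16, 16
clips_in_flight = prefetch_factor * num_workers * batch_size
print(clips_in_flight)  # up to 512 decoded clips can be resident on the GPU at once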