decord
[Query] Using ctx=gpu performance difference. Is this expected?
Hi. Thanks for the amazing repository.
I built decord with the -DUSE_CUDA option and then tried the example here (https://github.com/zhreshold/decord/blob/master/examples/video_loader.ipynb). I averaged the wall time reported by %time over ten runs. The statement I timed was
vl = de.VideoLoader(videos, ctx=ctx, shape=shape, interval=interval, skip=skip, shuffle=0)
| cpu | gpu |
|---|---|
| 53 | 40 |
I also tried various shuffle strategies, but on a given device nearly all of them gave the same timing.
Wondering if this is what is expected.
GPU frames need to be copied to the CPU before display, so that can be a considerable overhead. During training, if you are going to consume these frames directly on the GPU, it saves twice the traffic:
t(CPU->GPU) - t(GPU->CPU)
Does it make sense?
I only used %time for the line vl = de.VideoLoader(videos, ctx=ctx, shape=shape, interval=interval, skip=skip, shuffle=0). If I understand correctly (let me know if I am incorrect), that shouldn't require transfer of frames from gpu to cpu.
OK, the line you pointed to does nothing but instantiate a VideoLoader; only some header parsing and preprocessing is done at that point. You need to measure the actual time spent reading frames.
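To make the distinction concrete, here is a minimal sketch that times construction and frame reading separately. The helper name `time_loader` and its `max_batches` parameter are made up for illustration and are not part of decord; it only assumes the loader is iterable, as in the example notebook.

```python
import itertools
import time


def time_loader(make_loader, max_batches=None):
    """Return (construction_seconds, iteration_seconds, batches_read).

    make_loader: zero-argument callable that builds the loader (hypothetical
                 helper argument, e.g. a lambda wrapping de.VideoLoader(...)).
    max_batches: optional cap on how many batches to read; None reads all.
    """
    t0 = time.perf_counter()
    loader = make_loader()  # construction only, no frames are read here
    t1 = time.perf_counter()

    batches_read = 0
    # Iterating is what actually pulls batches from the loader.
    for _batch in itertools.islice(loader, max_batches):
        batches_read += 1
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1, batches_read


# Assumed usage with decord, reusing the arguments from the question
# (videos, shape, interval, skip must be defined as in the notebook):
#
#   import decord as de
#   build_s, read_s, n = time_loader(
#       lambda: de.VideoLoader(videos, ctx=de.gpu(0), shape=shape,
#                              interval=interval, skip=skip, shuffle=0))
```

Comparing `read_s` between `de.cpu(0)` and `de.gpu(0)` should say more than comparing construction times. If the loop body also converts each batch to NumPy for display, the GPU to CPU copy is included in the measurement.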