Error with GPU-only Image Decoding in NVIDIA DALI Pipeline
Describe the question.
I’m encountering an issue while running a DALI pipeline with GPU-only decoding. The pipeline works when the fn.decoders.image operator is set to "mixed" mode, but it fails with device="gpu" mode, throwing an error about incompatible device storage for the input. Here’s the setup and error details:
Code:
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline

class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32])

    def define_graph(self):
        self.jpegs, self.labels = self.input
        # This works:
        # self.decode = fn.decoders.image(self.jpegs, device="mixed", output_type=types.RGB)
        # This fails with incompatible device storage error:
        self.decode = fn.decoders.image(self.jpegs, device="gpu", output_type=types.RGB)
        self.resize = fn.resize(self.decode, device="gpu", resize_x=1120, resize_y=640)
        self.cmnp = fn.crop_mirror_normalize(
            self.resize, device="gpu", dtype=types.FLOAT, output_layout="CHW",
            crop=(640, 1120), mean=[0.0, 0.0, 0.0], std=[255.0, 255.0, 255.0]
        )
        return self.cmnp, self.labels

pipe = SimplePipeline(batch_size=batch_size, num_threads=32, device_id=0, external_data=iter)
pipe.build()
Error:
RuntimeError: Assert on "IsCompatibleDevice(dev, inp_dev, op_type)" failed:
The input 0 for gpu operator nvidia.dali.fn.decoders.image is stored on incompatible device "cpu". Valid device is "gpu".
GPU and Platform Information:
GPU: NVIDIA RTX 6000 Ada Generation
CUDA Version: 12.2
DALI Version: [specify DALI version if known]
Driver Version: 535.104.05
System: Running in a Docker container with NVIDIA GPU support enabled
CUFile GDS Check: Here are the results from running gdscheck:
(base) ➜ tools ./gdscheck -p
warn: error opening log file: Permission denied, logging will be disabled
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 1024
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 18014398509481980
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 1024
execution.max_request_parallelism : 0
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):65536, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Platform verification succeeded
(base) ➜ tools
Additional Notes: The pipeline works when device="mixed" is used for fn.decoders.image, but switching to device="gpu" causes the error. I’m using external data for fn.external_source, which may be causing the device compatibility issue. The goal is to decode directly on the GPU to optimize performance.
Check for duplicates
- [x] I have searched the open bugs/issues and have found no duplicates for this bug report
Hi @aafaqin,
Thank you for reaching out.
You can read more about the meaning of the operator backend here. The mixed backend is used for operators that consume input from the CPU and produce output on the GPU. The decoder operator supports only the cpu and mixed backends, so the encoded images must be located on the CPU first. There is no variant that can read data directly from GPU memory. The rationale is that, while most of the decoding process can run on the GPU, some initial stages either have to happen on the CPU or are simply more efficient there (bitstream parsing, Huffman coefficient decoding).
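To illustrate why the bitstream parsing stage mentioned above is a natural fit for the CPU, here is a minimal sketch that walks the marker segments of a JPEG header to find the frame dimensions. It is not DALI's or nvJPEG's implementation; the function name `jpeg_dimensions` is hypothetical, and the sketch assumes a well-formed header without fill bytes between segments. Each segment's position depends on the length of the one before it, so the walk is inherently sequential and byte-oriented.

```python
import struct


def jpeg_dimensions(data: bytes) -> tuple:
    """Walk JPEG marker segments and return (width, height) from the first SOF segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    # Start-of-frame markers are 0xC0..0xCF, except DHT (0xC4), JPG (0xC8), DAC (0xCC).
    sof_markers = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in sof_markers:
            # Segment payload: precision (1 byte), height (2 bytes), width (2 bytes), ...
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        # The next segment can only be located after reading this one's length.
        pos += 2 + length
    raise ValueError("no SOF marker found")
```

In practice `data` would be the encoded bytes of one image, for example the contents of a file opened in binary mode.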
Hi @JanuszL ,
Thank you for the clarification. I understand now that the mixed mode is essential for handling initial decoding stages on the CPU before GPU processing can take place. Given our intent to enhance performance through GPU utilization, I am curious if we can integrate GPU Direct Storage (GDS) with DALI to streamline data transfers directly from storage to GPU memory, bypassing the CPU to accelerate the workflow. Could this approach mitigate the need for CPU involvement in the initial decoding steps, or would it be feasible to adjust the pipeline to support such a configuration?
Additionally, we are exploring methods for writing to disk with JPEG compression and are considering the use of nvjpeg combined with cufile for efficient disk writing. Do you suggest this approach, or is there an alternative method within DALI or NVIDIA's libraries that would better suit our needs?
Looking forward to your insights.
Best regards
Hi @aafaqin,
> I am curious if we can integrate GPU Direct Storage (GDS) with DALI to streamline data transfers directly from storage to GPU memory, bypassing the CPU to accelerate the workflow.
I'm afraid this is not currently possible, as the decoding process requires some work to happen on the CPU first: stream parsing and, when the hybrid approach is used rather than hardware decoding, Huffman coefficient decoding.
> Additionally, we are exploring methods for writing to disk with JPEG compression and are considering the use of nvjpeg combined with cufile for efficient disk writing. Do you suggest this approach, or is there an alternative method within DALI or NVIDIA's libraries that would better suit our needs?
DALI hasn't approached encoding yet. Technically it should be feasible; however, I'm not sure whether the encoded images end up in CPU or GPU memory. You may try using nvImageCodec for decoding and kvikio for GDS access.
Thanks for the help so far. On the same code, I am trying out different approaches, such as:
class SimplePipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, external_data):
        super(SimplePipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        # self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32])
        self.input = fn.external_source(source=external_data, num_outputs=2, dtype=[types.UINT8, types.INT32], parallel=True, prefetch_queue_depth=16, batch=True)

    def define_graph(self):
        self.jpegs, self.labels = self.input
        self.decode = fn.decoders.image(self.jpegs, device="mixed", output_type=types.RGB)
        self.resize = fn.resize(self.decode, device="gpu", resize_x=1600, resize_y=1600)
        # self.prem = fn.transpose(self.resize, perm=[2,0,1], dtype=types.FLOAT)
        self.cmnp = fn.crop_mirror_normalize(self.resize, device="gpu",
                                             dtype=types.FLOAT,
                                             output_layout="CHW",
                                             crop=(1600, 1600),
                                             mean=[0.0, 0.0, 0.0],
                                             std=[255.0, 255.0, 255.0])
        return self.cmnp, self.labels
Still, only one CPU core is being used (at 100% utilisation). I have a 64-core CPU; how can I spread the work across the cores?
Hi @aafaqin,
> Still, only one CPU core is being used (at 100% utilisation). I have a 64-core CPU; how can I spread the work across the cores?
This means either that you are using only one DALI thread (see the num_threads value) or that the batch size is 1. Can you set, for example, num_threads=10 and a batch size of 256 and see whether that makes any difference?
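As a rough analogy for the point above (this is a plain Python thread pool, not DALI's own), the sketch below hands each sample of a batch to a worker pool and counts how many workers actually did any work. The function name `threads_used` and the short sleep standing in for per-sample CPU work are illustrative assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def threads_used(batch_size: int, num_threads: int, work_s: float = 0.01) -> int:
    """Return how many distinct worker threads processed at least one sample."""
    seen = set()
    lock = threading.Lock()

    def process_sample(_index: int) -> None:
        time.sleep(work_s)  # stand-in for per-sample CPU work
        with lock:
            seen.add(threading.get_ident())

    # One task per sample: a batch can keep at most min(batch_size, num_threads) workers busy.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(process_sample, range(batch_size)))
    return len(seen)
```

With a batch size of 1 the result is 1 no matter how large `num_threads` is, which matches the single busy core described in the question. With a larger batch the count can rise up to `num_threads`.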
I've set the num_threads in the DALI pipeline to match the number of CPU cores (64 in my case) and verified the DALI_AFFINITY_MASK. Despite this, I am not seeing any significant performance improvement when increasing the batch size. The average processing speed per image remains unchanged, regardless of adjustments to the batch size.
Do you have any insights on what could be causing this bottleneck? Could it be related to how external inputs are being processed or perhaps the GPU-CPU synchronization? Any suggestions to optimize this further would be greatly appreciated.
Can you try capturing a profile of the processing with Nsight and share what it looks like?
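Alongside a profile, one quick way to test the external-input hypothesis raised above is to time the data source on its own, with no DALI pipeline attached. The sketch below is a generic helper, not part of DALI; the name `source_throughput` and its parameters are hypothetical. If the source alone yields samples at roughly the same rate as the full pipeline, the bottleneck is probably in the Python code feeding `fn.external_source` rather than in decoding.

```python
import time
from typing import Iterable


def source_throughput(batches: Iterable, batch_size: int, max_batches: int = 50) -> float:
    """Consume up to max_batches from the source alone and return samples per second."""
    it = iter(batches)
    consumed = 0
    start = time.perf_counter()
    for _ in range(max_batches):
        try:
            next(it)  # pull one batch; the data itself is discarded
        except StopIteration:
            break
        consumed += 1
    elapsed = time.perf_counter() - start
    if consumed == 0:
        return 0.0
    return consumed * batch_size / elapsed if elapsed > 0 else float("inf")
```

Here `batches` would be the same iterable that is passed as `source` to the external source operator, and the returned rate can be compared against the end-to-end images per second measured for the pipeline.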