CV-CUDA [QUESTION] Does CV-CUDA support multi-GPU?

Hi, I want to use this great work to speed up torch-based distributed training. It works well on a single GPU, but with more than one GPU it crashes with the following error:

`terminate called after throwing an instance of 'pybind11::error_already_set'`
`what(): ValueError: Hold resources failed: cudaErrorInvalidResourceHandle: invalid resource handle`

I added some print statements to debug this and found that everything is fine in rank_0, but cvcuda crashes in rank_1.

The main code is shown below:

````python
# Define the cuda device, context and streams.
cuda_device = cuda.Device(self.rank)
cuda_ctx = cuda_device.retain_primary_context()
cuda_ctx.push()
cvcuda_stream = cvcuda.Stream().current
torch_stream = torch.cuda.default_stream(device=cuda_device)

print(f'rank_{self.rank} start train, cvcuda stream: {cvcuda_stream}, torch_stream: {torch_stream}')
self.data_preprocessor = PreprocessorCvcuda(
    self.rank,
    cuda_ctx,
    cvcuda_stream,
)

# Do everything in streams.
with cvcuda_stream, torch.cuda.stream(torch_stream):
    self.train(train_dataloaders, test_dataloaders, iterations=iterations)
    cuda_ctx.pop()
````

````python
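# Imports this snippet relies on (assumed; they were not part of the original post).
import numpy as np
import cvcuda
import nvcv
from nvidia import nvimgcodec


class ImageBatchDecoder: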
    def __init__(
        self,
        device_id,
        cuda_ctx,
        cuda_stream,
        cvcuda_perf=None,
    ):
        self.device_id = device_id
        self.cuda_ctx = cuda_ctx
        self.cuda_stream = cuda_stream
        self.cvcuda_perf = cvcuda_perf
        self.decoder = nvimgcodec.Decoder(device_id=device_id)

    def __call__(self, batch: list, aug_params: dict):
        # args: 
        #   batch: batch of undecoded images bytes
        if self.cvcuda_perf is not None:
            self.cvcuda_perf.push_range("decoder.nvimagecodec")

        data_batch = [img for frame in batch for img in frame]

        tensor_list = []
        print(f'rank_{self.device_id} start decode, stream: {self.cuda_stream}...', flush=True)
        image_list = self.decoder.decode(data_batch, cuda_stream=self.cuda_stream)
        print(f'rank_{self.device_id} end decode...', flush=True)

        resize = aug_params['resize'].view(-1, 2).cpu().numpy()
        crop = aug_params['crop'].view(-1, 4).cpu().numpy()
        rotate = aug_params['rotate'].view(-1).cpu().numpy()
        rotate_rad = rotate * 3.1415926535897932384626433832795 / 180
        sin_r = np.sin(rotate_rad)
        cor_r = np.cos(rotate_rad)
        # Convert the decoded images to nvcv tensors in a list.
        for i in range(len(image_list)):
            print(f'rank_{self.device_id} start resize_crop_convert_reformat...', flush=True)
            aug_img = cvcuda.resize_crop_convert_reformat(
                cvcuda.as_tensor(image_list[i], "HWC"),
                (resize[i, 0], resize[i, 1]),
                cvcuda.Interp.LINEAR,
                cvcuda.RectI(
                    crop[i, 0], 
                    crop[i, 1], 
                    round(crop[i, 2] - crop[i, 0]), 
                    round(crop[i, 3] - crop[i, 1])),
                layout="HWC",
                data_type=nvcv.Type.U8,
                # manip=cvcuda.ChannelManip.REVERSE,
                # scale=1. / 255,
                stream=self.cuda_stream,
            )
            print(f'rank_{self.device_id} start rotate...', flush=True)
            aug_img = cvcuda.rotate(
                aug_img,
                rotate[i],
                [0.5 * (aug_img.shape[1] - aug_img.shape[1] * cor_r[i] - aug_img.shape[0] * sin_r[i]),
                 0.5 * (aug_img.shape[0] + aug_img.shape[1] * sin_r[i] - aug_img.shape[0] * cor_r[i])], 
                cvcuda.Interp.LINEAR,
                stream=self.cuda_stream,
            )
            tensor_list.append(aug_img)

        # Stack the list of tensors to a single NHWC tensor and convert to NCHW.
        print(f'rank_{self.device_id} start reformat...', flush=True)
        cvcuda_decoded_tensor = cvcuda.reformat(cvcuda.stack(tensor_list), "NCHW", stream=self.cuda_stream)

        if self.cvcuda_perf is not None:
            self.cvcuda_perf.pop_range()
        print(f'rank_{self.device_id} end of ImageBatchDecoder...', flush=True)
        return cvcuda_decoded_tensor
````

Oct 18 '24 07:10 zhengjs

Hi @zhengjs,

While CV-CUDA and nvImageCodec work great for inference, they may not be well suited to the multiprocess data loading approach PyTorch applies for training. Have you considered using DALI for that purpose? It provides seamless integration with PyTorch.

Oct 18 '24 07:10 JanuszL

@JanuszL Thank you for your reply! Yes, I have considered using DALI, but I think it's a little complicated: I would have to refactor a lot of dataset code, so I use CV-CUDA instead. In fact, I don't use CV-CUDA in the dataset. I use it at the beginning of each iteration: the dataloader only reads image bytes, and before the model forward pass I use nvImageCodec and CV-CUDA to do decoding and augmentation on the GPU. I think the problem may be that `cvcuda.Stream().current` doesn't specify the device_id, but I haven't found any code to do that...

Oct 18 '24 08:10 zhengjs

CV-CUDA uses a vector to cache items for reuse. The problem is that when the second GPU is used, the cache returns a resource that belongs to the previous GPU. Functions related to reformat: "Tensor::CreateFromReqs" and "CreateOperatorEx".

Jan 08 '25 10:01 Novelian

@zhengjs @JanuszL A possible fix: add device_id to the key of each cached resource (tensor), or create a separate resource collection for each GPU.
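
The first option can be illustrated with a small pure-Python sketch. This is not CV-CUDA's actual cache; the class `PerDeviceResourceCache` and its `acquire`/`release` methods are hypothetical names used only to show the idea of including the device ID in the cache key, so that a resource created on one GPU is never handed out for another.

````python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List, Tuple


class PerDeviceResourceCache:
    """Hypothetical reuse cache keyed by (device_id, requirements)."""

    def __init__(self) -> None:
        # One free list per (device, requirements) pair.
        self._free: Dict[Tuple[int, Hashable], List[Any]] = defaultdict(list)

    def acquire(self, device_id: int, reqs: Hashable, create: Callable[[], Any]) -> Any:
        """Reuse a cached resource for this device, or create a new one."""
        bucket = self._free[(device_id, reqs)]
        return bucket.pop() if bucket else create()

    def release(self, device_id: int, reqs: Hashable, resource: Any) -> None:
        """Return a resource to the free list of the device it was created on."""
        self._free[(device_id, reqs)].append(resource)


# Usage: plain dicts stand in for device tensors.
cache = PerDeviceResourceCache()
reqs = ("NHWC", (1, 224, 224, 3))
tensor_gpu0 = cache.acquire(0, reqs, lambda: {"device": 0})
cache.release(0, reqs, tensor_gpu0)
# A request from GPU 1 with identical requirements does not get GPU 0's tensor.
tensor_gpu1 = cache.acquire(1, reqs, lambda: {"device": 1})
````

Without the device ID in the key, the last call would have returned `tensor_gpu0`, which matches the failure described above.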

Jan 10 '25 01:01 Novelian

@zhengjs I found a solution: call cudart.cudaSetDevice from cuda-python before creating an NVCV stream. NVCV uses cudaStreamCreateWithFlags to create its CUDA stream.
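
A minimal sketch of this workaround, assuming the `from cuda import cudart` import path of cuda-python (newer releases may expose the runtime bindings under a different module path) and that each rank calls it before any other CV-CUDA work. The helper name `create_stream_on_device` is hypothetical.

````python
def create_stream_on_device(device_id: int):
    """Make `device_id` the current CUDA device, then create a CV-CUDA stream."""
    # Imported lazily so the module can still be loaded on machines without CUDA.
    from cuda import cudart  # runtime bindings from cuda-python (assumed path)
    import cvcuda

    # cuda-python calls return a tuple whose first element is the error code.
    (err,) = cudart.cudaSetDevice(device_id)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaSetDevice({device_id}) failed: {err}")

    # The stream is created while `device_id` is the current device.
    return cvcuda.Stream()
````

In the training code from the question, this would be called as `create_stream_on_device(self.rank)` in place of `cvcuda.Stream().current`.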

Apr 01 '25 03:04 AlphaCat00

@Novelian @AlphaCat00 CV-CUDA samples and benchmarking scripts can run on multiple GPUs. In fact, the README talks about multi-GPU launches. These lines in benchmark.py launch any CV-CUDA Python sample on more than one GPU at the same time. They launch the samples as subprocesses, each with a different CUDA device pre-allocated by setting CUDA_VISIBLE_DEVICES to only the GPU on which that subprocess is going to execute. Take a look at these lines. This ensures that only that particular GPU is visible to the process and that it shows up as GPU ID 0 inside that process.
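
The launch pattern described above can be sketched as follows. This is not the code from benchmark.py; `launch_per_gpu` is a hypothetical helper that only shows how each subprocess gets its own `CUDA_VISIBLE_DEVICES` value through its environment.

````python
import os
import subprocess
import sys
from typing import List


def launch_per_gpu(script_args: List[str], gpu_ids: List[int]) -> List[subprocess.Popen]:
    """Start one Python subprocess per GPU, each seeing only its own device."""
    procs = []
    for gpu_id in gpu_ids:
        env = os.environ.copy()
        # The child sees exactly one GPU, which it addresses as device 0.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        procs.append(
            subprocess.Popen(
                [sys.executable, *script_args],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    return procs
````

Because the variable is set in the child's environment before the interpreter starts, it is already in place when the child first initializes CUDA. Each child should then use device ID 0 regardless of which physical GPU it was given.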

Let me know if this answers your question.

Apr 03 '25 21:04 dsuthar-nvidia

@dsuthar-nvidia Thank you for your reply. How can I use multiple GPUs with multiprocessing.Process? Setting os.environ["CUDA_VISIBLE_DEVICES"] does not work in the subprocess.

Apr 10 '25 06:04 Novelian

I encountered a similar problem. I tried to scale an image on GPU 1, but the returned tensor was stored on GPU 0 by default, which triggered the error "terminate called after throwing an instance of 'pybind11::error_already_set'".

Jul 18 '25 17:07 Soleilor