[QUESTION] Does CV-CUDA support multi-GPU?
Hi, I want to use this great work in torch-based distributed training to speed things up. It works well with a single GPU, but with more than one GPU it crashes with the following error:
terminate called after throwing an instance of 'pybind11::error_already_set' what(): ValueError: Hold resources failed: cudaErrorInvalidResourceHandle: invalid resource handle
I printed some debug info and found that everything is fine on rank_0, but cvcuda crashes on rank_1.
The main code is shown below:
````python
# Define the cuda device, context and streams.
cuda_device = cuda.Device(self.rank)
cuda_ctx = cuda_device.retain_primary_context()
cuda_ctx.push()
cvcuda_stream = cvcuda.Stream().current
torch_stream = torch.cuda.default_stream(device=cuda_device)
print(f'rank_{self.rank} start train, cvcuda stream: {cvcuda_stream}, torch_stream: {torch_stream}')
self.data_preprocessor = PreprocessorCvcuda(
    self.rank,
    cuda_ctx,
    cvcuda_stream,
)
# Do everything in streams.
with cvcuda_stream, torch.cuda.stream(torch_stream):
    self.train(train_dataloaders, test_dataloaders, iterations=iterations)
cuda_ctx.pop()
````
````python
class ImageBatchDecoder:
    def __init__(
        self,
        device_id,
        cuda_ctx,
        cuda_stream,
        cvcuda_perf=None,
    ):
        self.device_id = device_id
        self.cuda_ctx = cuda_ctx
        self.cuda_stream = cuda_stream
        self.cvcuda_perf = cvcuda_perf
        self.decoder = nvimgcodec.Decoder(device_id=device_id)

    def __call__(self, batch: list, aug_params: dict):
        # args:
        #   batch: batch of undecoded images bytes
        if self.cvcuda_perf is not None:
            self.cvcuda_perf.push_range("decoder.nvimagecodec")
        data_batch = [img for frame in batch for img in frame]
        tensor_list = []
        print(f'rank_{self.device_id} start decode, stream: {self.cuda_stream}...', flush=True)
        image_list = self.decoder.decode(data_batch, cuda_stream=self.cuda_stream)
        print(f'rank_{self.device_id} end decode...', flush=True)
        resize = aug_params['resize'].view(-1, 2).cpu().numpy()
        crop = aug_params['crop'].view(-1, 4).cpu().numpy()
        rotate = aug_params['rotate'].view(-1).cpu().numpy()
        rotate_rad = rotate * 3.1415926535897932384626433832795 / 180
        sin_r = np.sin(rotate_rad)
        cor_r = np.cos(rotate_rad)
        # Convert the decoded images to nvcv tensors in a list.
        for i in range(len(image_list)):
            print(f'rank_{self.device_id} start resize_crop_convert_reformat...', flush=True)
            aug_img = cvcuda.resize_crop_convert_reformat(
                cvcuda.as_tensor(image_list[i], "HWC"),
                (resize[i, 0], resize[i, 1]),
                cvcuda.Interp.LINEAR,
                cvcuda.RectI(
                    crop[i, 0],
                    crop[i, 1],
                    round(crop[i, 2] - crop[i, 0]),
                    round(crop[i, 3] - crop[i, 1])),
                layout="HWC",
                data_type=nvcv.Type.U8,
                # manip=cvcuda.ChannelManip.REVERSE,
                # scale=1. / 255,
                stream=self.cuda_stream,
            )
            print(f'rank_{self.device_id} start rotate...', flush=True)
            aug_img = cvcuda.rotate(
                aug_img,
                rotate[i],
                [0.5 * (aug_img.shape[1] - aug_img.shape[1] * cor_r[i] - aug_img.shape[0] * sin_r[i]),
                 0.5 * (aug_img.shape[0] + aug_img.shape[1] * sin_r[i] - aug_img.shape[0] * cor_r[i])],
                cvcuda.Interp.LINEAR,
                stream=self.cuda_stream,
            )
            tensor_list.append(aug_img)
        # Stack the list of tensors to a single NHWC tensor and convert to NCHW.
        print(f'rank_{self.device_id} start reformat...', flush=True)
        cvcuda_decoded_tensor = cvcuda.reformat(cvcuda.stack(tensor_list), "NCHW", stream=self.cuda_stream)
        if self.cvcuda_perf is not None:
            self.cvcuda_perf.pop_range()
        print(f'rank_{self.device_id} end of ImageBatchDecoder...', flush=True)
        return cvcuda_decoded_tensor
````
Hi @zhengjs,
While CV-CUDA and nvImageCodec work great for inference, they may not be well suited to the multiprocess data loading approach PyTorch applies for training. Have you considered using DALI for that purpose? It provides seamless integration with PyTorch.
@JanuszL Thank you for your reply! Yes, I have considered using DALI, but it seems a little complicated: I would have to refactor a lot of dataset code, so I use CV-CUDA instead. In fact, I don't use CV-CUDA in the dataset; I use it at the beginning of each iteration. The dataloader only reads image bytes, and before the model's forward pass I use nvImageCodec and CV-CUDA to decode and augment on the GPU. I think the problem may be that cvcuda.Stream().current doesn't specify the device_id, but I haven't found any way to do that...
CV-CUDA uses a vector to cache items for reuse. The problem is that when the second GPU is used, the cache returns a resource that belongs to the previous GPU. Functions related to reformat: "Tensor::CreateFromReqs" and "CreateOperatorEx".
@zhengjs @JanuszL A possible fix: add device_id to the key of each cached resource (tensor), or create a separate resource collection per GPU.
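To illustrate the idea of keying cached resources by device, here is a minimal pure-Python sketch. It is not CV-CUDA's actual cache: the class `PerDeviceCache`, its methods, and the `spec` argument are hypothetical names used only to show how including `device_id` in the key keeps a resource created on one GPU from being handed out on another.

```python
from collections import defaultdict


class PerDeviceCache:
    """Hypothetical resource cache whose key includes the device id."""

    def __init__(self):
        # (device_id, spec) -> list of idle resources ready for reuse
        self._free = defaultdict(list)

    def acquire(self, device_id, spec, create):
        """Reuse an idle resource for this device and spec, or create one."""
        bucket = self._free[(device_id, spec)]
        if bucket:
            return bucket.pop()
        return create(device_id, spec)

    def release(self, device_id, spec, resource):
        """Return a resource to the bucket of the device it was created on."""
        self._free[(device_id, spec)].append(resource)


def make_resource(device_id, spec):
    # Stand-in for a real allocation such as a tensor or an operator.
    return {"device": device_id, "spec": spec}


cache = PerDeviceCache()
spec = ("NHWC", 224, 224, 3)
on_gpu0 = cache.acquire(0, spec, make_resource)
cache.release(0, spec, on_gpu0)
# Same spec, different device: the idle GPU 0 resource is not reused.
on_gpu1 = cache.acquire(1, spec, make_resource)
```

With a key that omits `device_id`, the second `acquire` would return the GPU 0 resource, which matches the failure described above.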
@zhengjs I found a solution: call cudart.cudaSetDevice from cuda-python before creating an NVCV stream. NVCV uses cudaStreamCreateWithFlags to create its CUDA stream.
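A minimal sketch of this workaround follows. It assumes cuda-python exposes the runtime bindings as `from cuda import cudart` (newer releases may prefer `cuda.bindings.runtime`) and that `cvcuda.Stream()` creates its stream on whichever device is current, since `cudaStreamCreateWithFlags` always uses the current device. The helper `create_stream_on_device` and its injectable `set_device` / `make_stream` parameters are hypothetical, added so the ordering can be checked without a GPU.

```python
def _cuda_set_device(device_id):
    # Lazy import so this module loads on machines without cuda-python.
    from cuda import cudart

    (err,) = cudart.cudaSetDevice(device_id)
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaSetDevice({device_id}) failed: {err}")


def _new_cvcuda_stream():
    # Lazy import so this module loads on machines without CV-CUDA.
    import cvcuda

    return cvcuda.Stream()


def create_stream_on_device(device_id,
                            set_device=_cuda_set_device,
                            make_stream=_new_cvcuda_stream):
    """Make device_id current first, then create the stream on it."""
    set_device(device_id)
    return make_stream()
```

In a distributed run this would be called once per process with the local rank, for example `cvcuda_stream = create_stream_on_device(rank)`, before any other CV-CUDA call.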
@Novelian @AlphaCat00 The CV-CUDA samples and benchmarking scripts can run on multiple GPUs; in fact, the Readme covers multi-GPU launches. These lines in benchmark.py launch any CV-CUDA Python sample on more than one GPU at the same time. Each sample runs as a subprocess with a different CUDA device pre-allocated: CUDA_VISIBLE_DEVICES is set to only the GPU on which that subprocess will execute. Take a look at these lines. This ensures that only that particular GPU is visible to the process, and that it shows up as GPU ID 0 within that process.
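The launch pattern described here can be sketched with the standard library alone. This is not the code from benchmark.py: `launch_per_gpu` and the placeholder child command are hypothetical, and the child only prints the variable it received so the sketch runs without any GPU.

```python
import os
import subprocess
import sys


def launch_per_gpu(child_args, gpu_ids):
    """Start one Python child per GPU; each child sees only its own device."""
    procs = []
    for gpu_id in gpu_ids:
        env = os.environ.copy()
        # Must be set before the child initializes CUDA.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        procs.append(
            subprocess.Popen(
                [sys.executable, *child_args],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    # Children run concurrently; collect their output in launch order.
    return [proc.communicate()[0].strip() for proc in procs]


# Placeholder child: a real launch would pass the path of a sample script.
CHILD = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
outputs = launch_per_gpu(["-c", CHILD], gpu_ids=[0, 1])
```

Inside each child, the single visible GPU is addressed as device 0, so the sample code needs no per-rank device index.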
Let me know if this answers your question.
@dsuthar-nvidia Thank you for your reply. How can I use multiple GPUs with multiprocessing.Process? Setting os.environ["CUDA_VISIBLE_DEVICES"] does not work in the subprocess.
I encountered a similar problem. I tried to scale an image on GPU 1, but the returned tensor was stored on GPU 0 by default, which triggered the error "terminate called after throwing an instance of 'pybind11::error_already_set'".