
Multi-gpu training

Open mks0601 opened this issue 5 years ago • 11 comments

Did you train your model with multiple GPUs? When I train my model with your module in a multi-GPU environment, I get the error below. I used nn.DataParallel to wrap my model for multi-GPU training.

RuntimeError: CUDA error: an illegal memory access was encountered (block at /opt/conda/conda-bld/pytorch_1544176307774/work/aten/src/ATen/cuda/CUDAEvent.h:96)

Can you give me some help?

mks0601 avatar May 09 '19 16:05 mks0601

Hi

Can you provide the script you are running? I will check it out!

ShichenLiu avatar May 09 '19 17:05 ShichenLiu

I just used your example code (example/demo_render.py).

I added a Model class as shown below.

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, renderer):
        super(Model, self).__init__()
        # keep the SoftRas renderer as a submodule so DataParallel replicates it
        self.renderer = renderer

    def forward(self, mesh, camera_distance, elevation, azimuth):
        # set the camera and render inside forward() so both run on each replica
        self.renderer.transform.set_eyes_from_angles(camera_distance, elevation, azimuth)
        images = self.renderer.render_mesh(mesh)
        return images

Then I wrapped the model with torch.nn.DataParallel after defining the renderer.

model = torch.nn.DataParallel(Model(renderer)).cuda()

In the loop, I changed these lines

renderer.transform.set_eyes_from_angles(camera_distance, elevation, azimuth)
images = renderer.render_mesh(mesh)

into

images = model(mesh, camera_distance, elevation, azimuth)

Everything else is the same.
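For reference, a minimal sketch that puts the pieces above together (the constructor arguments, mesh path, and camera values are illustrative placeholders, not taken from the demo verbatim):

import torch
import torch.nn as nn
import soft_renderer as sr

# renderer and mesh roughly as in the demo script; arguments here are placeholders
renderer = sr.SoftRenderer(camera_mode='look_at')
mesh = sr.Mesh.from_obj('path/to/mesh.obj')

# wrap the Model defined above so the batch is split across available GPUs
model = nn.DataParallel(Model(renderer)).cuda()

# inside the rendering loop: both the camera update and the render call
# now go through the wrapped module
camera_distance, elevation, azimuth = 2.732, 30.0, 0.0
images = model(mesh, camera_distance, elevation, azimuth)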

mks0601 avatar May 09 '19 17:05 mks0601

Hi,

I have slightly changed the code. I suspect the problem was that the previous code did not specify the CUDA device in soft_rasterizer. Maybe that fixes the bug.
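For anyone hitting the same thing, the general pattern for this kind of fix is to launch the CUDA kernels on the device that owns the input tensors rather than on the default GPU. A minimal sketch of the idea (not the exact SoftRas change; soft_rasterize_cuda.forward stands in for the compiled extension call):

import torch

def soft_rasterize(face_vertices, textures, image_size):
    # select the GPU that holds this replica's inputs before launching kernels,
    # so DataParallel replicas do not all write to GPU 0
    with torch.cuda.device(face_vertices.device):
        # soft_rasterize_cuda is a placeholder name for the compiled extension
        return soft_rasterize_cuda.forward(face_vertices, textures, image_size)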

ShichenLiu avatar Jun 05 '19 03:06 ShichenLiu

Just wondering, does the fix solve your problem? @mks0601

ReNginx avatar Feb 05 '20 08:02 ReNginx

Doesn't seem so for me.

Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aluo/tools/SoftRas/soft_renderer/renderer.py", line 102, in forward
    return self.render_mesh(mesh, mode)
  File "/home/aluo/tools/SoftRas/soft_renderer/renderer.py", line 96, in render_mesh
    mesh = self.lighting(mesh)
  File "/home/aluo/anaconda3/envs/torch140/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/aluo/tools/SoftRas/soft_renderer/lighting.py", line 57, in forward
    mesh.textures = mesh.textures * light[:, :, None, :]
RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 3

I tried a similar strategy for DIB-R and got memory errors.
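For what it's worth, that size mismatch looks like a broadcasting failure between a 4-channel texture tensor and a 3-channel light tensor. A minimal illustration with assumed shapes (not the real SoftRas tensors):

import torch

textures = torch.rand(1, 1000, 25, 4)     # e.g. RGBA textures: last dim is 4
light = torch.rand(1, 1000, 3)            # RGB light intensities: last dim is 3
# 4 vs 3 at the last non-singleton dimension cannot broadcast
shaded = textures * light[:, :, None, :]  # raises the RuntimeError above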

aluo-x avatar Feb 10 '20 18:02 aluo-x

I somehow fixed this issue. Could you check that all of your tensors are CUDA tensors and that there are no out-of-range index problems?
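A quick sanity check along those lines, with placeholder tensor names standing in for whatever you feed the renderer:

# all inputs should sit on a CUDA device before the render call
for name, t in [('vertices', vertices), ('faces', faces), ('textures', textures)]:
    assert t.is_cuda, f'{name} is not a CUDA tensor'

# assuming vertices of shape [batch, num_vertices, 3] and integer face indices,
# catch out-of-range indices before they reach the CUDA kernel
assert int(faces.max()) < vertices.shape[1], 'face index out of range'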

mks0601 avatar Feb 10 '20 23:02 mks0601

Looking over my code, it seems to be correct. Running the same code on a model without DataParallel works. Could you provide a small snippet of how you initialize your DataParallel model and run a mesh through it?

aluo-x avatar Feb 10 '20 23:02 aluo-x

I don't think I did anything special with DataParallel. I just set face_texture at L44 of https://github.com/ShichenLiu/SoftRas/blob/master/soft_renderer/rasterizer.py to zero tensors because I do not use textures.
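In case it helps, the change described amounts to something like the line below (a rough sketch; the actual variable name and shape at that spot in rasterizer.py may differ):

import torch

# drop the texture values and feed zeros of the same shape and device instead
face_texture = torch.zeros_like(face_texture)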

mks0601 avatar Feb 11 '20 00:02 mks0601

Note that the change is probably not the solution to this error. Sorry, I asked this question a while ago, so I cannot clearly remember what I did to fix it.

mks0601 avatar Feb 11 '20 00:02 mks0601

Much appreciated. I'll try again some time later this week and report back with results.

aluo-x avatar Feb 11 '20 00:02 aluo-x

So it works now, following the code example you provided. Checking via nvidia-smi indicates that processing and memory are distributed across two GPUs.

It turns out there were a few bugs, but they were all introduced when I modified SoftRas (mostly around texture/view transforms). I think we can close this issue now.

aluo-x avatar Feb 12 '20 17:02 aluo-x