NaN when using MeshRasterizer

Open pengsongyou opened this issue 4 years ago • 32 comments

Description

I installed the latest pytorch3d 0.4 and tried to run the fit_textured_mesh tutorial under the Mesh prediction via silhouette rendering section. The loss becomes NaN after around 200 iterations (4 out of 5 times I can reproduce this issue).

I also tried pytorch3d 0.3 (built from source in December), and this issue never happened. Therefore, there might be some issues in the latest update for Mesh Rasterizer.

Reproduce

Install pytorch 1.7.1

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

Install pytorch3d following the "wheels for Linux" instructions

pip install pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/py38_cu102_pyt171/download.html

Then simply run the fit_textured_mesh tutorial and you should be able to reproduce the result. I get NaN 4 out of 5 times I run it.

Best, Songyou

pengsongyou avatar Feb 12 '21 15:02 pengsongyou

Thanks @pengsongyou for reporting this issue! We'll look into it asap.

nikhilaravi avatar Feb 12 '21 18:02 nikhilaravi

@pengsongyou I was able to reproduce the error. To resolve the issue in the tutorial add perspective_correct=False in the RasterizationSettings for the rasterizer. In v0.4 we changed this to be automatically inferred from the camera type but there seems to be some instability due to this. We will debug what is happening!
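
For readers applying this workaround, a minimal sketch of where the flag goes is below. The numeric values are placeholders, not the tutorial's own settings; only `perspective_correct=False` is the point. If a script builds more than one `RasterizationSettings`, each instance presumably needs the flag.

```python
from pytorch3d.renderer import RasterizationSettings

# Placeholder values (assumptions): keep whatever image_size, blur_radius and
# faces_per_pixel your script already uses.
raster_settings = RasterizationSettings(
    image_size=128,
    blur_radius=0.0,
    faces_per_pixel=1,
    perspective_correct=False,  # override the value inferred from the camera type
)
```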

nikhilaravi avatar Feb 12 '21 19:02 nikhilaravi

Great, it does seem to be working now, thanks a lot! I have always used the perspective camera model, but I did not need to set perspective_correct=False with 0.3 because no issue showed up. Could you explain why we need to set it explicitly to False in 0.4?

Thanks so much in advance!

Best, Songyou

pengsongyou avatar Feb 12 '21 19:02 pengsongyou

@pengsongyou the perspective_correct setting ensures that the barycentric coordinates are correct under a perspective camera. Other differentiable renderers like SoftRas/NMR/DIB-R do not apply this correction and assume that the perspective effects are small. In the previous version of PyTorch3D this was an optional setting, but in the most recent release we decided to set it based on the type of the camera. We will investigate why this is causing NaNs in the optimization.
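
To make the correction concrete, here is a plain-Python sketch of the textbook perspective-correct interpolation formula, where each screen-space weight is rescaled by the inverse depth of its vertex. It is an illustration only, not PyTorch3D's rasterizer kernel; the function name and the assumption of strictly positive depths are mine.

```python
def perspective_correct_bary(bary, depths):
    """Rescale screen-space barycentrics (b0, b1, b2) using the camera-space
    vertex depths (z0, z1, z2), which are assumed to be strictly positive."""
    b0, b1, b2 = bary
    z0, z1, z2 = depths
    # Equivalent to b_i / z_i, with numerator and denominator multiplied by z0*z1*z2
    w0 = b0 * z1 * z2
    w1 = z0 * b1 * z2
    w2 = z0 * z1 * b2
    denom = w0 + w1 + w2  # if this approaches zero, the division becomes unstable
    return (w0 / denom, w1 / denom, w2 / denom)


# A point halfway along an edge in screen space, with one endpoint three times farther away
corrected = perspective_correct_bary((0.5, 0.5, 0.0), (1.0, 3.0, 1.0))
```

When all three depths are equal, the weights come back unchanged, which is why renderers that assume small perspective effects can skip this step.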

nikhilaravi avatar Feb 19 '21 20:02 nikhilaravi

Hi, I have encountered a similar NaN error in the rasterizer :/. I just want to provide another example that might help the team debug. For now, perspective_correct=False or an orthographic camera solves this particular case (thanks Nikhila and Georgia).

NaN seems to happen when a rendered face is parallel to the ray (maybe related to the previous issue #110). I have provided the triangle that caused NaN fragments in the file triangle.pkl, together with my script:

    import pickle

    import numpy as np
    from pytorch3d.renderer import (
        BlendParams,
        MeshRasterizer,
        PerspectiveCameras,
        RasterizationSettings,
    )

    fname = 'triangle.pkl'
    device = 'cuda:0'
    # Load the single-triangle mesh that triggers the NaN
    with open(fname, 'rb') as fp:
        obj = pickle.load(fp)
    triangle = obj['tri'].to(device)

    cameras = PerspectiveCameras(focal_length=100., device=device)
    blend_params = BlendParams(sigma=1e-4, gamma=1e-4)
    dist_eps = 1e-6
    raster_settings = RasterizationSettings(
        image_size=224,
        blur_radius=np.log(1. / dist_eps - 1.) * blend_params.sigma,
        faces_per_pixel=100,
        # perspective_correct=False,  # this seems to solve the NaN error, at least for this case
    )
    rasterizer = MeshRasterizer(cameras=cameras, raster_settings=raster_settings).to(device)
    fragments = rasterizer(triangle)
    # Check the rasterizer outputs for NaNs
    print(fragments.zbuf.isnan().any(), fragments.bary_coords.isnan().any())
    # True, True for me
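
One way to see why an edge-on face is numerically delicate: its projection collapses to a line, so the screen-space triangle has (near-)zero area, and barycentric coordinates divide by that area. The plain-Python sketch below illustrates this with the usual edge-function formulation. It is not PyTorch3D's kernel, and the function names are made up for the illustration.

```python
import math


def edge_function(a, b, p):
    """Twice the signed area of the 2D triangle (a, b, p), up to orientation."""
    return (p[0] - a[0]) * (b[1] - a[1]) - (p[1] - a[1]) * (b[0] - a[0])


def barycentric_2d(p, v0, v1, v2):
    """Barycentric coordinates of point p with respect to triangle (v0, v1, v2)."""
    area = edge_function(v0, v1, v2)
    if area == 0.0:
        # Plain Python would raise ZeroDivisionError here; float tensors
        # would instead silently produce 0/0 = NaN.
        return (math.nan, math.nan, math.nan)
    w0 = edge_function(v1, v2, p) / area
    w1 = edge_function(v2, v0, p) / area
    w2 = edge_function(v0, v1, p) / area
    return (w0, w1, w2)


# A well-formed triangle versus one whose projected vertices are collinear
regular = barycentric_2d((0.25, 0.25), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
edge_on = barycentric_2d((1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0))
```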

The triangle looks like this in 3D (image: 3d) and like this in screen space (image: 2d). Visualization code:

import pickle

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits import mplot3d  # needed on older matplotlib to enable projection='3d'

fname = '/tmp/transfer/vis/triangle.pkl'
with open(fname, 'rb') as fp:
    triangle = pickle.load(fp)
verts = triangle['verts']
verts2d = triangle['verts_screen']

def refract_verts(verts):
    # Append the first vertex again so the plotted outline closes into a triangle
    return np.vstack([verts, verts[0:1]])

verts = refract_verts(verts)
verts2d = refract_verts(verts2d)

# 3D view of the triangle
fig = plt.figure()
ax = plt.axes(projection='3d')

ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')

ax.plot3D(verts[:, 0], verts[:, 1], verts[:, 2], 'gray')

# 2D view in screen space
fig = plt.figure()
plt.plot(verts2d[:, 0], verts2d[:, 1])
plt.show()

Thanks and good luck.

JudyYe avatar Apr 18 '21 21:04 JudyYe

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jun 23 '21 05:06 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Jun 29 '21 05:06 github-actions[bot]

I will look into this issue! Thanks for the explanation @JudyYe.

nikhilaravi avatar Jun 29 '21 18:06 nikhilaravi

~Hey - I am experiencing the same issue (nans after about 200 iteration steps). perspective_correct=False doesn't seem to help though :(~

EDIT: I didn't notice there were multiple RasterizationSettings instances. Works now!

$ pip list | grep torch
pytorch3d                         0.4.0
torch                             1.7.1+cu110

tals avatar Jul 23 '21 19:07 tals

I can confirm both that this bug still exists in 0.5.0, and that setting perspective_correct=False removes the issue. I ran my code with torch anomaly detection on, not sure if it's helpful. Here's the relevant portion of the anomaly detection output:

  File "/home/jim/Documents/python/pytorch3d/pytorch3d/renderer/mesh/renderer.py", line 59, in forward
    fragments = self.rasterizer(meshes_world, **kwargs)
  File "/home/jim/anaconda3/envs/armo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jim/Documents/python/pytorch3d/pytorch3d/renderer/mesh/rasterizer.py", line 171, in forward
    pix_to_face, zbuf, bary_coords, dists = rasterize_meshes(
  File "/home/jim/Documents/python/pytorch3d/pytorch3d/renderer/mesh/rasterize_meshes.py", line 231, in rasterize_meshes
    pix_to_face, zbuf, barycentric_coords, dists = _RasterizeFaceVerts.apply(
 (function _print_stack)

And the error message:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_89528/4272499749.py in <module>
     80     optimizer.zero_grad()
---> 81     loss.backward()
     82     optimizer.step()

~/anaconda3/envs/armo/lib/python3.8/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

~/anaconda3/envs/armo/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    128         retain_graph = create_graph
    129 
--> 130     Variable._execution_engine.run_backward(
    131         tensors, grad_tensors_, retain_graph, create_graph,
    132         allow_unreachable=True)  # allow_unreachable flag

RuntimeError: Function '_RasterizeFaceVertsBackward' returned nan values in its 0th output.
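
For anyone who wants to produce a report like this on their own code, a minimal sketch of turning on anomaly detection follows. The toy `sqrt` computation is a stand-in chosen only because its backward pass yields 0/0; it has nothing to do with the rasterizer itself.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

# Inside this context, autograd checks every backward function's output for NaN
with torch.autograd.detect_anomaly():
    y = (torch.sqrt(x) * 0.0).sum()
    try:
        y.backward()  # sqrt's backward computes 0 / (2 * sqrt(0)) = NaN
        message = None
    except RuntimeError as err:
        message = str(err)

print(message)
```

Anomaly detection slows training down noticeably, so it is usually enabled only while hunting for the source of a NaN.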

jbohnslav avatar Aug 19 '21 18:08 jbohnslav

Hi, same here: perspective_correct=False fixes the issue. In my setup I also find that, even with perspective_correct=True, I no longer get NaNs if the mesh and renderer are on the CPU.

Additionally, I was also wondering if having FoVPerspectiveCameras camera + perspective_correct=False for the rasterization setting is equivalent to having a weak perspective camera?

dukleryoni avatar Sep 01 '21 23:09 dukleryoni

I can confirm that this bug is present in v0.5.0 using the out of the box tutorial "camera_position_optimization_with_differentiable_rendering". perspective_correct=False fixes the issue.

rubenverhack avatar Sep 07 '21 08:09 rubenverhack

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Nov 14 '21 05:11 github-actions[bot]

pytorch                   1.9.1           py3.9_cuda11.1_cudnn8.0.5_0    pytorch
pytorch3d                 0.6.0                    pypi_0    pypi
RasterizationSettings(image_size=(h, w),
                      blur_radius=0,
                      faces_per_pixel=1,
                      perspective_correct=False)
# with
PerspectiveCameras()

The issue is still there. I check for it with:

(render_images != render_images).sum()
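
The `x != x` idiom works because NaN is the only floating-point value that compares unequal to itself. A plain-Python sketch of the same check, with made-up pixel values:

```python
import math


def count_nans(values):
    """Count NaNs using the self-inequality trick."""
    return sum(1 for v in values if v != v)


# Hypothetical pixel values, two of them NaN
pixels = [0.1, float("nan"), 0.7, float("nan")]
nan_count = count_nans(pixels)
```

On tensors, `torch.isnan(render_images).sum()` expresses the same check more directly.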

Matt-V50 avatar Dec 07 '21 12:12 Matt-V50

@FavorMylikes Yes indeed! The PR was merged after the latest release, 0.6.0.

bottler avatar Dec 07 '21 13:12 bottler

I have installed PyTorch3D with the following command: pip3 install "git+https://github.com/facebookresearch/pytorch3d.git" and I still get NaN when running the tutorial fit_textured_mesh.py

simon-cross avatar Dec 13 '21 16:12 simon-cross

@simon-cross Do you still have the output of that command? It's quite likely something's gone wrong and you've ended up with the current release not the latest code. Maybe best to have this conversation on a new issue. A new release 0.6.1 is imminent which should include the fix btw.

bottler avatar Dec 13 '21 18:12 bottler

I have created the following issue related to the NaN problem: issue 991

simon-cross avatar Dec 13 '21 19:12 simon-cross

Hello, the NaN problem still exists. In my case, I use RasterizationSettings as follows:

raster_settings = RasterizationSettings(
    image_size=(self.img_h, self.img_w),
    blur_radius=0,
    faces_per_pixel=1,
    perspective_correct=False,
)

My environment:

# Name                    Version                   Build  Channel
pytorch                   1.8.1           py3.7_cuda11.1_cudnn8.0.5_0    pytorch
pytorch3d                 0.6.1                     dev_0    <develop>

TimmmYang avatar Mar 31 '22 09:03 TimmmYang

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar May 07 '22 05:05 github-actions[bot]

I'm also running into periodic NaNs w/ the mesh rasterizer. Seems to occur with the HardDepthShader in https://github.com/facebookresearch/pytorch3d/pull/1208 which is about as simple as you can get shading wise

d4l3k avatar Jun 20 '22 06:06 d4l3k

I turned on anomaly detection and traced those NaNs back to transform_points denom correction in

https://github.com/facebookresearch/pytorch3d/blob/main/pytorch3d/renderer/mesh/rasterizer.py#L193-L196

It might be a good idea to change eps so that it is not None by default. It seems to be set in a lot of places, so None seems like a bad default given the potential bad behavior. I set it to eps=1e-8 and that seems to have solved it. Implicitron appears to set it to 1e-2, which seems very large:

https://github.com/facebookresearch/pytorch3d/blob/7978ffd1e4819d24803b01a1147a2c33ad97c142/pytorch3d/implicitron/tools/point_cloud_utils.py#L73
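
As a rough illustration of what an `eps` guard on the homogeneous divide does, here is a plain-Python sketch. It is not the library's implementation; the helper names are hypothetical and treating a zero denominator as positive is an assumption of the sketch.

```python
def clamp_denominator(denom, eps):
    """Keep the sign of denom but force its magnitude to be at least eps."""
    sign = -1.0 if denom < 0 else 1.0  # assumption: zero is treated as positive
    return sign * max(abs(denom), eps)


def homogeneous_divide(x, y, z, w, eps=None):
    """Divide (x, y, z) by w, optionally clamping w away from zero first."""
    if eps is not None:
        w = clamp_denominator(w, eps)
    # With eps=None and w == 0 plain Python raises ZeroDivisionError;
    # float tensors would produce inf or NaN instead.
    return (x / w, y / w, z / w)


guarded = homogeneous_divide(1.0, 2.0, 3.0, 0.0, eps=1e-8)
```

The trade-off is visible in the result: a tiny `eps` keeps the output finite but very large, while a large `eps` such as 1e-2 changes the projection of any point whose denominator is smaller than that.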

d4l3k avatar Jun 22 '22 07:06 d4l3k
