
About randomness in kaolin DIBR

Open lai-pf opened this issue 2 years ago • 17 comments

Is there any source of randomness in kaolin rendering? I am following a paper that uses kaolin as its renderer, and although I have fixed all the seeds, I found that the gradients are not the same across runs. It seems that differing gradients in the backward pass lead to different optimization results. Does this randomness come from the kaolin renderer?

In the optimization process, a little randomness in iteration one can make the optimization results really different. We use cuda as the rast-backend in the DIBR function.

I found the issue about randomness in nvdiffrast (https://github.com/NVlabs/nvdiffrast/issues/13#issuecomment-767484493), but my code uses the cuda rast-backend, not nvdiffrast. So I just want to know: does this randomness come from kaolin, and is it inevitable? Or does it come from my own code? Thanks very much; I would appreciate it if you could answer my question, it would be really helpful.
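For reference, a typical "fix all the seeds" setup in PyTorch looks roughly like this (a minimal sketch; note that it does not remove non-determinism that lives inside custom CUDA kernels):

import random

import numpy as np
import torch

def seed_everything(seed=0):
    # Seed every RNG the pipeline may touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic algorithms (may be slower).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False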

lai-pf avatar Oct 23 '22 14:10 lai-pf

Hi, can someone help me? I've checked my code again, and I still find randomness in the gradient of my network's last layer. I think it is kaolin that leads to this non-determinism. I don't know where I'm wrong; asking for help again.

lai-pf avatar Oct 25 '22 15:10 lai-pf

Hi @lai-pf , to my knowledge there is no source of non-determinism in Kaolin's rendering; you can easily check by doing something like this:

import torch

# Clone the same leaf tensor twice, run the identical forward/backward
# on each, and compare the input gradients for bitwise equality.
input1 = input1.detach().clone()
input1.requires_grad = True
input2 = input1.detach().clone()
input2.requires_grad = True
output1 = function_to_test(input1)
grad_output = torch.rand_like(output1)
output1.backward(grad_output)
output2 = function_to_test(input2)
output2.backward(grad_output)
print(torch.equal(input1.grad, input2.grad))
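Wrapped into a reusable helper, the same check might look like this (a sketch; `function_to_test` stands for whatever rendering call you want to probe, e.g. a closure around the DIBR call):

import torch

def backward_is_deterministic(function_to_test, reference_input):
    # Run the identical forward/backward twice on cloned leaf tensors
    # and check the input gradients for bitwise equality.
    input1 = reference_input.detach().clone().requires_grad_(True)
    input2 = reference_input.detach().clone().requires_grad_(True)
    output1 = function_to_test(input1)
    grad_output = torch.rand_like(output1)
    output1.backward(grad_output)
    output2 = function_to_test(input2)
    output2.backward(grad_output)
    return torch.equal(input1.grad, input2.grad)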

Caenorst avatar Oct 25 '22 15:10 Caenorst

Hi Caenorst, thanks for your help, it really helped me a lot. I followed the test demo that you suggested: the gradient is the same when I use a single mesh, but when I use DIBR to render two meshes in one scene, I find the randomness. It seems that when we concatenate the faces and vertices of two meshes into one mesh, something goes wrong in computing the weights of the vertices that compose a pixel. Have you seen this problem before? To reproduce the problem, you can use DIBR and concatenate two meshes (mesh_A, mesh_B into scene_Mesh) like this:

vertices_num_A = mesh_A.vertices.shape[0]
mesh_B.faces[:, :] += vertices_num_A
tmp_vertices = concat([mesh_A.vertices, mesh_B.vertices])
tmp_faces = concat([mesh_A.faces, mesh_B.faces])
scene_Mesh.vertices = tmp_vertices
scene_Mesh.faces = tmp_faces

Then simply build a network with only one mlp layer. Through this layer's gradient, with the test demo from https://github.com/NVIDIAGameWorks/kaolin/issues/638#issuecomment-1290792293, the problem can be reproduced. If there is any resolution that can solve this problem, please let me know. Thanks again. My work has been stopped here for a long time, and your help makes me feel hopeful again.

lai-pf avatar Oct 26 '22 14:10 lai-pf

Hi @lai-pf , you need to add an offset to mesh_B.faces:

tmp_faces = concat([mesh_A.faces, mesh_B.faces + mesh_A.vertices.shape[0]])
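For illustration, a runnable version of that concatenation with torch.cat (a sketch assuming `mesh_A` and `mesh_B` expose `vertices` and `faces` as torch tensors):

import torch

# Faces of mesh_B must be offset by the number of vertices in mesh_A
# so that they index into the concatenated vertex buffer.
scene_vertices = torch.cat([mesh_A.vertices, mesh_B.vertices], dim=0)
scene_faces = torch.cat(
    [mesh_A.faces, mesh_B.faces + mesh_A.vertices.shape[0]], dim=0)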

Caenorst avatar Oct 26 '22 15:10 Caenorst

Do you have this gradient difference if you just render a single mesh? (mesh_A?)

Caenorst avatar Oct 26 '22 15:10 Caenorst

I used the test demo that you suggested in the reply https://github.com/NVIDIAGameWorks/kaolin/issues/638#issuecomment-1290792293; in single-mesh rendering I haven't seen a difference in the gradients, it seems right.

lai-pf avatar Oct 26 '22 16:10 lai-pf

Also in mesh_B?

Caenorst avatar Oct 26 '22 16:10 Caenorst

I didn't try B. In kaolin's DIBR, should the mesh be watertight?

lai-pf avatar Oct 26 '22 16:10 lai-pf

Rasterization should work with non-watertight meshes.

Caenorst avatar Oct 26 '22 16:10 Caenorst

Hi Caenorst, I've tried the single mesh_B (a hat); the single hat mesh has differing gradients. And when I use the single mesh_A, the gradient is right.

lai-pf avatar Oct 26 '22 16:10 lai-pf

So the single mesh_B leads to non-deterministic gradients? Can you share the model?

Caenorst avatar Oct 26 '22 16:10 Caenorst

Can you send me an email? My gmail is [email protected]; I can send the mesh with the code through email.

lai-pf avatar Oct 26 '22 16:10 lai-pf

There is indeed a source of non-determinism, which is probably coming from the atomicAdd here: https://github.com/NVIDIAGameWorks/kaolin/blob/master/kaolin/csrc/render/mesh/rasterization_cuda.cu#L391-L398

The differences in values are on the order of 1e-6, which should be negligible; I would argue that if that leads to failure vs. success in an optimization pipeline, then probably something else is wrong.
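The reason even tiny reorderings matter is that floating-point addition is not associative, so the scheduling-dependent order of the atomicAdds changes the low bits of the accumulated gradient. A minimal demonstration:

a, b, c = 0.1, 0.2, 0.3
# Floating-point addition is not associative: different groupings
# (like differently ordered atomicAdds) give slightly different sums.
print((a + b) + c)  # 0.6000000000000001
print(a + (b + c))  # 0.6
print((a + b) + c == a + (b + c))  # False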

Unfortunately, making an efficient deterministic version is not straightforward. One thing you can do (though it would strongly affect the speed of the kernel) is the following:

  1. change the number of blocks in the kernel launch to 1
  2. put the atomicAdd in a for loop as follows:
for (int i = 0; i < blockDim.x; i++) {
    // Give each thread its own turn: only thread i accumulates in
    // this iteration, so the summation order is fixed across runs.
    if (threadIdx.x == i) {
        atomicAdd(grad_face_vertices_image + start_image_idx + 0, dldI * dIdax);
        atomicAdd(grad_face_vertices_image + start_image_idx + 1, dldI * dIday);

        atomicAdd(grad_face_vertices_image + start_image_idx + 2, dldI * dIdbx);
        atomicAdd(grad_face_vertices_image + start_image_idx + 3, dldI * dIdby);

        atomicAdd(grad_face_vertices_image + start_image_idx + 4, dldI * dIdcx);
        atomicAdd(grad_face_vertices_image + start_image_idx + 5, dldI * dIdcy);
    }
    __syncthreads();  // keep the whole block in lockstep between turns
}

Caenorst avatar Oct 26 '22 17:10 Caenorst

You want to do the same thing here.

That has resolved most of the non-determinism (for some reason I still see some rare occurrences, but I can't pin down where they are coming from now).

Caenorst avatar Oct 26 '22 17:10 Caenorst


I followed the suggestion and made the three changes (screenshots of change 1, change 2, and change 3 omitted), but the gradient in the case of the hat mesh is still different, as before. Is there any change I forgot? In my task, running speed is not important to me, but reproducibility matters, so is there any possible solution to make DIBR stable and deterministic? Or is there a CPU version of kaolin's DIBR that can guarantee the same result each time?

lai-pf avatar Oct 27 '22 04:10 lai-pf

And I found something strange: when I use other meshes (person.obj and shoe.obj) as a scene to render, the gradients are the same. It seems like only the hat mesh is special? Is the hat different in some way? In my tests today, I found that some meshes make the gradients differ and some don't; I think those meshes belong to one special class. Do you know anything about this?

lai-pf avatar Oct 27 '22 04:10 lai-pf

I found another source of non-determinism; replace kaolin.ops.mesh.index_vertices_by_faces with the following:

def index_vertices_by_faces(vertices_features, faces):
    r"""Index vertex features to convert per vertex tensor to per vertex per face tensor.
    Args:
        vertices_features (torch.FloatTensor):
            vertices features, of shape
            :math:`(\text{batch_size}, \text{num_points}, \text{knum})`,
            ``knum`` is feature dimension, the features could be xyz position,
            rgb color, or even neural network features.
        faces (torch.LongTensor):
            face index, of shape :math:`(\text{num_faces}, \text{num_vertices})`.
    Returns:
        (torch.FloatTensor):
            the face features, of shape
            :math:`(\text{batch_size}, \text{num_faces}, \text{num_vertices}, \text{knum})`.
    """
    assert vertices_features.ndim == 3, \
        "vertices_features must have 3 dimensions of shape (batch_size, num_points, knum)"
    assert faces.ndim == 2, "faces must have 2 dimensions of shape (num_faces, num_vertices)"
    return vertices_features[:, faces]
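If you prefer not to patch the kaolin sources directly, one option is to override the function at runtime before rendering (a sketch; it assumes the rest of the pipeline looks the function up through kaolin.ops.mesh rather than holding a direct reference):

import kaolin

# Replace the library implementation with the deterministic
# advanced-indexing version defined above.
kaolin.ops.mesh.index_vertices_by_faces = index_vertices_by_faces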

Caenorst avatar Oct 27 '22 17:10 Caenorst