
CUDA error in backward.

Open HatsuneMiku888 opened this issue 2 years ago • 14 comments

Hi, I experienced RuntimeError: an illegal memory access was encountered when training 3D Gaussians on the T&T dataset. It seems to happen during backpropagation. Here is the input of the backward function.

The error disappeared when I commented out https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L503. I have no idea why this line would cause an illegal memory access.

HatsuneMiku888 avatar Aug 09 '23 10:08 HatsuneMiku888

Hi,

commenting that line as you did will significantly change the math of the gradient computation and should give you very bad results. We are currently at Siggraph, but when we get back we will see what we can find from the .dump you shared.

Snosixtyboo avatar Aug 09 '23 21:08 Snosixtyboo

Thanks for your reply! I know commenting out that line can't be a final solution; it was just to locate where things go wrong. What I mean is that the backpropagation passes successfully on the same input without that line.

HatsuneMiku888 avatar Aug 10 '23 01:08 HatsuneMiku888

I have the same problem. There are three open issues about this invalid memory access now, and none of them has a working solution... Could someone help? Thanks!

ray8828 avatar Aug 23 '23 03:08 ray8828

I have the same problem. There are three open issues about this invalid memory access now, and none of them has a working solution... Could someone help? Thanks!

Hi @ray8828, if you have that issue, can you post your hardware setup and the .dump from when the crash occurred? Creating the dump file requires running with --debug.

Snosixtyboo avatar Aug 23 '23 08:08 Snosixtyboo
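
For context, the dump that --debug produces comes from a dump-on-failure pattern along these lines — a minimal sketch with hypothetical names, not the exact wrapper code from diff_gaussian_rasterization:

```python
import torch

def call_with_dump(fn, args, dump_path="snapshot_fw.dump"):
    # Keep a CPU copy of the inputs *before* the CUDA call, while they are
    # still intact, so they can be written out if the call fails.
    cpu_args = tuple(a.detach().cpu() if isinstance(a, torch.Tensor) else a
                     for a in args)
    try:
        return fn(*args)
    except Exception:
        # On a CUDA failure, save the inputs so the crash can be replayed
        # and the dump shared for debugging.
        torch.save(cpu_args, dump_path)
        print(f"Rasterizer call failed; inputs saved to {dump_path}")
        raise
```

When --debug is active, the rasterizer's Python wrapper does something of this shape around its calls into the compiled extension (the exact details in the repository may differ).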

@HatsuneMiku888 I finally had the time to look at your output. It seems that you are using both Python-computed covariance matrices and colors (--convert_SHs_python and --convert_cov3D_python are active); is there any particular reason for this? We left those paths in for compatibility and experimentation, but they are not heavily tested.

Snosixtyboo avatar Aug 23 '23 16:08 Snosixtyboo

@HatsuneMiku888 I found the line that causes the crash. Unfortunately, I have no explanation:

[screenshot]

For some reason, a point ID that is way too high gets into the list of points to render. Unfortunately, I don't know how I could debug this without extensive access to the machine it happens on. We could set this up, but it will take a while before I have time to do this. From the dump alone I have no idea how this could occur. Is it reproducible? Does it also happen when the two options I mentioned above are turned off?

Last but not least, and also for @ray8828: another user has set up a Colab that seems to successfully run the code base on T&T. This could hopefully reduce issues with local project setups, so maybe this will work out for you: https://github.com/camenduru/gaussian-splatting-colab

Snosixtyboo avatar Aug 23 '23 16:08 Snosixtyboo

I have met the same problem; after commenting out the line mentioned above, the code works well. (https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L503)

rgxie avatar Aug 29 '23 03:08 rgxie

I have met the same problem; after commenting out the line mentioned above, the code works well. (https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L503)

Hi,

please note that this is not a fix; it will completely break the math behind the approach. If you continue to have issues running it, please consider using the Colab linked on the main page.

Snosixtyboo avatar Aug 29 '23 07:08 Snosixtyboo

@HatsuneMiku888 I found the line that causes the crash. Unfortunately, I have no explanation:

[screenshot]

For some reason, a point ID that is way too high gets into the list of points to render. Unfortunately, I don't know how I could debug this without extensive access to the machine it happens on. We could set this up, but it will take a while before I have time to do this. From the dump alone I have no idea how this could occur. Is it reproducible? Does it also happen when the two options I mentioned above are turned off?

Last but not least, and also for @ray8828: another user has set up a Colab that seems to successfully run the code base on T&T. This could hopefully reduce issues with local project setups, so maybe this will work out for you: https://github.com/camenduru/gaussian-splatting-colab

1073280485 is very close to 2^30; maybe there is some numeric overflow?

HatsuneMiku888 avatar Aug 29 '23 12:08 HatsuneMiku888
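
A quick arithmetic check of that observation — 2^30 is 1,073,741,824, so the offending ID really does sit only a few hundred thousand below it:

```python
suspicious_id = 1073280485
print(2**30)                  # 1073741824
print(2**30 - suspicious_id)  # 461339
```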

@HatsuneMiku888 how good is your Python? Could you force it to create the snapshot_fw.dump of the forward pass (even though it doesn't fail) for the frame where the backward fails, and forward it to us?

Snosixtyboo avatar Aug 29 '23 12:08 Snosixtyboo
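
One way to force that dump, sketched here as a hypothetical helper rather than a change to the actual wrapper: save a CPU copy of the forward inputs on every call, overwriting the previous frame's file, so that whatever is on disk when the backward later crashes is the forward input of that same frame.

```python
import torch

def save_forward_inputs(args, dump_path="snapshot_fw.dump"):
    # Unconditionally snapshot the rasterizer's forward inputs for this frame
    # (as CPU copies), overwriting the previous frame's file. Call this right
    # before the forward call into the compiled extension.
    cpu_args = tuple(a.detach().cpu() if isinstance(a, torch.Tensor) else a
                     for a in args)
    torch.save(cpu_args, dump_path)
```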

I have met the same problem; after commenting out the line mentioned above, the code works well. (https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/main/cuda_rasterizer/backward.cu#L503)

Hi,

please note that this is not a fix; it will completely break the math behind the approach. If you continue to have issues running it, please consider using the Colab linked on the main page.

Thank you for your reply. I know that is not a fix; I am trying to locate the bug. The error occurs at different iterations when I use different data.

rgxie avatar Aug 29 '23 12:08 rgxie

@HatsuneMiku888 how good is your Python? Could you force it to create the snapshot_fw.dump of the forward pass (even though it doesn't fail) for the frame where the backward fails, and forward it to us?

Sure, I will attempt to reproduce this error on the machine where it occurred.

By the way, I now have a new problem. I hit the same illegal memory access error during the forward pass while training on another dataset. But the error miraculously disappeared when I executed _C.rasterize_gaussians with the arguments from snapshot_fw.dump in a separate script.

HatsuneMiku888 avatar Aug 29 '23 13:08 HatsuneMiku888
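
For reference, a replay script of the kind described above might look roughly like this — assuming the dump is the tuple of CPU-copied forward arguments that the wrapper saves, which has to be moved back to the GPU before calling the extension:

```python
import torch
from diff_gaussian_rasterization import _C  # compiled CUDA extension

# Load the saved forward arguments and re-run the forward pass in isolation.
args = torch.load("snapshot_fw.dump")
args = tuple(a.cuda() if isinstance(a, torch.Tensor) else a for a in args)
outputs = _C.rasterize_gaussians(*args)
```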

@HatsuneMiku888 I found the line that causes the crash. Unfortunately, I have no explanation:

[screenshot]

For some reason, a point ID that is way too high gets into the list of points to render. Unfortunately, I don't know how I could debug this without extensive access to the machine it happens on. We could set this up, but it will take a while before I have time to do this. From the dump alone I have no idea how this could occur. Is it reproducible? Does it also happen when the two options I mentioned above are turned off?

Last but not least, and also for @ray8828: another user has set up a Colab that seems to successfully run the code base on T&T. This could hopefully reduce issues with local project setups, so maybe this will work out for you: https://github.com/camenduru/gaussian-splatting-colab

Hello, I have the same error, and I want to know how to debug the CUDA code in gaussian-splatting. I only know how to debug the Python files.

wuchen133 avatar Apr 29 '24 12:04 wuchen133
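
Not specific to this repository, but two generic first steps for localizing an illegal memory access in a custom CUDA extension are making kernel launches synchronous and running the process under NVIDIA's compute-sanitizer; a minimal sketch (train.py and --debug as used elsewhere in this thread, dataset path elided):

```python
import os

# 1. Make CUDA launches synchronous so the Python traceback points at the
#    kernel launch that actually faulted. Must be set before CUDA is
#    initialized, i.e. before the first CUDA call from torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the environment variable on purpose

# 2. Outside Python, wrapping the whole run in compute-sanitizer (the
#    successor of cuda-memcheck) reports the exact kernel, thread, and
#    address of the out-of-bounds access, e.g.:
#    compute-sanitizer python train.py -s <dataset> --debug
```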

@HatsuneMiku888 I found the line that causes the crash. Unfortunately, I have no explanation:

[screenshot]

For some reason, a point ID that is way too high gets into the list of points to render. Unfortunately, I don't know how I could debug this without extensive access to the machine it happens on. We could set this up, but it will take a while before I have time to do this. From the dump alone I have no idea how this could occur. Is it reproducible? Does it also happen when the two options I mentioned above are turned off?

Last but not least, and also for @ray8828: another user has set up a Colab that seems to successfully run the code base on T&T. This could hopefully reduce issues with local project setups, so maybe this will work out for you: https://github.com/camenduru/gaussian-splatting-colab

Hello, I have the same error, and I want to know how to debug the CUDA code in gaussian-splatting. I only know how to debug the Python files.

Do you know the result? Thank you.

smart4654154 avatar Aug 20 '24 03:08 smart4654154