Got cutlass error: Error Internal at: 346

Open hturki opened this issue 3 years ago • 12 comments

I've noticed that setting aabb_scale to a high value (8 or 16 on the fox scene, for example) seems to fail when using the CutlassMLP, which one is forced to use on GPUs such as the still-popular V100:

Got cutlass error: Error Internal at: 346 Could not free memory: /home/cloudlet/hturki/instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h:451 cudaDeviceSynchronize() failed with error operation not permitted when stream is capturing

hturki avatar Apr 08 '22 07:04 hturki

I got the same internal error 346 on my RTX 3080. I followed the linked comment with my correct GPU arch and it worked well: https://github.com/NVlabs/instant-ngp/issues/219#issuecomment-1055141789
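
In case it helps, here is a minimal sketch (assuming PyTorch is installed; not part of instant-ngp) to check which compute architecture your card needs before rebuilding:

```python
# Query the GPU's compute capability so you can rebuild instant-ngp /
# tiny-cuda-nn for the right architecture. Assumes PyTorch is available.
import torch

major, minor = torch.cuda.get_device_capability(0)
arch = major * 10 + minor  # e.g. 70 for V100, 75 for RTX 2080, 86 for RTX 3080
print(f"compute capability {major}.{minor} -> TCNN_CUDA_ARCHITECTURES={arch}")

# Then rebuild with that architecture hint, e.g.:
#   TCNN_CUDA_ARCHITECTURES=86 cmake . -B build
#   cmake --build build --config RelWithDebInfo -j
```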

ialhashim avatar Apr 09 '22 07:04 ialhashim

@ialhashim - did you get 346 or 363 as described in the ticket you linked to? I'm pretty sure that I'm compiling against 70 (the V100 arch). Also, things seem to work fine with low aabb_scale values such as 1, 2, or 4 - it's when you try 8 or 16 that the error happens.

hturki avatar Apr 09 '22 17:04 hturki

It was 346. Another weird thing for me was that on Windows no training was happening and I was getting a blurry mess. It seems compiling with the correct settings is crucial for newer cards.

ialhashim avatar Apr 09 '22 17:04 ialhashim

@hturki any luck in resolving the issue? I have experienced the same behavior (everything is fine with 1, 2, and 4 but fails with error 346 when set to 8 or 16). I'm working on a 1080 Ti.

lukszamarcin avatar Apr 20 '22 18:04 lukszamarcin

I still run into this issue unfortunately

hturki avatar Apr 20 '22 22:04 hturki

Yeah, me as well. I'm running it on an NVIDIA Titan Xp with 12 GB of memory; an aabb_scale of 4 takes up about 6 GB of VRAM, and an aabb_scale of 8 gives me the 346 error. I also see a warning about FullyFusedMLP not being used and some other backend being used instead - I don't know if that matters at all. Help!

jere357 avatar May 11 '22 16:05 jere357

Okay, I've found a solution to error 346 - it's MONEY. When I was using a Titan Xp (compute capability 6.1) I could only train with aabb_scale values of [1, 2, 4]. Then I switched to a machine with an RTX 2080 (compute capability 7.5) and I could train with aabb_scale values of [1, 2, 4, 8, 16]. Hope this helps someone in their troubles.

jere357 avatar May 16 '22 12:05 jere357

I have a different cause: when using an A100 with a large batch size (>32768), it yields this error too.
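
If the problem really is batch-size dependent, one workaround is to feed the network in smaller chunks. A minimal sketch using the tiny-cuda-nn PyTorch bindings (not instant-ngp itself; the sizes and dimensions are illustrative assumptions):

```python
# Illustrative only: run a CutlassMLP on chunks of at most 32768 samples
# instead of one huge batch. Dimensions and sizes are made up for the example.
import torch
import tinycudann as tcnn

network = tcnn.Network(
    n_input_dims=32,
    n_output_dims=16,
    network_config={
        "otype": "CutlassMLP",      # backend used when FullyFusedMLP is unavailable
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 128,
        "n_hidden_layers": 4,
    },
)

inputs = torch.rand(262144, 32, device="cuda")
chunk = 32768  # stay at or below the batch size that still worked
outputs = torch.cat(
    [network(inputs[i:i + chunk]) for i in range(0, inputs.shape[0], chunk)]
)
```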

kwea123 avatar Jul 05 '22 01:07 kwea123

Have you found any solution to it? I've hit the same issue. My card is a V100 (arch 70), and it also yields 'Got cutlass error: Error Internal at: 346' when the batch size is larger.

daili0015 avatar Aug 02 '22 01:08 daili0015

+1 Anyone have a solution for this yet?

msollami avatar Aug 12 '22 23:08 msollami

TL;DR Hey guys, I have circumvented this error by just setting aabb_scale=4 instead of 16 in the transforms.json file.


More explanation:

In my case, using a V100 GPU, I encountered the 346 error when I ran instant-ngp on my custom image set, whose camera parameters (stored in transforms.json) were computed with COLMAP. In contrast, I can run the fox demo without a problem. However, even the fox demo gets the 346 error when I compute the transforms.json myself.

So I compared my transforms.json with the provided one; the only significant difference is aabb_scale. My COLMAP-generated one has a value of 16 whereas the provided one has a value of 4.

So I just changed the value to 4 in the transforms.json of my other custom image sets, and they all work now.
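
If you want to apply the same change to other scenes quickly, here is a minimal sketch (the file path is an assumption; adjust it to your own data):

```python
# Lower aabb_scale in an existing transforms.json (e.g. one produced by COLMAP).
import json

path = "transforms.json"  # path to your scene's file
with open(path) as f:
    transforms = json.load(f)

transforms["aabb_scale"] = 4  # was 16 in the COLMAP-generated file
with open(path, "w") as f:
    json.dump(transforms, f, indent=2)
```

If I remember correctly, scripts/colmap2nerf.py also takes an --aabb_scale argument, so you can set it when generating transforms.json in the first place.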

QhelDIV avatar Sep 04 '22 13:09 QhelDIV

> TL;DR Hey guys, I have circumvented this error by just setting aabb_scale=4 instead of 16 in the transforms.json file.

Solved my problem! Thanks

ZXisSpider avatar Nov 03 '22 08:11 ZXisSpider