
RuntimeError: /tmp/pip-req-build-q_3eo1qg/include/tiny-cuda-nn/cutlass_matmul.h:332 status failed with error Error Internal

Open • jingweim opened this issue 2 years ago • 3 comments

Description and error stack trace

I'm working with nerfstudio, and one of its NeRF models, nerfacto, uses the FullyFusedMLP from tiny-cuda-nn. I'm training on an A100, and for my purposes I need to render a whole image at each training iteration. I noticed that whenever the batch size (number of rays) gets too large, RuntimeError: /tmp/pip-req-build-q_3eo1qg/include/tiny-cuda-nn/cutlass_matmul.h:332 status failed with error Error Internal shows up during loss.backward(), while training goes through just fine with a smaller batch size. I've ruled out OOM because memory usage was only 41037MiB / 81920MiB when the process quit. Are there parameters inside fully_fused_mlp.cu that could limit the training batch size?
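For reference, below is a stripped-down sketch of the kind of call that hits this error. It is not the actual nerfstudio code; the network config is an assumption based on the tensor sizes reported further down, using the standard tinycudann Python bindings.

import torch
import tinycudann as tcnn

# Hypothetical minimal repro: a FullyFusedMLP with the input/output widths seen
# in the logs below, fed one full 512x512 image (512*512 rays * 3 * 16 samples).
network = tcnn.Network(
    n_input_dims=63,
    n_output_dims=16,
    network_config={
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 2,
    },
)

x = torch.rand(512 * 512 * 3 * 16, 63, device="cuda", requires_grad=True)
y = network(x)              # forward pass completes
y.float().sum().backward()  # the cutlass_matmul.h:332 RuntimeError is reported during backward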

Below is the error stack trace (I tried setting CUDA_LAUNCH_BLOCKING=1, but nothing new showed up):

╭───────────────────────── Traceback (most recent call last) ─────────────────────────╮
│ /*********************************.py:450 in         │
│ train_one_epoch                                                                     │
│                                                                                     │
│   449 │   │   │   loss = self.rgb_loss(target_rgb, pred_rgb)                        │
│ ❱ 450 │   │   │   loss.backward()                                                   │
│   451                                                                               │
│                                                                                     │
│ /************************/miniconda3/envs/conda-env/lib/pyt │
│ hon3.8/site-packages/functorch/_src/monkey_patching.py:77 in _backward              │
│                                                                                     │
│   74 │   │   │   "backward() called inside a functorch transform. This is not "     │
│   75 │   │   │   "supported, please use functorch.grad or functorch.vjp instead "   │
│   76 │   │   │   "or call backward() outside of functorch transforms.")             │
│ ❱ 77 │   return _old_backward(*args, **kwargs)                                      │
│   78                                                                                │
│   79                                                                                │
│   80 torch.Tensor.backward = _backward                                              │
│                                                                                     │
│ /************************/miniconda3/envs/conda-env/lib/pyt │
│ hon3.8/site-packages/torch/_tensor.py:396 in backward                               │
│                                                                                     │
│    393 │   │   │   │   retain_graph=retain_graph,                                   │
│    394 │   │   │   │   create_graph=create_graph,                                   │
│    395 │   │   │   │   inputs=inputs)                                               │
│ ❱  396 │   │   torch.autograd.backward(self, gradient, retain_graph, create_graph,  │
│    397 │                                                                            │
│    398 │   def register_hook(self, hook):                                           │
│    399 │   │   r"""Registers a backward hook.                                       │
│                                                                                     │
│ /************************/miniconda3/envs/conda-env/lib/pyt │
│ hon3.8/site-packages/torch/autograd/__init__.py:173 in backward                     │
│                                                                                     │
│   170 │   # The reason we repeat same the comment below is that                     │
│   171 │   # some Python versions print out the first line of a multi-line function  │
│   172 │   # calls in the traceback and some print out the last line                 │
│ ❱ 173 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to  │
│   174 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,           │
│   175 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ e │
│   176                                                                               │
│                                                                                     │
│ /************************/miniconda3/envs/conda-env/lib/pyt │
│ hon3.8/site-packages/torch/autograd/function.py:253 in apply                        │
│                                                                                     │
│   250 │   │   │   │   │   │   │      "Function is not allowed. You should only impl │
│   251 │   │   │   │   │   │   │      "of them.")                                    │
│   252 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn       │
│ ❱ 253 │   │   return user_fn(self, *args)                                           │
│   254 │                                                                             │
│   255 │   def apply_jvp(self, *args):                                               │
│   256 │   │   # _forward_cls is defined by derived class                            │
│                                                                                     │
│ /************************/miniconda3/envs/conda-env/lib/pyt │
│ hon3.8/site-packages/tinycudann/modules.py:84 in backward                           │
│                                                                                     │
│    81 │   │   │   doutput = doutput.cuda()                                          │
│    82 │   │                                                                         │
│    83 │   │   input, params, output = ctx.saved_tensors                             │
│ ❱  84 │   │   input_grad, params_grad = _module_function_backward.apply(ctx, doutpu │
│    85 │   │                                                                         │
│                                                                                     │
│ /************************/miniconda3/envs/conda-env/lib/pyt │
│ hon3.8/site-packages/tinycudann/modules.py:95 in forward                            │
│                                                                                     │
│    92 │   │   ctx.save_for_backward(input, params, doutput)                         │
│    93 │   │   with torch.no_grad():                                                 │
│    94 │   │   │   scaled_grad = doutput * ctx_fwd.loss_scale                        │
│ ❱  95 │   │   │   input_grad, params_grad = ctx_fwd.native_tcnn_module.bwd(ctx_fwd. │
│    96 │   │   │   input_grad = null_tensor_like(input) if input_grad is None else ( │
│    97 │   │   │   params_grad = null_tensor_like(params) if params_grad is None els │
│    98 │   │   return input_grad, params_grad                                        │
╰─────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: /tmp/pip-req-build-0hvn9ip1/include/tiny-cuda-nn/cutlass_matmul.h:332 
status failed with error Error Internal

More details from breakpoints

Since the code fails at L95 in tinycudann/modules.py, I placed a breakpoint right before L95 and printed the sizes of input, params, and output. Here are the sizes at training resolution 512x512 (batch size 262144), where the above error shows up:

# first time reaches L95, 12582912=512x512x3x16
input.size(): torch.Size([12582912, 63])
params.size(): torch.Size([9216]) # field.mlp_head.params, last MLP module with 2 hidden layers, 9216=13*64+64*64+64*64 + 64*3
output.size(): torch.Size([12582912, 16])
gpu memory usage: 38483MiB / 81920MiB

# second time it reaches L95
input.size(): torch.Size([12582912, 3])
params.size(): torch.Size([0]) # not sure what this is
output.size(): torch.Size([12582912, 16])
gpu memory usage: 41507MiB / 81920MiB

# third time it reaches L95
input.size(): torch.Size([12582912, 3])
params.size(): torch.Size([12199312]) # The main MLP body
output.size(): torch.Size([12582912, 16])

# And then
RuntimeError: /tmp/pip-req-build-0hvn9ip1/include/tiny-cuda-nn/cutlass_matmul.h:332 status failed with error Error Internal

And then I saved the same stats for the largest resolution that works, 416x416 (batch size 173056):

# first time reaches L95, 8306688=416x416x3x16
input.size(): torch.Size([8306688, 63])
params.size(): torch.Size([9216]) # field.mlp_head.params, last MLP module with 2 hidden layers, 9216=13*64+64*64+64*64 + 64*3
output.size(): torch.Size([8306688, 16])
gpu memory usage: 29159MiB / 81920MiB

# second time it reaches L95
input.size(): torch.Size([8306688, 3])
params.size(): torch.Size([0]) # not sure what this is
output.size(): torch.Size([8306688, 16])
gpu memory usage: 31157MiB / 81920MiB

# third time it reaches L95
input.size(): torch.Size([8306688, 3])
params.size(): torch.Size([12199312]) # The main MLP body
output.size(): torch.Size([8306688, 16])
gpu memory usage: 31157MiB / 81920MiB (because the last backward's params were size 0)

# And then
==> Finished Step 1.

jingweim • Jan 09 '23 21:01

Hi, based on testing on my end, CUTLASS does not seem to handle problem sizes that exceed the 32-bit integer limit.

From your logs, the input matrix has size 8306688 * 63, which, when multiplied by the tensor core width (16), comes out to roughly 8 billion. I suspect this is where 32-bit index calculations start to go wrong.
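For concreteness, the arithmetic behind that figure (plain Python, not code from either library):

# Rows of the input matrix, times feature width, times the tensor-core width of 16,
# compared against the 32-bit signed and unsigned limits.
print(8_306_688 * 63 * 16)   # 8_373_141_504, i.e. ~8 billion
print(2**31 - 1, 2**32 - 1)  # 2_147_483_647 and 4_294_967_295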

But even if that weren't a problem, tiny-cuda-nn's custom FullyFusedMLP kernels also currently use 32-bit integers for performance, so they would likewise start to break down once you push the batch size another order of magnitude higher. In short, I don't see an easy path forward on the library side.

An easier workaround: slice your batch into chunks of, say, 1M elements and compute parameter gradients for each chunk separately. Then simply average those gradients. The resulting values will be the same as if you had computed them from a single large batch (ignoring fp32 order-of-addition quirks, which shouldn't be significant here).

Performance-wise, this approach should also be more or less on par with the single large batch: 1M elements fed into a neural network should be enough to saturate your GPU.
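A minimal sketch of that chunked gradient accumulation, assuming a generic PyTorch training step (model, rgb_loss, rays, target_rgb, and optimizer are placeholders, not nerfstudio API):

import torch

CHUNK_SIZE = 1_000_000  # roughly 1M elements per chunk, as suggested above

def step_in_chunks(model, rays, target_rgb, rgb_loss, optimizer):
    """One training step whose accumulated gradients match the full-batch
    gradients (up to fp32 summation order), computed chunk by chunk."""
    optimizer.zero_grad()
    n = rays.shape[0]
    for start in range(0, n, CHUNK_SIZE):
        rays_c = rays[start:start + CHUNK_SIZE]
        target_c = target_rgb[start:start + CHUNK_SIZE]
        pred_c = model(rays_c)
        # Weight each chunk's mean loss by its share of the batch so the
        # gradients accumulated in .grad equal the full-batch average.
        loss = rgb_loss(target_c, pred_c) * (rays_c.shape[0] / n)
        loss.backward()  # gradients accumulate across chunks
    optimizer.step()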

Tom94 • Jan 10 '23 09:01

Is there any update on this? I'm facing the same issue when training with nerfstudio and nerfacto (also on an A100). Reducing the image size solves it, but I'd like to retain the full image resolution. Unfortunately, it's not possible for me to break my batch down any further and compute gradients on those chunks. Would really appreciate any help!

jaidevshriram • Jun 16 '23 05:06

Facing the same issue here on an A100: I get this error if I set the batch size too large. Reducing the batch size avoids it, but then only 20 of the 80 GB of GPU memory is used... :(

Nplace-su • Jan 18 '24 06:01