
Memory Corruption Error in Kernel _setup_linear

Open ethanbar11 opened this issue 2 years ago • 2 comments

Hey, I'm trying to use the forward_state function. From time to time, I get this error:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
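
The flag named in the message can also be set from Python rather than the shell. This is a minimal sketch, assuming it runs before CUDA is initialized (ideally before `import torch`); the helper `launch_blocking_enabled` is a hypothetical name used only for illustration.

```python
import os

# Assumption: this runs before any CUDA work starts (ideally before `import torch`),
# because the variable is read when the CUDA runtime initializes.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Return True if synchronous kernel launches were requested via the environment."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

With launches made synchronous, the error should be reported at the call that caused it, at the cost of slower execution.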

Raised from:

File "/media/data2/ethan_baron/state-spaces-improv/src/models/sequence/ss/kernel.py", line 434, in _setup_linear
    R = torch.linalg.solve(R.to(Q_D), Q_D)  # (H r N)

That is, from these lines (433-436) in the NPLR kernel:

        try:
            R = torch.linalg.solve(R.to(Q_D), Q_D)  # (H r N)
        except torch._C._LinAlgError:
            R = torch.tensor(np.linalg.solve(R.to(Q_D).cpu(), Q_D.cpu())).to(Q_D)

I changed these lines slightly for debugging, to:

try:
    R = torch.linalg.solve(R.to(Q_D), Q_D)  # (H r N)
except:
    # Debugging fallback: solve on the CPU with NumPy
    x1 = R.to(Q_D).cpu()
    x2 = Q_D.cpu()  # right-hand side is Q_D, as in the original snippet
    R = torch.tensor(np.linalg.solve(x1, x2)).to(Q_D)
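
The try/except fallback in these snippets can be sketched generically. This is an illustration only, not the repository's code: `solve_with_fallback`, `SingularMatrixError`, `failing_solver`, and `crashing_solver` are hypothetical names, and the solvers are plain Python stand-ins. It shows that catching only the expected exception type keeps unrelated errors (such as a device-side `RuntimeError`) visible, whereas a bare `except:` would hide them.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def solve_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    expected: Tuple[Type[BaseException], ...],
) -> T:
    """Run `primary`; if it raises one of `expected`, run `fallback` instead.

    Any other exception propagates unchanged, so unrelated failures stay visible.
    """
    try:
        return primary()
    except expected:
        return fallback()


class SingularMatrixError(RuntimeError):
    """Hypothetical stand-in for a solver-specific error such as a LinAlgError."""


def failing_solver() -> float:
    # Stand-in for a solve that fails in the expected way.
    raise SingularMatrixError("matrix is singular")


def crashing_solver() -> float:
    # Stand-in for an unrelated failure, e.g. a device-side error.
    raise RuntimeError("illegal memory access")


# The expected error triggers the fallback.
print(solve_with_fallback(failing_solver, lambda: 0.5, (SingularMatrixError,)))

# An unrelated RuntimeError is not swallowed by the narrow handler.
try:
    solve_with_fallback(crashing_solver, lambda: 0.5, (SingularMatrixError,))
except RuntimeError as err:
    print("propagated:", err)
```

Keeping the handler narrow means the fallback only runs for the failure it was written for.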

EDIT: Removed the stack trace (it was long and unhelpful) and put the code in code snippets.

ethanbar11 • Jul 20 '22 15:07

I looked into this recently and also found the same issue, which wasn't present before. I wasn't able to figure out why. It's weird that it happens randomly.

Regardless, the implementation of "state forwarding" (see the README) is currently unoptimized for S4, so using it is not recommended. If you need this functionality, it should work with S4D. Feel free to file another issue if something comes up.

Finally, could you please edit the original issue to be shorter, and in particular remove at least the last part of the stack trace? It might also help to put the whole thing in a code block. The last few lines are parsed in a way that references other issues, which is confusing.

albertfgu • Aug 09 '22 18:08

Yeah, I looked into it for a couple of days and couldn't figure out what was happening. I'm now using the S4D forward_state version, and so far it works quite well. I edited the issue, hopefully to be more readable. Thanks!

ethanbar11 • Aug 10 '22 05:08