Runtime Error in hellokan.ipynb

Open wkqian06 opened this issue 9 months ago • 1 comments

I ran into the same error reported in the previously closed issues #117, #46, and #89:

File ...\pykan\kan\spline.py:135, in curve2coef(x_eval, y_eval, grid, k, device)
    133 # x_eval: (size, batch); y_eval: (size, batch); grid: (size, grid); k: scalar
    134 mat = B_batch(x_eval, grid, k, device=device).permute(0, 2, 1)
--> 135 coef = torch.linalg.lstsq(mat.to('cpu'), y_eval.unsqueeze(dim=2).to('cpu')).solution[:, :, 0]  # sometimes 'cuda' version may diverge
    136 return coef.to(device)

RuntimeError: false INTERNAL ASSERT FAILED at "...\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\BatchLinearAlgebra.cpp":1540, please report a bug to PyTorch. torch.linalg.lstsq: (Batch element 0): Argument 6 has illegal value. Most certainly there is a bug in the implementation calling the backend library.

Changing opt to 'Adam' did not solve the problem.

It seems that the driver argument of torch.linalg.lstsq has to be specified explicitly instead of being left as None.

Both 'LBFGS' and 'Adam' work once I use the following if statement in curve2coef, though I don't know why.

    if device == 'cpu':
        coef = torch.linalg.lstsq(mat.to(device), y_eval.unsqueeze(dim=2).to(device), driver='gelsy').solution[:, :, 0]
    else:
        coef = torch.linalg.lstsq(mat.to(device), y_eval.unsqueeze(dim=2).to(device), driver='gels').solution[:, :, 0]  # sometimes 'cuda' version may diverge

wkqian06 avatar May 12 '24 05:05 wkqian06

This seems promising, I'll test it too! Can you explain the difference between driver='gelsy' and driver='gels'? Also, the if block can probably be rewritten as a single line: coef = torch.linalg.lstsq(mat.to(device), y_eval.unsqueeze(dim=2).to(device), driver='gelsy' if device == 'cpu' else 'gels').solution[:, :, 0]
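The driver choice could also be pulled out into a small helper so the lstsq call stays readable. This is only a sketch: lstsq_driver_for is a hypothetical name, not part of pykan, and it assumes device may arrive either as a string such as 'cuda:0' or as an object whose str() has that form (for example a torch.device).

```python
def lstsq_driver_for(device):
    """Return an explicit torch.linalg.lstsq driver name for a device.

    Hypothetical helper, not part of pykan. Accepts 'cpu', 'cuda', 'cuda:0',
    or any object whose str() looks like one of those.
    """
    # Drop an optional device index such as ':0' before comparing.
    device_type = str(device).split(':')[0]
    # 'gelsy' on CPU; every other device falls back to 'gels'.
    return 'gelsy' if device_type == 'cpu' else 'gels'


print(lstsq_driver_for('cpu'))     # gelsy
print(lstsq_driver_for('cuda:0'))  # gels
```

In curve2coef the result would then be passed as driver=lstsq_driver_for(device).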

AlessandroFlati avatar May 12 '24 08:05 AlessandroFlati

Honestly, I don't know the details. According to the torch.linalg.lstsq documentation, gelsy solves the least-squares problem using QR factorization with pivoting and is the default driver on CPU. gels also uses QR but assumes the matrix is full rank. On CUDA, gels is the only available driver.
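To see the two drivers side by side, here is a minimal CPU-only sketch. The tensor sizes and random data are made up for illustration; they only mimic the shapes that curve2coef passes to lstsq (a batched matrix and a right-hand side with a trailing dimension of 1).

```python
import torch

torch.manual_seed(0)

# Batched overdetermined systems: 3 problems, 50 equations, 8 unknowns each.
# These sizes are arbitrary and not taken from pykan.
mat = torch.randn(3, 50, 8, dtype=torch.float64)
y = torch.randn(3, 50, 1, dtype=torch.float64)

# On CPU both driver names are accepted; 'gels' assumes mat is full rank.
coef_gelsy = torch.linalg.lstsq(mat, y, driver='gelsy').solution[:, :, 0]
coef_gels = torch.linalg.lstsq(mat, y, driver='gels').solution[:, :, 0]

# For a well-conditioned, full-rank matrix the two solutions should agree.
print(coef_gelsy.shape)
print(torch.allclose(coef_gelsy, coef_gels, atol=1e-8))
```

If the matrix were rank deficient, gels would no longer be a safe choice, which is presumably why the documentation singles out the full-rank assumption.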

What is odd is that when the driver is None, as in the original code, it should automatically default to gelsy or gels depending on the device, yet that is exactly the case where the error appears.

wkqian06 avatar May 12 '24 14:05 wkqian06

Well, I think we can easily open a PR for this annoying bug using your suggestion (maybe just as the one-liner described above). If you're unsure how to do that, I'll take care of it.

AlessandroFlati avatar May 12 '24 15:05 AlessandroFlati

Sure, please go ahead. I'm new to collaborating on GitHub. Glad this could help.

wkqian06 avatar May 12 '24 15:05 wkqian06