pykan
M1 runtime fails with "AssertionError: Torch not compiled with CUDA enabled"
Hi! Thanks a lot for the awesome paper and implementation!
I can't get it to run on my M1 machine. I built PyTorch from source with the CUDA options disabled, as per https://github.com/IAMAl/PyTorch4M1.
I tried setting `device = "cpu"` and poked around, but I always get the same error when trying to run the examples:
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 6
      2 import torch
      4 # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
----> 6 model = KAN(width=[2,3,2,1], device='cpu')
      7 model.to(model.device)
      8 x = torch.normal(0,1,size=(100,2))

File ~/conda/envs/pykan-env/lib/python3.9/site-packages/kan/KAN.py:140, in KAN.__init__(self, width, grid, k, noise_scale, noise_scale_base, base_fun, symbolic_enabled, bias_trainable, grid_eps, grid_range, sp_trainable, sb_trainable, device, seed)
    137 for l in range(self.depth):
    138     # splines
    139     scale_base = 1 / np.sqrt(width[l]) + (torch.randn(width[l] * width[l + 1], ) * 2 - 1) * noise_scale_base
--> 140     sp_batch = KANLayer(in_dim=width[l], out_dim=width[l + 1], num=grid, k=k, noise_scale=noise_scale, scale_base=scale_base, scale_sp=1., base_fun=base_fun, grid_eps=grid_eps, grid_range=grid_range, sp_trainable=sp_trainable,
    141         sb_trainable=sb_trainable, device=device)
    142     self.act_fun.append(sp_batch)
    144     # bias

File ~/conda/envs/pykan-env/lib/python3.9/site-packages/kan/KANLayer.py:126, in KANLayer.__init__(self, in_dim, out_dim, num, k, noise_scale, scale_base, scale_sp, base_fun, grid_eps, grid_range, sp_trainable, sb_trainable, device)
    124     self.scale_base = torch.nn.Parameter(torch.ones(size, device=device) * scale_base).requires_grad_(sb_trainable)  # make scale trainable
    125 else:
--> 126     self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable)
    127 self.scale_sp = torch.nn.Parameter(torch.ones(size, device=device) * scale_sp).requires_grad_(sp_trainable)  # make scale trainable
    128 self.base_fun = base_fun
...
    286 raise AssertionError(
    287     "libcudart functions unavailable. It looks like you have a broken build?"
    288 )

AssertionError: Torch not compiled with CUDA enabled
```
What am I missing 🤔
You should put `torch==2.3.0+cu121` (or whatever CUDA version you need) into the requirements.
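For example, a pinned entry in requirements.txt could look like this (a sketch; the `cu121` tag and the extra index URL are just one common setup, adjust to your CUDA version):

```
# sketch of a requirements.txt pinned to a CUDA build (cu121 is an example tag)
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.3.0+cu121
```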
Actually, the latest master version is bugged without https://github.com/KindXiaoming/pykan/pull/98. @KindXiaoming
But PyTorch is built locally, not installed through requirements.txt, as that fails on M1 since there is no CUDA available. So I built it from source and installed it into the conda env separately.
I find it confusing that the error says "Torch not compiled with CUDA", since I had to explicitly disable those options before building; otherwise it fails to install.
So I'm thinking maybe the failure is in the PyTorch build, not in pykan... Maybe? I'll try to fiddle with the makefile, maybe I'm overlooking something there.
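For reference, this is the kind of sanity check I'd run on the local build first (the MPS query assumes torch >= 1.12):

```python
import torch

# If this runs, the CPU build itself is fine and the crash most likely
# comes from pykan's hard-coded .cuda() call, not from the PyTorch build.
x = torch.randn(3, 3)
print(x @ x.T)

print("CUDA available:", torch.cuda.is_available())         # expected False on M1
print("MPS available:", torch.backends.mps.is_available())  # depends on the build
```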
Hi! Yesterday I was able to run it on an M1 Max chip with the following versions (in an anaconda environment):

```
Name         Version  Build   Channel
torch        2.3.0    pypi_0  pypi
torchaudio   2.3.0    pypi_0  pypi
torchvision  0.18.0   pypi_0  pypi
```

It is extremely slow compared with the CPU version on Windows, though. Idk if it makes any difference, but I do not send the model to a device via torch, just these lines:

```python
kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=2, k=3, seed=0)
kan_model.train(my_ds, opt="LBFGS", steps=2, lamb=0.01, lamb_entropy=10.)
```
As you can see, the problem is in this line:

```python
self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable)
```

which, in a previous (misguided) attempt to let people use CUDA, forced the parameter onto cuda. You can edit that line yourself if you just want to use the CPU, but we should really just wait for the PR to be accepted.
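If you do edit it, a minimal device-agnostic sketch (not necessarily the exact code in the PR) is to replace the hard-coded `.cuda()` with the layer's `device` argument:

```python
# sketch: honor the device argument instead of forcing CUDA
self.scale_base = torch.nn.Parameter(
    torch.FloatTensor(scale_base).to(device)
).requires_grad_(sb_trainable)
```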
Also, I'm unable to run any KAN model on the GPU. I send both the dataset and the model to the device (cuda), but it keeps giving me this error:

```python
device = torch.device("cuda")

dataset = {}
dataset["train_input"] = torch.from_numpy(np.array(X_train))
dataset["test_input"] = torch.from_numpy(np.array(X_test))
dataset["train_label"] = torch.from_numpy(np.array(Y_train))
dataset["test_label"] = torch.from_numpy(np.array(Y_test))
for key, value in dataset.items():
    dataset[key] = dataset[key].to(device)

kan_model = KAN(width=[2, 1, grid_size * grid_size], grid=3, k=3, seed=0, device=device)
kan_model.train(dataset, opt="LBFGS", steps=50, lamb=0, lamb_entropy=0, device=device)
```

Error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`
@AlessandroFlati
I tried to change that to `mps`, but it didn't work (didn't expect it to...).
Idk, pretty far out of my comfort zone, tbh.
Alright, glad to hear a PR is in the pipeline, I'll wait for that. Thanks!
@gonzalalGFM Cheers! Maybe I'll give it a try until the PR gets merged.
You should actually change it to `cpu`, not to `mps`.
> Also, I'm unable to run any KAN model on the GPU. I send both the dataset and the model to the device (cuda), but it keeps giving me this error:
>
> `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`
Also fixed by the PR.
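Until it's merged, a stopgap that sometimes helps (a sketch, and it may not be enough, since the bug is that pykan creates some tensors internally on the wrong device, which is exactly what the PR addresses) is to move both model and data explicitly:

```python
# stopgap sketch: put model and data on the same device explicitly;
# KAN is an nn.Module, so .to(device) moves its registered parameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
kan_model = kan_model.to(device)
dataset = {key: value.to(device) for key, value in dataset.items()}
```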
> As you can see, the problem is in this line:
>
> `self.scale_base = torch.nn.Parameter(torch.FloatTensor(scale_base).cuda()).requires_grad_(sb_trainable)`
>
> You can edit that line yourself if you just want to use the CPU, but we should really just wait for the PR to be accepted.
I tried to edit that line using the code from your fork (`device = torch.device('cpu')`), but I still get `AssertionError: Torch not compiled with CUDA enabled`.
That's strange. Could you please create a reproducible gist/snippet so I can try to reproduce your case and further expand the PR if needed? That would be very much appreciated!
Sorry, my fault, I just copied the code in /kan from your fork; I thought you had already edited it. I edited those lines and it works, but I get some new errors when I run the code below:
```python
import torch
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from kan import KAN

dataset = {}
train_input, train_label = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=None)
test_input, test_label = make_moons(n_samples=1000, shuffle=True, noise=0.1, random_state=None)

dataset['train_input'] = torch.from_numpy(train_input)
dataset['test_input'] = torch.from_numpy(test_input)
dataset['train_label'] = torch.from_numpy(train_label[:, None])
dataset['test_label'] = torch.from_numpy(test_label[:, None])

device = torch.device('cpu')

X = dataset['train_input']
y = dataset['train_label']
plt.scatter(X[:, 0], X[:, 1], c=y[:, 0])

model = KAN(width=[2, 1], grid=3, k=3, device=device)

def train_acc():
    return torch.mean((torch.round(model(dataset['train_input'])[:, 0]) == dataset['train_label'][:, 0]).float())

def test_acc():
    return torch.mean((torch.round(model(dataset['test_input'])[:, 0]) == dataset['test_label'][:, 0]).float())

results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
print(results['train_acc'][-1], results['test_acc'][-1])
```
got errors like this:
```
description:   0%|          | 0/20 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\JUSTIN200\Desktop\pykan\example\test.py", line 32, in <module>
    results = model.train(dataset, opt="LBFGS", steps=20, metrics=(train_acc, test_acc))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KAN.py", line 899, in train
    self.update_grid_from_samples(dataset['train_input'][train_id].to(device))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KAN.py", line 244, in update_grid_from_samples
    self.forward(x)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KAN.py", line 312, in forward
    x_numerical, preacts, postacts_numerical, postspline = self.act_fun[l](x)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\KANLayer.py", line 175, in forward
    y = coef2curve(x_eval=x, grid=self.grid[self.weight_sharing], coef=self.coef[self.weight_sharing], k=self.k, device=self.device)  # shape (size, batch)
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\kan\spline.py", line 100, in coef2curve
    y_eval = torch.einsum('ij,ijk->ik', coef, B_batch(x_eval, grid, k, device=device))
  File "C:\Users\JUSTIN200\.conda\envs\kan\lib\site-packages\torch\functional.py", line 380, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: expected scalar type Double but found Float
```
OS: Windows 11
torch version: 2.2.2
I just think, as the RuntimeError describes, you have to cast the inputs to float through `.float()`, or maybe cast them as double.
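For context: `torch.from_numpy` keeps numpy's default float64 (what torch calls Double), while the model's parameters are float32 (Float), which is exactly the mismatch the einsum complains about. A minimal illustration:

```python
import numpy as np
import torch

x = torch.from_numpy(np.random.rand(10, 2))
print(x.dtype)  # torch.float64 -- numpy's default, i.e. Double

# Option 1: cast the inputs down to float32 to match the model's parameters
x32 = x.float()

# Option 2: cast the model's parameters up to float64 instead
# model = model.double()
```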
> I just think, as the RuntimeError describes, you have to cast the inputs to float through `.float()`, or maybe cast them as double.
Thanks,

```python
dataset['train_input'] = torch.from_numpy(train_input).float()
dataset['test_input'] = torch.from_numpy(test_input).float()
dataset['train_label'] = torch.from_numpy(train_label[:, None]).float()
dataset['test_label'] = torch.from_numpy(test_label[:, None]).float()
```

It works for me:

```
train loss: 1.58e-01 | test loss: 1.62e-01 | reg: 1.94e+00 : 100%|██| 20/20 [00:01<00:00, 16.32it/s]
1.0 0.996999979019165
```