pykan
About some concerns for integration of KAN into regular NN
Hi, Ziming! While trying your very detailed tutorial, I found a severe issue that may undermine the effectiveness of integrating KAN into other regular neural networks.
For instance, in tutorial/Example_4_symbolic_regression.ipynb you show how to do symbolic regression with a KAN. But if we widen the range of the dataset inputs from [-1, 1] to [-1.25, 1.25], the model fails to even become sparse, because the spline functions are not defined outside [-1, 1] in the first place. True enough, with a given dataset we can force the grid range to be exactly the range of the data. But what if we use a KAN as a hidden layer of another model? The ranges of the latent variables are not bounded, even after normalization (unless we use some unstable normalization such as projecting the batch minimum and maximum onto -1 and 1). So how can we resolve this? Say, apply batch normalization (std = 1) and then set the KAN grid range to [-3, 3], following the 3σ rule? Or could we use some trick like extending the leftmost 2k splines beyond the left boundary, and likewise on the right?
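For what it's worth, a minimal sketch of the batch-norm + 3σ idea, assuming the grid_range constructor argument sets the spline domain; the NormalizedKANBlock wrapper itself is hypothetical, not something from pykan:
import torch.nn as nn
from kan import KAN

class NormalizedKANBlock(nn.Module):
    # hypothetical wrapper: standardize the latent activations, then give the KAN
    # a grid wide enough to cover roughly +/- 3 standard deviations
    def __init__(self, dim_in, dim_hidden, dim_out):
        super().__init__()
        self.norm = nn.BatchNorm1d(dim_in)                  # approx. zero mean, unit std
        self.kan = KAN(width=[dim_in, dim_hidden, dim_out],
                       grid=5, k=3, grid_range=[-3, 3])     # splines defined on [-3, 3]

    def forward(self, x):
        return self.kan(self.norm(x))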
Could you please share more info? For example, what does the plot look like? The update_grid_from_samples used in training should handle data not lying in [-1, 1]. But the sparsification could be erroneous, and more work may be needed to make the regularization effective. These hyperparameters can be tried: lamb, lamb_entropy, seed. In particular, try pumping up lamb and lamb_entropy for more sparsity. To check whether the grid has been updated, you may print model.act_fun[0].grid.
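For concreteness, a rough sketch of "pumping up" the sparsity penalties and checking the grid; the specific values are illustrative, not recommendations:
# illustrative only: larger lamb / lamb_entropy push harder toward sparsity
model.train(dataset, opt="LBFGS", steps=50, lamb=0.1, lamb_entropy=10.)
# inspect whether the first layer's grid has moved to cover the data range
print(model.act_fun[0].grid)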
How about different data distributions? For example (a manual-scaling sketch follows this list):
mean=-10, std=10
mean=-0.5, std=0.00000001
mean=0, std=10
mean=0, std=1
mean=0, std=0.00000001  # important: std is very small; do I need to scale it myself?
mean=+0.5, std=0.00000001
....
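One way to probe these cases by hand (a sketch; the constants mirror the list above, and the small epsilon guards the tiny-std cases):
# sketch: shift/scale the inputs to mimic e.g. mean=-10, std=10, then standardize
x = dataset['train_input']
x_shifted = x * 10.0 - 10.0                        # pretend the raw data looked like this
mu, sigma = x_shifted.mean(0), x_shifted.std(0)
x_normalized = (x_shifted - mu) / (sigma + 1e-8)   # epsilon avoids dividing by a tiny std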
It's still recommended to normalize your data (despite KAN's ability to update the grid, the effect of scale is still subtle as far as I can tell; in my experience, normalized data work best), and then pass the normalizer argument to symbolic_formula to account for the effect of normalization, or you can account for it manually. I should have a notebook on how to use that soon.
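A rough sketch of that workflow, assuming symbolic_formula accepts a normalizer argument holding the mean and std used to standardize the inputs (the exact form it expects may differ):
# standardize the inputs by hand and remember the statistics
mean = dataset['train_input'].mean(0)
std = dataset['train_input'].std(0)
dataset['train_input'] = (dataset['train_input'] - mean) / std
dataset['test_input'] = (dataset['test_input'] - mean) / std
model.train(dataset, opt="LBFGS", steps=50, lamb=0.01, lamb_entropy=10.)
# pass the statistics back so the recovered formula is in the original units
# (after fixing the activations to symbolic functions, e.g. via auto_symbolic)
formula = model.symbolic_formula(normalizer=[mean, std])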
FWIW, I have a model similar to this, but using piecewise Lagrange polynomials, and there are two approaches I've used to solve this problem: one is to reflect at the boundaries of [-1, 1], which I don't particularly like; the other is to use an infinity-norm normalization (or similar), which is parameterless. Otherwise, other normalizations often work fine. I have used it on a number of different types of networks; yes, issues like this still need to be worked out, but you can make it work for deep networks: https://github.com/jloveric/high-order-layers-torch (I've used them up to 20 layers deep). BTW, I like to call these types of approaches high-order neural networks, as that reflects what they would typically be called in computational physics.
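For reference, a parameterless infinity-norm normalization along the lines described above could look like this (my own sketch, not code from high-order-layers-torch):
import torch

def infinity_norm_normalize(x, eps=1e-12):
    # divide each sample by its largest absolute component so every feature
    # lands in [-1, 1]; no learned parameters involved
    max_abs = x.abs().amax(dim=-1, keepdim=True)
    return x / (max_abs + eps)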
BTW, there was another paper (probably many more recently, since polynomial nets have become more popular) that solved this by doing a linear extension beyond [-1, 1]: https://arxiv.org/abs/1906.10064. I've never implemented this myself, but I think it would work quite well. Also, if one is not using LBFGS, the Lion optimizer works pretty well, since it uses the sign of the gradients.
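To make the linear-extension idea concrete, here is my own illustration (not the paper's implementation): inside [-1, 1] evaluate the fitted function as usual; outside, continue along the tangent at the nearest boundary.
import torch

def linear_extension(f, x, lo=-1.0, hi=1.0, h=1e-4):
    # evaluate f on the clamped input, then continue linearly outside [lo, hi]
    # using a finite-difference slope at each boundary (illustrative only)
    y = f(x.clamp(lo, hi))
    slope_lo = (f(torch.tensor(lo + h)) - f(torch.tensor(lo))) / h
    slope_hi = (f(torch.tensor(hi)) - f(torch.tensor(hi - h))) / h
    y = torch.where(x < lo, f(torch.tensor(lo)) + slope_lo * (x - lo), y)
    y = torch.where(x > hi, f(torch.tensor(hi)) + slope_hi * (x - hi), y)
    return y

# e.g. linear_extension(torch.sin, torch.linspace(-2, 2, 9))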
Thanks, Ziming! I'll try using update_grid_from_samples every few epochs. Can I understand this function as projecting f(x) to f(kx+b) according to the range of the given data?
BTW, the out-of-range setup may look like this:
from kan import *
# create a KAN: 2D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = KAN(width=[2,5,1], grid=5, k=3, seed=0)
# create dataset f(x,y) = exp(sin(pi*x)+y^2)
f = lambda x: torch.exp(torch.sin(torch.pi*x[:,[0]]) + x[:,[1]]**2)
dataset = create_dataset(f, n_var=2, ranges=[-1.25, 1.25])
dataset['train_input'].shape, dataset['train_label'].shape
model.train(dataset, opt="LBFGS", steps=200, lamb=0.01, lamb_entropy=10.);
I've tested grid=20 as well; it works well on [-1.25, 1.25], but badly on [-2, 2].
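Continuing the snippet above, one way to see what the grid update does is to call it directly and compare the grid before and after (a sketch using the method name from the earlier reply):
print(model.act_fun[0].grid)                            # grid before the explicit update
model.update_grid_from_samples(dataset['train_input'])
print(model.act_fun[0].grid)                            # grid after adapting to the samples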
Thanks for your insight! My thought was: since these are piecewise polynomials, why not simply unleash their 'boundaries', i.e., let the outermost pieces extend past them? That is a natural analytic continuation of order k (the order of the polynomial).
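As a concrete reference point for that kind of continuation, SciPy's BSpline does exactly this when extrapolate=True: it evaluates the first and last polynomial pieces outside the base interval (a small standalone demo, unrelated to pykan's internals):
import numpy as np
from scipy.interpolate import BSpline

k = 3                                                                # cubic pieces
t = np.concatenate(([-1.0] * k, np.linspace(-1, 1, 7), [1.0] * k))   # clamped knots on [-1, 1]
c = np.random.randn(len(t) - k - 1)                                  # arbitrary coefficients
spl = BSpline(t, c, k, extrapolate=True)
print(spl(1.0), spl(1.25))   # the value at 1.25 continues the last cubic piece analytically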