pykan

About some concerns regarding the integration of KAN into regular NNs

Open Eachen-Soong opened this issue 9 months ago • 7 comments

Hi, Ziming! While trying your very detailed tutorial, I’ve found a severe issue which may undermine the effectiveness of integrating KAN into other regular neural networks.

For instance, in tutorial/Example_4_symbolic_regression.ipynb, you've shown how to do symbolic regression with KAN. But if we alter the range of the dataset input from [-1, 1] to [-1.25, 1.25], the model fails to even become sparse. That's because the spline functions are not defined outside [-1, 1] in the first place. True enough, with a given dataset we can force the grid range to be exactly the range of the data. But what if we use KAN as a hidden layer of another model? The ranges of the latent variables are not bounded, even after normalization (unless we use unstable normalizations like projecting the maximum and minimum of each batch to 1 and -1). So how can we resolve this? Say, apply batch normalization (std = 1) and then set the range of the KAN grids to [-3, 3] (following the 3σ rule)? Or could we use some trick like extending the left 2k splines beyond the left bound, and likewise on the right side?
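For concreteness, here is roughly what I have in mind (a sketch only; I'm assuming KAN exposes a grid_range argument, please correct me if the name differs in the current version, and training would need a plain PyTorch loop since the KAN sits behind a BatchNorm here):

import torch
import torch.nn as nn
from kan import KAN

# standardize each feature, then give the splines a [-3, 3] grid so that
# ~99.7% of the (roughly unit-variance) activations fall inside the grid
bn = nn.BatchNorm1d(2)
kan = KAN(width=[2, 5, 1], grid=5, k=3, grid_range=[-3, 3], seed=0)  # grid_range is the assumption here

x = 10 * torch.randn(128, 2)   # inputs deliberately far outside [-1, 1]
y = kan(bn(x))                 # after BatchNorm, the splines see values in roughly [-3, 3]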

Eachen-Soong avatar May 04 '24 16:05 Eachen-Soong

Could you please share more info? For example, what does the plot look like? update_grid_from_samples, which is used in training, should handle data not lying in [-1, 1]. But it could be that the sparsification is erroneous and more work needs to be done to make the regularization more effective. These hyperparameters can be tried: lamb, lamb_entropy, seed. In particular, try pumping up lamb and lamb_entropy for more sparsity. To check whether the grid has been updated, you can print model.act_fun[0].grid.
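For example, something like this (a minimal sketch, reusing the setup that appears later in this thread):

from kan import *
import torch

f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2, ranges=[-1.25, 1.25])

model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0)
print(model.act_fun[0].grid)   # initial grid, defined on [-1, 1]

# larger lamb / lamb_entropy pushes harder toward sparsity
model.train(dataset, opt="LBFGS", steps=50, lamb=0.1, lamb_entropy=10.)
print(model.act_fun[0].grid)   # should now roughly cover [-1.25, 1.25] if the grid update ran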

KindXiaoming avatar May 04 '24 18:05 KindXiaoming

How about different data distributions? E.g.:

mean=-10, std=10

mean=-0.5, std=0.00000001

mean=0, std=10

mean=0, std=1

mean=0, std=0.00000001 # important: std is very small; do I need to scale it myself?

mean=+0.5, std=0.00000001

....
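For instance, a plain-torch sketch of producing and standardizing these cases by hand (the eps guard for a tiny std is my own addition):

import torch

def standardize(x, eps=1e-8):
    # per-feature standardization; eps keeps a near-zero std (e.g. 1e-8) from blowing up
    mu = x.mean(dim=0, keepdim=True)
    sigma = x.std(dim=0, keepdim=True)
    return (x - mu) / (sigma + eps)

for mean, std in [(-10, 10), (-0.5, 1e-8), (0, 10), (0, 1), (0, 1e-8), (0.5, 1e-8)]:
    x = mean + std * torch.randn(1024, 2)
    xn = standardize(x)
    print(mean, std, xn.mean().item(), xn.std().item())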

yuedajiong avatar May 04 '24 18:05 yuedajiong

It's still recommended to normalize your data (despite KAN's ability to update the grid, the effect of scale is still subtle as far as I can tell; in my experience, normalized data work best), and then pass the normalizer argument to symbolic_formula to undo the effect of normalization, or you can undo it manually. I should have a notebook on how to use that soon.
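Roughly, the workflow would look like this (a sketch only; the exact normalizer format and the symbolic steps should be checked against the notebook once it's up):

from kan import *
import torch

f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2, ranges=[-1.25, 1.25])

# normalize inputs to zero mean / unit std before training
mean = dataset['train_input'].mean(dim=0)
std = dataset['train_input'].std(dim=0)
for key in ['train_input', 'test_input']:
    dataset[key] = (dataset[key] - mean) / std

model = KAN(width=[2, 5, 1], grid=5, k=3, seed=0)
model.train(dataset, opt="LBFGS", steps=50, lamb=0.01, lamb_entropy=10.)

# snap activations to symbolic forms, then read the formula back in the
# original (un-normalized) variables; normalizer=[mean, std] is my assumption,
# and the exact return structure of symbolic_formula may differ by version
model.auto_symbolic(lib=['sin', 'x^2', 'exp'])
formula = model.symbolic_formula(normalizer=[mean, std])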

KindXiaoming avatar May 04 '24 18:05 KindXiaoming

FWIW, I have a model similar to this, but using piecewise Lagrange polynomials, and there are two approaches I've used to solve this problem. One is to reflect at the boundaries of [-1, 1], which I don't particularly like; the other is to use an infinity-norm normalization (or similar), which is parameterless. Otherwise, other normalizations often work fine. I have used it on a number of different types of networks, and yes, issues like this still need to be worked out, but you can make it work for deep networks (https://github.com/jloveric/high-order-layers-torch); I've used them up to 20 layers deep. BTW, I like to call these types of approaches high-order neural networks, as that reflects what they would typically be called in computational physics.
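In code, the parameterless infinity-norm idea might look roughly like this (a sketch, not the code from high-order-layers-torch):

import torch

def linf_normalize(x, eps=1e-12):
    # scale each sample by its largest-magnitude feature, so every input
    # ends up in [-1, 1] with no learnable parameters and no batch statistics
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps)
    return x / scale

x = torch.tensor([[3.0, -0.5], [0.2, 0.1]])
print(linf_normalize(x))   # rows become [1.0, -0.1667] and [1.0, 0.5]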

jloveric avatar May 04 '24 19:05 jloveric

BTW, there was another paper (probably many more recently, since polynomial nets have become more popular) that solved this by doing a linear extension beyond [-1, 1]: https://arxiv.org/abs/1906.10064. I've never implemented this myself, but I think it would work quite well. Also, if one is not using LBFGS, the Lion optimizer works pretty well, since it uses the sign of the gradients.
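In code, the linear-extension idea is roughly the following (a toy illustration, not the paper's implementation):

import torch

def linear_extension(f, df, x, lo=-1.0, hi=1.0):
    # inside [lo, hi] evaluate f directly; outside, continue linearly from the
    # nearest boundary with matched value and slope (so the result is C^1)
    xc = x.clamp(lo, hi)                 # nearest point inside the range
    return f(xc) + df(xc) * (x - xc)     # (x - xc) is zero inside the range

# toy 'basis function' and its derivative, standing in for a spline piece
f  = lambda t: t ** 3
df = lambda t: 3 * t ** 2

x = torch.tensor([-2.0, -1.0, 0.5, 1.0, 1.5])
print(linear_extension(f, df, x))        # tensor([-4.0000, -1.0000, 0.1250, 1.0000, 2.5000])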

jloveric avatar May 04 '24 22:05 jloveric

> Could you please share more info? For example, what does the plot look like? update_grid_from_samples, which is used in training, should handle data not lying in [-1, 1]. But it could be that the sparsification is erroneous and more work needs to be done to make the regularization more effective. These hyperparameters can be tried: lamb, lamb_entropy, seed. In particular, try pumping up lamb and lamb_entropy for more sparsity. To check whether the grid has been updated, you can print model.act_fun[0].grid.

Thanks, Ziming! I'll try calling update_grid_from_samples every few epochs (sketch at the end of this comment). Can I understand this function as projecting f(x) to f(kx + b) according to the range of the given data?

BTW, the model trained on out-of-range data may look like this: [image: plot of the trained KAN]

from kan import *
# create a KAN: 2D inputs, 1D output, and 5 hidden neurons. cubic spline (k=3), 5 grid intervals (grid=5).
model = KAN(width=[2,5,1], grid=5, k=3, seed=0)

# create dataset f(x,y) = exp(sin(pi*x)+y^2)
f = lambda x: torch.exp(torch.sin(torch.pi*x[:,[0]]) + x[:,[1]]**2)
dataset = create_dataset(f, n_var=2, ranges=[-1.25, 1.25])
dataset['train_input'].shape, dataset['train_label'].shape

model.train(dataset, opt="LBFGS", steps=200, lamb=0.01, lamb_entropy=10.);

I've tested grid=20 as well; it works well on [-1.25, 1.25], but badly on [-2, 2].
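Here is roughly what I plan to try for the per-few-epochs grid update (a sketch; I'm assuming update_grid_from_samples takes a batch of inputs, correct me if not):

from kan import *
import torch

f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2, ranges=[-2, 2])   # the range that currently fails

model = KAN(width=[2, 5, 1], grid=20, k=3, seed=0)
for _ in range(10):
    # re-fit every spline grid to wherever the current inputs/activations lie
    model.update_grid_from_samples(dataset['train_input'])
    model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.)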

Eachen-Soong avatar May 05 '24 05:05 Eachen-Soong

> BTW, there was another paper (probably many more recently, since polynomial nets have become more popular) that solved this by doing a linear extension beyond [-1, 1]: https://arxiv.org/abs/1906.10064. I've never implemented this myself, but I think it would work quite well. Also, if one is not using LBFGS, the Lion optimizer works pretty well, since it uses the sign of the gradients.

Thanks for your insight! My thought was: since these are piecewise polynomials, why not simply remove their boundary restriction? That would be a natural analytic continuation of order k (the order of the polynomial).
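As a toy picture of the difference from the linear extension above: the continuation just keeps evaluating the boundary polynomial itself outside its interval (illustration only; the coefficients are made up):

import torch

# pretend this is the rightmost piece of a cubic spline, valid on [0.5, 1.0]
coeffs = [0.2, -0.4, 0.1, 0.8]                       # a + b*x + c*x^2 + d*x^3
poly = lambda x: sum(c * x ** i for i, c in enumerate(coeffs))

x = torch.tensor([1.0, 1.5, 2.0])                    # at and beyond the right boundary
print(poly(x))   # continuation: keep using the same polynomial outside [0.5, 1.0]

One thing to watch: the degree-k term grows quickly away from the boundary, which is presumably part of why the paper above prefers a linear extension instead.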

Eachen-Soong avatar May 05 '24 05:05 Eachen-Soong