PySR
Symbolic deep learning
I'm trying to recreate the examples from this paper, but PySR always predicts scalars as a low-complexity solution, which doesn't make much sense to me. Can you please elaborate on that? And what am I doing wrong that prevents me from getting the right expression?
Cycles per second: 3.050e+03
Progress: 19 / 20 total iterations (95.000%)
Hall of Fame:
-----------------------------------------
Complexity Loss Score Equation
1 1.278e-01 -9.446e-02 -0.08741549
2 1.165e-01 9.256e-02 square(-0.18644808)
3 2.592e-02 1.503e+00 (x0 * -0.2923665)
5 1.682e-02 2.163e-01 ((-0.10430038 * x0) * x2)
8 1.576e-02 2.176e-02 (1.6735333 * sin((-0.067048885 * x0) * x2))
The code used to generate this is:
import numpy as np
from pysr import pysr, best
# Dataset
X = np.array(messages_over_time[-1][['dx', 'dy', 'r', 'm1', 'm2']]) # Taken from this notebook https://github.com/MilesCranmer/symbolic_deep_learning/blob/master/GN_Demo_Colab.ipynb
y = np.array(messages_over_time[-1]['e64'])
# Learn equations
equations = pysr(X, y, niterations=5,
                 binary_operators=["plus", "mult", "sub", "pow", "div"],
                 unary_operators=["cos", "exp", "sin", "neg", "square", "cube", "exp",
                                  "inv(x) = 1/x"],
                 batching=True, batchSize=1000)
print(best(equations))
Hi @abdalazizrashid,
Thanks for checking out the paper!! It's great you are trying it out with PySR. I should actually update that colab notebook to directly use PySR, that would be cool.
You mean scalar as in output is ℝ, right? So, PySR is actually made to predict exclusively scalar functions. It builds expressions as: f: ℝⁿ → ℝ, where n is the number of features (columns of X). It can't find vector functions at the moment. In the GNN case, basically each message component is a single scalar ℝ (with many examples), and we try to fit the functions for each message component independently. Does this make more sense?
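As a rough sketch, fitting two message components independently could look like this (the column names other than 'e64' are hypothetical placeholders for whichever components you select):

# Fit one symbolic expression per scalar message component
for col in ['e64', 'e127']:  # hypothetical component names; use your own top components
    y_col = np.array(messages_over_time[-1][col])
    eqs = pysr(X, y_col, niterations=100,
               binary_operators="* + / -".split(" "),
               unary_operators=[])
    print(col, best(eqs))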
Btw, for your specific example, I would note that there appear to be several degeneracies in the operator set: "sub" is redundant once you have "neg" and "plus", and "square", "cube", "pow", and "mult" all do similar things. This will slow PySR down because it makes expressions harder to simplify. Genetic algorithms like the one used in PySR scale poorly with the number of dimensions, so the fewer features and operators you give them, the better.
Here is what I would suggest you try:
equations = pysr(X[:1000], y[:1000], niterations=5,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[],
                 annealing=False, useFrequency=True,
                 npopulations=20,
                 optimizer_algorithm="BFGS",
                 optimizer_iterations=10,
                 optimize_probability=1)
These extra arguments will be the defaults in v0.6.0 (#33), but for now you need to enter them manually. I would also increase niterations=5 to niterations=100 if it is still unable to find the expressions; this is basically the number of training steps.
Let me know if you have other questions. Cheers, Miles
Oh, and by the way, you can pass a pandas DataFrame as X instead of a numpy array if you want. It will then use the column names as the variable names in the equations.
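For example, a minimal sketch (reusing the same messages_over_time DataFrame from the notebook; the column names here are just the ones from your snippet):

# Pass a DataFrame directly; its column names become variable names in the printed equations
df = messages_over_time[-1][['dx', 'dy', 'r', 'm1', 'm2']]
y = np.array(messages_over_time[-1]['e64'])
equations = pysr(df, y, niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[])
print(best(equations))  # prints an expression in dx, dy, r, m1, m2 rather than x0, x1, ...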
One more thing: depending on which law you are searching for, you might need to increase the maxsize argument.
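For instance (a sketch; maxsize=30 is just an illustrative value above the default):

equations = pysr(X[:1000], y[:1000], niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[],
                 maxsize=30)  # allow longer expressions for more complex force laws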
Update: I used the suggested arguments, set niterations=100, and let it run overnight, but it still hasn't finished. So my question is: how long did it take to reproduce the original example in the notebook using Eureqa? Also, as far as I know, PySR uses genetic algorithms, which are gradient-free optimization methods, so can you elaborate on how the BFGS optimizer is used for this task?
Hm, it should find it very fast. Question: what does this plot look like for you in the colab notebook?
It should be linear like this^. If it's not, it means the messages aren't equal to the forces yet and you should train for longer. IIRC the colab doesn't train the neural network for very long, so you might want to increase the number of steps?
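For reference, the check I mean is roughly the following (a sketch rather than the exact notebook code; msg and force are hypothetical arrays holding one learned message component and the true pairwise force components):

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# msg: shape (N,), one message component; force: shape (N, 2) or (N, 3), true force components
fit = LinearRegression().fit(force, msg)
plt.scatter(fit.predict(force), msg, s=1)
plt.xlabel('linear fit of true force')
plt.ylabel('learned message component')
plt.show()  # points should lie on a straight line if the messages have learned the force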
Also, I would try loss="L1DistLoss()", which is the same loss we used in the paper (absolute error), rather than the default L2DistLoss (which might be more sensitive to outliers).
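Concretely, that is just an extra keyword on the same call (a sketch):

equations = pysr(X[:1000], y[:1000], niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[],
                 loss="L1DistLoss()")  # absolute-error loss, as used in the paper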
BFGS or the default NelderMead are used for optimizing the constants. BFGS uses gradients and NelderMead is a simplex method. The genetic algorithm is also used for optimizing the constants, but this optimizer is used for "fine tuning" the constants once in a while.
@MilesCranmer Thanks for the response. After training for 200 epochs, this is what I got; what do you think? Also, all of this is a reproduction of what's in the notebook.
Oh, wait, I see three plots; are you running one of the 3D force laws? You will need to pass 'dz' to PySR as well, because the z direction is used in the 3D force laws.
But other than this, those messages still don't look linear enough. What force law is this? Is this the L1 regularization, or another kind?
Yes, it's 3D. This is the spring simulation, with L1 regularization.
I see, thanks. So, I would definitely pass 'dz' to PySR along with the other parameters, since it appears in the force law. (The notebook is just set up for the 2D sim as a demo; other cases require some manual changes.) Can you also train with n = 8 instead of n = 4? (That trains with 8 particles instead of 4.)
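On the PySR side, that change would look roughly like this (a sketch based on your earlier snippet; it assumes the 3D run produces a 'dz' column in messages_over_time):

X = np.array(messages_over_time[-1][['dx', 'dy', 'dz', 'r', 'm1', 'm2']])  # include dz for the 3D law
y = np.array(messages_over_time[-1]['e64'])
equations = pysr(X[:1000], y[:1000], niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[])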
By the way, I just noticed colab updated to PyTorch 1.8.0. I updated the notebook for this.
This is what I got after training for 200 epochs with n=8:
I am confused about what the issue here could be. Want to email me to discuss this more? My email is miles<dot>cranmer at gmail.