PySR
Symbolic deep learning
I'm trying to recreate the examples from this paper, but PySR always predicts scalars as a low-complexity solution, which doesn't make much sense to me. Can you please elaborate on that? And what am I doing wrong that prevents me from getting the right expression?
Cycles per second: 3.050e+03
Progress: 19 / 20 total iterations (95.000%)
Hall of Fame:
-----------------------------------------
Complexity Loss Score Equation
1 1.278e-01 -9.446e-02 -0.08741549
2 1.165e-01 9.256e-02 square(-0.18644808)
3 2.592e-02 1.503e+00 (x0 * -0.2923665)
5 1.682e-02 2.163e-01 ((-0.10430038 * x0) * x2)
8 1.576e-02 2.176e-02 (1.6735333 * sin((-0.067048885 * x0) * x2))
The code used to generate this is:
import numpy as np
from pysr import pysr, best
# Dataset
X = np.array(messages_over_time[-1][['dx', 'dy', 'r', 'm1', 'm2']]) # Taken from this notebook https://github.com/MilesCranmer/symbolic_deep_learning/blob/master/GN_Demo_Colab.ipynb
y = np.array(messages_over_time[-1]['e64'])
# Learn equations
equations = pysr(X, y, niterations=5,
                 binary_operators=["plus", "mult", "sub", "pow", "div"],
                 unary_operators=["cos", "exp", "sin", "neg", "square", "cube", "exp",
                                  "inv(x) = 1/x"],
                 batching=True, batchSize=1000)
print(best(equations))
Hi @abdalazizrashid,
Thanks for checking out the paper!! It's great you are trying it out with PySR. I should actually update that colab notebook to directly use PySR, that would be cool.
You mean scalar as in output is ℝ, right? So, PySR is actually made to predict exclusively scalar functions. It builds expressions as: f: ℝⁿ → ℝ, where n is the number of features (columns of X). It can't find vector functions at the moment. In the GNN case, basically each message component is a single scalar ℝ (with many examples), and we try to fit the functions for each message component independently. Does this make more sense?
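As a rough sketch, fitting two message components independently could look like this (the column names other than 'e64' are hypothetical placeholders for whichever components you select):

# Fit one symbolic expression per scalar message component
for col in ['e64', 'e127']:  # hypothetical component names; use your own top components
    y_col = np.array(messages_over_time[-1][col])
    eqs = pysr(X, y_col, niterations=100,
               binary_operators="* + / -".split(" "),
               unary_operators=[])
    print(col, best(eqs))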
Btw, for your specific example, I would note that there appear to be several degeneracies in the operator set: "sub" is redundant once you have "neg" and "plus", and "square", "cube", "pow", and "mult" all do similar things. This will slow PySR down because it makes expressions harder to simplify. Genetic algorithms like the one used in PySR scale poorly with the number of dimensions, so the fewer features and operators you give them, the better.
Here is what I would suggest you try:
equations = pysr(X[:1000], y[:1000], niterations=5,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[],
                 annealing=False, useFrequency=True,
                 npopulations=20,
                 optimizer_algorithm="BFGS",
                 optimizer_iterations=10,
                 optimize_probability=1)
These extra arguments will be the defaults in v0.6.0 (#33), but for now you need to enter them manually. I would also increase niterations=5 to niterations=100 if it is still unable to find the expressions; this is basically the number of training steps.
Let me know if you have other questions. Cheers, Miles
Oh, and by the way, you can pass a pandas DataFrame as X instead of a numpy array if you want. It will then use the column names as the variable names in the equations.
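For example, a minimal sketch (reusing the same messages_over_time DataFrame from the notebook; the column names here are just the ones from your snippet):

# Pass a DataFrame directly; its column names become variable names in the printed equations
df = messages_over_time[-1][['dx', 'dy', 'r', 'm1', 'm2']]
y = np.array(messages_over_time[-1]['e64'])
equations = pysr(df, y, niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[])
print(best(equations))  # prints an expression in dx, dy, r, m1, m2 rather than x0, x1, ...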
One more thing: depending on which law you are searching for, you might need to increase the maxsize argument.
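For instance (a sketch; maxsize=30 is just an illustrative value above the default):

equations = pysr(X[:1000], y[:1000], niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[],
                 maxsize=30)  # allow longer expressions for more complex force laws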
Update: I used the suggested arguments, set niterations=100, and let it run overnight, but it still hasn't finished. So my question is: how long did it take to reproduce the original example in the notebook using Eureqa? Also, as far as I know, PySR uses genetic algorithms, which are gradient-free optimization methods, so can you elaborate on how the BFGS optimizer is used for this task?
Hm, it should find it very fast. Question: what does this plot look like for you in the colab notebook?
It should be linear like this^. If it's not, it means the messages aren't equal to the forces yet and you should train for longer. IIRC the colab doesn't train the neural network for very long, so you might want to increase the number of steps?
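For reference, the check I mean is roughly the following (a sketch rather than the exact notebook code; msg and force are hypothetical arrays holding one learned message component and the true pairwise force components):

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# msg: shape (N,), one message component; force: shape (N, 2) or (N, 3), true force components
fit = LinearRegression().fit(force, msg)
plt.scatter(fit.predict(force), msg, s=1)
plt.xlabel('linear fit of true force')
plt.ylabel('learned message component')
plt.show()  # points should lie on a straight line if the messages have learned the force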
Also, I would try loss="L1DistLoss()", which is the same loss we used in the paper (absolute error), rather than the default L2DistLoss (which might be more sensitive to outliers).
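Concretely, that is just an extra keyword on the same call (a sketch):

equations = pysr(X[:1000], y[:1000], niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[],
                 loss="L1DistLoss()")  # absolute-error loss, as used in the paper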
BFGS or the default NelderMead are used for optimizing the constants. BFGS uses gradients and NelderMead is a simplex method. The genetic algorithm is also used for optimizing the constants, but this optimizer is used for "fine tuning" the constants once in a while.
@MilesCranmer Thanks for the response. After training for 200 epochs, this is what I got; what do you think? Also, all of this is a reproduction of what's in the notebook.
Oh, wait, I see three plots; are you running one of the 3D force laws? You will need to pass 'dz' to PySR as well, because the z direction is used in the 3D force laws.
But other than this, those messages still don't look linear enough. What force law is this? Is this the L1 regularization, or another kind?
Yes, it's 3D. This is the spring simulation, with L1 regularization.
I see, thanks. So, I would definitely pass 'dz' to PySR along with the other parameters, since it appears in the force law. (The notebook is just set up for the 2D sim as a demo; other cases require some manual changes.) Can you also train with n = 8 instead of n = 4? (That trains with 8 particles instead of 4.)
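On the PySR side, that change would look roughly like this (a sketch based on your earlier snippet; it assumes the 3D run produces a 'dz' column in messages_over_time):

X = np.array(messages_over_time[-1][['dx', 'dy', 'dz', 'r', 'm1', 'm2']])  # include dz for the 3D law
y = np.array(messages_over_time[-1]['e64'])
equations = pysr(X[:1000], y[:1000], niterations=100,
                 binary_operators="* + / -".split(" "),
                 unary_operators=[])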
By the way, I just noticed colab updated to PyTorch 1.8.0. I updated the notebook for this.
This is what I got after training for 200 epochs with n=8:
I am confused about what the issue here could be. Want to email me to discuss this more? My email is miles<dot>cranmer at gmail.