[BUG]: Loss saved in hall of fame is not identical to loss recalculated based on prediction
Hello,
I have identified an issue with some PySR regressions, that I ran. The loss saved in the hall of fame is not identical to the loss calculated with the same dataset and same custom loss function based on the PySR prediction. I have documented the issue in the notebook file attached. Have similar problems occurred before or do you have any idea, what might be the reason for this? As an additional test, i was thinking about trying to recalculate the loss of the individual inside PySR, the be able to reproduce the potentially faulty loss-calculation. Is there a way to do so?
Best regards and thanks in advance!
Jupyter Notebook Code:
import pickle
import numpy as np
import pandas as pd
from pysr import jl
from pysr.julia_helpers import jl_array
# Import object of class implementing regression with PySR (pysr_object.regression_object is PySRRegressor)
with open('cluster_2_round_0_runs_4_conv_suc_pysr_object_0103.pkl', 'rb') as file:
pysr_object = pickle.load(file)
# Get loss of 22. individual for 2. y-variable
loss_in_pysr = pysr_object.regression_object.equations_[2].iloc[22].loss
# Training dataset used for regression
training_dataset = pysr_object.dataset.dataset[pysr_object.training_data_bool].reset_index(drop=True)
# Predict y-variables for training data with the 22. individual for the 2. y-variable
y_pred= pd.DataFrame(
data=pysr_object.regression_object.predict(
training_dataset[pysr_object.x_variables],
index = [0,0,22,0,0]
),
columns=pysr_object.y_variables
)
# Get prediction and actual data for 2. y-variable
y2_pred = y_pred[pysr_object.y_variables[2]]
y2_act = training_dataset[pysr_object.y_variables[2]]
# Create julia loss function string based on dataset
def make_fitness_function_v2_5on(y):
# For code see attached PDF
# Create julia custom loss function string, apart from in- and outputs identical to the one implemented in pysr_object (see below)
fitness_function = make_fitness_function_v2_5on(training_dataset[pysr_object.y_variables])
print(fitness_function)
#Printed:
function custom_loss_function(y_pred, y)
if isapprox(y[1],2.2565436)
p5 = 1.2804586
p80 = 15.113608
a_1 = 0.6248880798745418
b_1 = 1.9206878219999999
a_2 = 2
b_3 = -6.09607886912224
a_3 = 18.0350585617555
c_3 = -9.43495030694149
elseif isapprox(y[1],0.04650065)
p5 = 0.0040257196
p80 = 1.8914138
a_1 = 11.144567754841631
b_1 = 0.006038579739964286
a_2 = 2
b_3 = 0.0400789096218153
a_3 = 3.86298538267220
c_3 = 1.23985086650155
elseif isapprox(y[1],0.6221858)
p5 = 0.46299943
p80 = 13.051656
a_1 = 1.0391895552455737
b_1 = 0.694499148
a_2 = 2
b_3 = 4.05385061247809
a_3 = 34.2110131049562
c_3 = -71.0354535190814
elseif isapprox(y[1],-0.45914933)
p5 = 0.41124585
p80 = 5.6052365
a_1 = 1.1026412343064227
b_1 = 0.616868799645
a_2 = 2
b_3 = 0.514240734249554
a_3 = 12.2389548644991
c_3 = -10.9601082467183
elseif isapprox(y[1],0.8109809)
p5 = 0.00813762
p80 = 17.448
a_1 = 7.838560195225847
b_1 = 0.012206430080501547
a_2 = 2
b_3 = 6.53098971058287
a_3 = 47.9579776388101
c_3 = -117.475032351869
end
y_border = zeros(length(y))
y_border[abs.(y).<p5] .= (a_1 * y[abs.(y).<p5]).^2 .+ b_1
y_border[(abs.(y).>=p5).&(abs.(y).<=p80)] .= a_2 * abs.(y[(abs.(y).>=p5).&(abs.(y).<=p80)])
y_border[abs.(y).>p80] .= a_3 * log.(abs.(y[abs.(y).>p80]) .+ b_3) .+ c_3
custom_error = (sum( ( (1 ./ y_border) .* (y_pred .- y) ) .^2 ) )/(length(y))
return custom_error
end
# For comparison: loss function of pysr_object
loss_function_pysr = pysr_object.regression_object.loss_function
print(loss_function_pysr)
# Printed:
function custom_loss_function(tree, dataset::Dataset{T,L}, options::Options, idx=nothing)::L where {T,L}
if isapprox(dataset.y[1],2.2565436)
p5 = 1.2804586
p80 = 15.113608
a_1 = 0.6248880798745418
b_1 = 1.9206878219999999
a_2 = 2
b_3 = -6.09607886912224
a_3 = 18.0350585617555
c_3 = -9.43495030694149
elseif isapprox(dataset.y[1],0.04650065)
p5 = 0.0040257196
p80 = 1.8914138
a_1 = 11.144567754841631
b_1 = 0.006038579739964286
a_2 = 2
b_3 = 0.0400789096218153
a_3 = 3.86298538267220
c_3 = 1.23985086650155
elseif isapprox(dataset.y[1],0.6221858)
p5 = 0.46299943
p80 = 13.051656
a_1 = 1.0391895552455737
b_1 = 0.694499148
a_2 = 2
b_3 = 4.05385061247809
a_3 = 34.2110131049562
c_3 = -71.0354535190814
elseif isapprox(dataset.y[1],-0.45914933)
p5 = 0.41124585
p80 = 5.6052365
a_1 = 1.1026412343064227
b_1 = 0.616868799645
a_2 = 2
b_3 = 0.514240734249554
a_3 = 12.2389548644991
c_3 = -10.9601082467183
elseif isapprox(dataset.y[1],0.8109809)
p5 = 0.00813762
p80 = 17.448
a_1 = 7.838560195225847
b_1 = 0.012206430080501547
a_2 = 2
b_3 = 6.53098971058287
a_3 = 47.9579776388101
c_3 = -117.475032351869
end
if idx == nothing
y_pred, complete = eval_tree_array(tree, dataset.X, options)
y = dataset.y
else
y_pred, complete = eval_tree_array(tree, dataset.X[:,idx], options)
y = dataset.y[idx]
end
y_border = zeros(length(y))
y_border[abs.(y).<p5] .= (a_1 * y[abs.(y).<p5]).^2 .+ b_1
y_border[(abs.(y).>=p5).&(abs.(y).<=p80)] .= a_2 * abs.(y[(abs.(y).>=p5).&(abs.(y).<=p80)])
y_border[abs.(y).>p80] .= a_3 * log.(abs.(y[abs.(y).>p80]) .+ b_3) .+ c_3
if !complete
return L(Inf)
end
custom_error = (sum( ( (1 ./ y_border) .* (y_pred .- y) ) .^2 ) )/(length(y))
return custom_error
end
# Translating elements to julia
fitness_function_jl = jl.seval(fitness_function)
y2_pred_jl = jl_array(np.array(y2_pred, dtype=np.float32).T)
y2_act_jl = jl_array(np.array(y2_act, dtype=np.float32).T)
loss_recalculated = fitness_function_jl(y2_pred_jl, y2_act_jl)
print('Loss saved in PySR: ' + str(loss_in_pysr))
print('Loss recalculated: ' + str(loss_recalculated))
# Printed:
Loss saved in PySR: 0.01720243
Loss recalculated: 0.022473989882165576
-> Loss recalculated and loss calculated in PySR are not identical!
Version
1.3.1
Operating System
Windows
Package Manager
Conda
Interface
Other (specify below)
Relevant log output
Extra Info
@BrotherHa is it okay if you paste the code inline in the markdown block? This is so that search indexes like Google can correctly find this issue if other people have a similar problem.
Hello @MilesCranmer Thank you very much for the quick reply. Is the format ok like that? Best regards
I have searched a bit where the issue occurs and in fact for most individuals of the regression the loss values from the equations_-table and the recalculated ones are absolutely identical until very high precision, but for some rare cases they divert significantly as in the example.
Is it only when reloading from a pickle file?
And are you sure you are computing the loss in a way completely identical between the Julia and the Python side of things?
About the loss functions I am very sure. The only issues here might be problems caused by the precision of calculation, etc. And for more than 90% of all individuals both methods calculate exactly the same loss. And concerning the pickle-file, I have also tried using the 'from_file' method, but it still stays the same.
Just to confirm, it is only when you load the model from a saved checkpoint that this happens? It’s not when you use warm_start=True and call fit a second time?
90% of all individuals both methods calculate exactly the same loss
Do you see any pattern in the ones that don’t match? e.g., Are they very complex expressions?
90% of all individuals both methods calculate exactly the same loss
Do you see any pattern in the ones that don’t match? e.g., Are they very complex expressions?
I do not see any pattern in the equations, that have the error. They are not the most complex ones neither do they have explicitly complex operators.
Just to confirm, it is only when you load the model from a saved checkpoint that this happens? It’s not when you use
warm_start=Trueand callfita second time?
It happens also when using fit for the first time. I will try to reload and refit, when I find time.
But it's only when you have a model loaded from disk, right? Like if you did:
model = PySRRegressor(warm_start=True) # fresh model; NOT loaded from disk
model.fit(X, y)
#= ... =#
model.fit(X, y)
The Problem occured in the first place when fitting a new model. Then directly after the regression finished I called predict() and calculated the loss for training, test and total dataset based on that. But before that I saved the output of model.get_best(). The value from get_best() and the manual calculation for the best individual shows the bahavior in a less significant way (0.014545 instead of 0.014536), but it is still not in the range most other values are equal. For the example given in my first messege here I called the predict() method also just after the initial run and calculated the loss as 0.022473989882165576, but I did not save the loss from model.equations_ separately. But in the hall of fame csv-file loss is also saved as 0.01720243, like it is outputed when I call model.equations_ from the loaded model.
@BrotherHa would it be possible for you to make a MWE of this issue so I can try to reproduce it? No worries if not. It's a bit tricky because the unittests already check for warm starts from files and they seem to pass. So I'm puzzled as to what conditions trigger this.
I have encountered a similar problem when using a custom cosine-similarity-based loss.
Example Code
import numpy as np
import pandas as pd
from pysr import PySRRegressor
nsamples = 30
np.random.seed(4283223)
x1 = np.random.uniform(size=nsamples)
x2 = np.random.uniform(size=nsamples)
x3 = np.random.uniform(size=nsamples)
x4 = np.random.uniform(size=nsamples)
X = pd.DataFrame({"x_1": x1, "x_2": x2, "x_3": x3, "x_4": x4})
target = (x1 * x4 - x3 * x2) ** 3
model = PySRRegressor(
random_state=4208438902,
deterministic=True,
parallelism="serial",
niterations=30,
binary_operators=["+", "-", "*", "/"],
unary_operators=["cube", "square"],
loss_function="""
using Statistics
function eval_loss(tree, dataset::Dataset{T,L}, options)::L where {T,L}
prediction, _ = eval_tree_array(tree, dataset.X, options)
ratio = prediction ./ dataset.y
norm = sqrt(sum(abs2, prediction) * sum(abs2, dataset.y))
cos_value = sum(prediction .* dataset.y) / norm
loss = abs(acos(clamp(cos_value, 0.0, 1.0)))
scale = mean(abs, prediction)
scale_loss = scale + 1 / (scale + 1e-6)
return L(loss + 1e-6 * scale_loss)
end
""",
)
model.fit(X, target)
print(model)
print("Formula:", model.sympy())
prediction = model.predict(X)
norm = np.sqrt(np.sum(target**2) * np.sum(prediction**2))
cos_value = np.sum(target * prediction) / norm
loss = np.abs(np.arccos(np.clip(cos_value, 0, 1)))
scale = np.mean(np.abs(prediction))
scale_loss = scale + 1 / (scale + 1e-6)
print("Computed loss in Python:", loss + 1e-6 * scale_loss)
PySRRegressor.equations_ = [
pick score equation loss complexity
0 0.000000 2.642676e-25 1.000000 1
1 11.553640 cube(1.9243316e22) 0.000010 2
2 >>>> 1.328369 square(cube(1.9243316e22)) 0.000003 3
]
Formula: 5.07784534534534e+133
Computed loss in Python: 5.077845345345341e+127
The reported loss is very small, but it should be a very large one.
Although I have specified the random seed and make PySR deterministic, the final result is still random.
Here are outputs of all five runs
PySRRegressor.equations_ = [
pick score equation loss complexity
0 0.000000 2.642676e-25 1.000000 1
1 11.553640 cube(1.9243316e22) 0.000010 2
2 >>>> 1.328369 square(cube(1.9243316e22)) 0.000003 3
]
Formula: 5.07784534534534e+133
Computed loss in Python: 5.077845345345341e+127
PySRRegressor.equations_ = [
pick score equation loss complexity
0 0.000000 7.33e-43 1.000000 1
1 0.987759 1.1066597 / 0.0 0.138689 3
2 1.754098 cube((x_1 * x_4) - (x_2 * x_3)) 0.000022 8
3 >>>> 1.188177 cube(((x_1 * x_4) - (x_2 * x_3)) * 2.7796757) 0.000002 10
]
Formula: 21.4774339167278*(x_1*x_4 - x_2*x_3)**3
Computed loss in Python: 1.9999991259531853e-06
PySRRegressor.equations_ = [
pick score equation loss complexity
0 0.000000 1.3934125e-37 1.000000 1
1 11.240707 cube(1.3326811e38) 0.000013 2
2 0.557888 cube(square(1.3326811e38)) 0.000008 3
3 >>>> 0.611431 cube(square(1.3326811e38)) - 0.9821605 0.000002 5
4 0.008833 (cube(square(1.3326811e38)) - square(1.3070838... 0.000002 8
]
Formula: 5.60218481169532e+228 - 1*0.9821605
/home/ac/Projects/PKU/HallSR/pysr_bug.py:43: RuntimeWarning: overflow encountered in square
norm = np.sqrt(np.sum(target**2) * np.sum(prediction**2))
Computed loss in Python: 5.6021848116953185e+222
PySRRegressor.equations_ = [
pick score equation loss complexity
0 0.000000 1.5459748e-30 1.000000 1
1 0.057146 cube(cube(x_4)) 0.891998 3
2 0.302634 cube(x_1 * x_4) 0.659070 4
3 0.091833 cube(cube(x_1 + x_4)) 0.601242 5
4 0.398972 cube(x_4 * (x_1 - 0.44725403)) 0.403439 6
5 0.331675 cube(x_4 * (x_1 - cube(x_3))) 0.289556 7
6 9.506604 cube((x_4 * x_1) - (x_3 * x_2)) 0.000022 8
7 >>>> 1.188146 cube(((x_1 * x_4) - (x_3 * x_2)) / 0.3583628) 0.000002 10
8 0.000006 (cube((x_4 * x_1) - (x_3 * x_2)) / 0.010429359... 0.000002 12
9 0.000026 ((cube((x_1 * x_4) - (x_2 * x_3)) * 0.83377784... 0.000002 14
]
Formula: 21.7285745471395*(x_1*x_4 - x_2*x_3)**3
Computed loss in Python: 2.000126025169295e-06
PySRRegressor.equations_ = [
pick score equation loss complexity
0 0.000000 1.0714004e-24 1.000000 1
1 0.780003 cube(5.0544116e29) 0.458405 2
2 10.262700 cube(cube(5.0544116e29)) 0.000016 3
3 0.082098 cube(cube(cube(5.0544116e29))) 0.000015 4
4 >>>> 0.375597 square(cube(square(square(cube(cube(1.3362112e... 0.000002 9
5 0.016263 square(cube(3.4662442e29) / cube(x_4 / cube(-1... 0.000002 16
]
Formula: 1.54533717529716e+8235*x_3
/home/ac/Projects/PKU/HallSR/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:86: RuntimeWarning: invalid value encountered in reduce
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
Computed loss in Python: nan
@AllanChain could you set precision=64 and try again? The backend defaults to 32-bit for speed
Thanks for the timely reply. Using precision=64 fixes the issue. I didn't consider the precision problem because, even if it were a precision issue, the loss shouldn't become extremely small—values like Inf or NaN would be more expected.
I suspect it’s related to not handling the second argument of eval_tree_array in your custom loss function. The second argument tells you if the evaluation succeeded or not - if not, then the first argument is basically random, and should not be used. This is used to deal with stuff like Inf and NaN in a safe way
Thank you! It's my bad not looking into the documentation carefully, and I'm sorry for bringing in an unrelated issue.
No worries! And btw be sure to look through the discussions page as there are many examples of loss functions which might be helpful