PySR icon indicating copy to clipboard operation
PySR copied to clipboard

[BUG]: Loss saved in hall of fame is not identical to loss recalculated based on prediction

Open BrotherHa opened this issue 11 months ago • 19 comments

Hello,

I have identified an issue with some PySR regressions, that I ran. The loss saved in the hall of fame is not identical to the loss calculated with the same dataset and same custom loss function based on the PySR prediction. I have documented the issue in the notebook file attached. Have similar problems occurred before or do you have any idea, what might be the reason for this? As an additional test, i was thinking about trying to recalculate the loss of the individual inside PySR, the be able to reproduce the potentially faulty loss-calculation. Is there a way to do so?

Best regards and thanks in advance!

Jupyter Notebook Code:

import pickle
import numpy as np
import pandas as pd
from pysr import jl
from pysr.julia_helpers import jl_array
# Import object of class implementing regression with PySR (pysr_object.regression_object is PySRRegressor)
with open('cluster_2_round_0_runs_4_conv_suc_pysr_object_0103.pkl', 'rb') as file:
    pysr_object = pickle.load(file)
# Get loss of 22. individual for 2. y-variable
loss_in_pysr = pysr_object.regression_object.equations_[2].iloc[22].loss
# Training dataset used for regression
training_dataset = pysr_object.dataset.dataset[pysr_object.training_data_bool].reset_index(drop=True)
# Predict y-variables for training data with the 22. individual for the 2. y-variable
y_pred= pd.DataFrame(
    data=pysr_object.regression_object.predict(
        training_dataset[pysr_object.x_variables],
        index = [0,0,22,0,0]
    ),
    columns=pysr_object.y_variables
)
# Get prediction and actual data for 2. y-variable
y2_pred = y_pred[pysr_object.y_variables[2]]
y2_act = training_dataset[pysr_object.y_variables[2]]
# Create julia loss function string based on dataset
def make_fitness_function_v2_5on(y):
# For code see attached PDF
# Create julia custom loss function string, apart from in- and outputs identical to the one implemented in pysr_object (see below)
fitness_function = make_fitness_function_v2_5on(training_dataset[pysr_object.y_variables])
print(fitness_function)

#Printed:
function custom_loss_function(y_pred, y)
    if isapprox(y[1],2.2565436)
        p5 = 1.2804586
        p80 = 15.113608
        a_1 = 0.6248880798745418
        b_1 = 1.9206878219999999
        a_2 = 2
        b_3 = -6.09607886912224
        a_3 = 18.0350585617555
        c_3 = -9.43495030694149
    elseif isapprox(y[1],0.04650065)
        p5 = 0.0040257196
        p80 = 1.8914138
        a_1 = 11.144567754841631
        b_1 = 0.006038579739964286
        a_2 = 2
        b_3 = 0.0400789096218153
        a_3 = 3.86298538267220
        c_3 = 1.23985086650155
    elseif isapprox(y[1],0.6221858)
        p5 = 0.46299943
        p80 = 13.051656
        a_1 = 1.0391895552455737
        b_1 = 0.694499148
        a_2 = 2
        b_3 = 4.05385061247809
        a_3 = 34.2110131049562
        c_3 = -71.0354535190814
    elseif isapprox(y[1],-0.45914933)
        p5 = 0.41124585
        p80 = 5.6052365
        a_1 = 1.1026412343064227
        b_1 = 0.616868799645
        a_2 = 2
        b_3 = 0.514240734249554
        a_3 = 12.2389548644991
        c_3 = -10.9601082467183
    elseif isapprox(y[1],0.8109809)
        p5 = 0.00813762
        p80 = 17.448
        a_1 = 7.838560195225847
        b_1 = 0.012206430080501547
        a_2 = 2
        b_3 = 6.53098971058287
        a_3 = 47.9579776388101
        c_3 = -117.475032351869
    end
    y_border = zeros(length(y))
    y_border[abs.(y).<p5] .= (a_1 * y[abs.(y).<p5]).^2 .+ b_1
    y_border[(abs.(y).>=p5).&(abs.(y).<=p80)] .= a_2 * abs.(y[(abs.(y).>=p5).&(abs.(y).<=p80)])
    y_border[abs.(y).>p80] .= a_3 * log.(abs.(y[abs.(y).>p80]) .+ b_3) .+ c_3
    custom_error = (sum( ( (1 ./ y_border) .* (y_pred .- y) ) .^2 ) )/(length(y))
    return custom_error
end
# For comparison: loss function of pysr_object
loss_function_pysr = pysr_object.regression_object.loss_function
print(loss_function_pysr)

# Printed:
function custom_loss_function(tree, dataset::Dataset{T,L}, options::Options, idx=nothing)::L where {T,L}
    if isapprox(dataset.y[1],2.2565436)
        p5 = 1.2804586
        p80 = 15.113608
        a_1 = 0.6248880798745418
        b_1 = 1.9206878219999999
        a_2 = 2
        b_3 = -6.09607886912224
        a_3 = 18.0350585617555
        c_3 = -9.43495030694149
    elseif isapprox(dataset.y[1],0.04650065)
        p5 = 0.0040257196
        p80 = 1.8914138
        a_1 = 11.144567754841631
        b_1 = 0.006038579739964286
        a_2 = 2
        b_3 = 0.0400789096218153
        a_3 = 3.86298538267220
        c_3 = 1.23985086650155
    elseif isapprox(dataset.y[1],0.6221858)
        p5 = 0.46299943
        p80 = 13.051656
        a_1 = 1.0391895552455737
        b_1 = 0.694499148
        a_2 = 2
        b_3 = 4.05385061247809
        a_3 = 34.2110131049562
        c_3 = -71.0354535190814
    elseif isapprox(dataset.y[1],-0.45914933)
        p5 = 0.41124585
        p80 = 5.6052365
        a_1 = 1.1026412343064227
        b_1 = 0.616868799645
        a_2 = 2
        b_3 = 0.514240734249554
        a_3 = 12.2389548644991
        c_3 = -10.9601082467183
    elseif isapprox(dataset.y[1],0.8109809)
        p5 = 0.00813762
        p80 = 17.448
        a_1 = 7.838560195225847
        b_1 = 0.012206430080501547
        a_2 = 2
        b_3 = 6.53098971058287
        a_3 = 47.9579776388101
        c_3 = -117.475032351869
    end
    if idx == nothing
        y_pred, complete = eval_tree_array(tree, dataset.X, options)
        y = dataset.y
    else
        y_pred, complete = eval_tree_array(tree, dataset.X[:,idx], options)
        y = dataset.y[idx]
    end
    y_border = zeros(length(y))
    y_border[abs.(y).<p5] .= (a_1 * y[abs.(y).<p5]).^2 .+ b_1
    y_border[(abs.(y).>=p5).&(abs.(y).<=p80)] .= a_2 * abs.(y[(abs.(y).>=p5).&(abs.(y).<=p80)])
    y_border[abs.(y).>p80] .= a_3 * log.(abs.(y[abs.(y).>p80]) .+ b_3) .+ c_3
    if !complete
        return L(Inf)
    end
    custom_error = (sum( ( (1 ./ y_border) .* (y_pred .- y) ) .^2 ) )/(length(y))
    return custom_error
end
# Translating elements to julia
fitness_function_jl = jl.seval(fitness_function)
y2_pred_jl = jl_array(np.array(y2_pred, dtype=np.float32).T)
y2_act_jl = jl_array(np.array(y2_act, dtype=np.float32).T)
loss_recalculated = fitness_function_jl(y2_pred_jl, y2_act_jl)
print('Loss saved in PySR: ' + str(loss_in_pysr))
print('Loss recalculated: ' + str(loss_recalculated))

# Printed:
Loss saved in PySR: 0.01720243
Loss recalculated: 0.022473989882165576

-> Loss recalculated and loss calculated in PySR are not identical!

Version

1.3.1

Operating System

Windows

Package Manager

Conda

Interface

Other (specify below)

Relevant log output


Extra Info

loss_difference_pysr.pdf

BrotherHa avatar Jan 24 '25 15:01 BrotherHa

@BrotherHa is it okay if you paste the code inline in the markdown block? This is so that search indexes like Google can correctly find this issue if other people have a similar problem.

MilesCranmer avatar Jan 24 '25 15:01 MilesCranmer

Hello @MilesCranmer Thank you very much for the quick reply. Is the format ok like that? Best regards

BrotherHa avatar Jan 24 '25 17:01 BrotherHa

I have searched a bit where the issue occurs and in fact for most individuals of the regression the loss values from the equations_-table and the recalculated ones are absolutely identical until very high precision, but for some rare cases they divert significantly as in the example.

BrotherHa avatar Jan 25 '25 09:01 BrotherHa

Is it only when reloading from a pickle file?

MilesCranmer avatar Jan 25 '25 09:01 MilesCranmer

And are you sure you are computing the loss in a way completely identical between the Julia and the Python side of things?

MilesCranmer avatar Jan 25 '25 09:01 MilesCranmer

About the loss functions I am very sure. The only issues here might be problems caused by the precision of calculation, etc. And for more than 90% of all individuals both methods calculate exactly the same loss. And concerning the pickle-file, I have also tried using the 'from_file' method, but it still stays the same.

BrotherHa avatar Jan 25 '25 10:01 BrotherHa

Just to confirm, it is only when you load the model from a saved checkpoint that this happens? It’s not when you use warm_start=True and call fit a second time?

MilesCranmer avatar Jan 25 '25 16:01 MilesCranmer

90% of all individuals both methods calculate exactly the same loss

Do you see any pattern in the ones that don’t match? e.g., Are they very complex expressions?

MilesCranmer avatar Jan 25 '25 16:01 MilesCranmer

90% of all individuals both methods calculate exactly the same loss

Do you see any pattern in the ones that don’t match? e.g., Are they very complex expressions?

I do not see any pattern in the equations, that have the error. They are not the most complex ones neither do they have explicitly complex operators.

BrotherHa avatar Jan 27 '25 11:01 BrotherHa

Just to confirm, it is only when you load the model from a saved checkpoint that this happens? It’s not when you use warm_start=True and call fit a second time?

It happens also when using fit for the first time. I will try to reload and refit, when I find time.

BrotherHa avatar Jan 27 '25 11:01 BrotherHa

But it's only when you have a model loaded from disk, right? Like if you did:

model = PySRRegressor(warm_start=True)  # fresh model; NOT loaded from disk

model.fit(X, y)

#= ... =#

model.fit(X, y)

MilesCranmer avatar Jan 27 '25 14:01 MilesCranmer

The Problem occured in the first place when fitting a new model. Then directly after the regression finished I called predict() and calculated the loss for training, test and total dataset based on that. But before that I saved the output of model.get_best(). The value from get_best() and the manual calculation for the best individual shows the bahavior in a less significant way (0.014545 instead of 0.014536), but it is still not in the range most other values are equal. For the example given in my first messege here I called the predict() method also just after the initial run and calculated the loss as 0.022473989882165576, but I did not save the loss from model.equations_ separately. But in the hall of fame csv-file loss is also saved as 0.01720243, like it is outputed when I call model.equations_ from the loaded model.

BrotherHa avatar Jan 28 '25 17:01 BrotherHa

@BrotherHa would it be possible for you to make a MWE of this issue so I can try to reproduce it? No worries if not. It's a bit tricky because the unittests already check for warm starts from files and they seem to pass. So I'm puzzled as to what conditions trigger this.

MilesCranmer avatar Feb 05 '25 11:02 MilesCranmer

I have encountered a similar problem when using a custom cosine-similarity-based loss.

Example Code
import numpy as np
import pandas as pd
from pysr import PySRRegressor

nsamples = 30
np.random.seed(4283223)
x1 = np.random.uniform(size=nsamples)
x2 = np.random.uniform(size=nsamples)
x3 = np.random.uniform(size=nsamples)
x4 = np.random.uniform(size=nsamples)

X = pd.DataFrame({"x_1": x1, "x_2": x2, "x_3": x3, "x_4": x4})
target = (x1 * x4 - x3 * x2) ** 3

model = PySRRegressor(
    random_state=4208438902,
    deterministic=True,
    parallelism="serial",
    niterations=30,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cube", "square"],
    loss_function="""
using Statistics

function eval_loss(tree, dataset::Dataset{T,L}, options)::L where {T,L}
    prediction, _ = eval_tree_array(tree, dataset.X, options)
    ratio = prediction ./ dataset.y
    norm = sqrt(sum(abs2, prediction) * sum(abs2, dataset.y))
    cos_value = sum(prediction .* dataset.y) / norm
    loss = abs(acos(clamp(cos_value, 0.0, 1.0)))
    scale = mean(abs, prediction)
    scale_loss = scale + 1 / (scale + 1e-6)
    return L(loss + 1e-6 * scale_loss)
end
""",
)

model.fit(X, target)
print(model)
print("Formula:", model.sympy())

prediction = model.predict(X)
norm = np.sqrt(np.sum(target**2) * np.sum(prediction**2))
cos_value = np.sum(target * prediction) / norm
loss = np.abs(np.arccos(np.clip(cos_value, 0, 1)))
scale = np.mean(np.abs(prediction))
scale_loss = scale + 1 / (scale + 1e-6)
print("Computed loss in Python:", loss + 1e-6 * scale_loss)
The target function in this example is `, and I aim to maximize the cosine similarity between the prediction and the target. To prevent the coefficients from becoming too large and introducing numerical instabilities, I added a `scale_loss`. Additionally, I included more operators than necessary to increase the likelihood of encountering this issue. Setting aside the validity of the loss design, in the example above, the reported loss can differ significantly from the loss recalculated in Python. For example:
PySRRegressor.equations_ = [
	  pick      score                    equation      loss  complexity
	0         0.000000                2.642676e-25  1.000000           1
	1        11.553640          cube(1.9243316e22)  0.000010           2
	2  >>>>   1.328369  square(cube(1.9243316e22))  0.000003           3
]
Formula: 5.07784534534534e+133
Computed loss in Python: 5.077845345345341e+127

The reported loss is very small, but it should be a very large one.

Although I have specified the random seed and make PySR deterministic, the final result is still random.

Here are outputs of all five runs
PySRRegressor.equations_ = [
	  pick      score                    equation      loss  complexity
	0         0.000000                2.642676e-25  1.000000           1
	1        11.553640          cube(1.9243316e22)  0.000010           2
	2  >>>>   1.328369  square(cube(1.9243316e22))  0.000003           3
]
Formula: 5.07784534534534e+133
Computed loss in Python: 5.077845345345341e+127
PySRRegressor.equations_ = [
	  pick     score                                       equation      loss  complexity
	0        0.000000                                       7.33e-43  1.000000           1
	1        0.987759                                1.1066597 / 0.0  0.138689           3
	2        1.754098                cube((x_1 * x_4) - (x_2 * x_3))  0.000022           8
	3  >>>>  1.188177  cube(((x_1 * x_4) - (x_2 * x_3)) * 2.7796757)  0.000002          10
]
Formula: 21.4774339167278*(x_1*x_4 - x_2*x_3)**3
Computed loss in Python: 1.9999991259531853e-06
PySRRegressor.equations_ = [
	  pick      score                                           equation      loss  complexity
	0         0.000000                                      1.3934125e-37  1.000000           1
	1        11.240707                                 cube(1.3326811e38)  0.000013           2
	2         0.557888                         cube(square(1.3326811e38))  0.000008           3
	3  >>>>   0.611431             cube(square(1.3326811e38)) - 0.9821605  0.000002           5
	4         0.008833  (cube(square(1.3326811e38)) - square(1.3070838...  0.000002           8
]
Formula: 5.60218481169532e+228 - 1*0.9821605
/home/ac/Projects/PKU/HallSR/pysr_bug.py:43: RuntimeWarning: overflow encountered in square
  norm = np.sqrt(np.sum(target**2) * np.sum(prediction**2))
Computed loss in Python: 5.6021848116953185e+222
PySRRegressor.equations_ = [
	  pick     score                                           equation      loss  complexity
	0        0.000000                                      1.5459748e-30  1.000000           1
	1        0.057146                                    cube(cube(x_4))  0.891998           3
	2        0.302634                                    cube(x_1 * x_4)  0.659070           4
	3        0.091833                              cube(cube(x_1 + x_4))  0.601242           5
	4        0.398972                     cube(x_4 * (x_1 - 0.44725403))  0.403439           6
	5        0.331675                      cube(x_4 * (x_1 - cube(x_3)))  0.289556           7
	6        9.506604                    cube((x_4 * x_1) - (x_3 * x_2))  0.000022           8
	7  >>>>  1.188146      cube(((x_1 * x_4) - (x_3 * x_2)) / 0.3583628)  0.000002          10
	8        0.000006  (cube((x_4 * x_1) - (x_3 * x_2)) / 0.010429359...  0.000002          12
	9        0.000026  ((cube((x_1 * x_4) - (x_2 * x_3)) * 0.83377784...  0.000002          14
]
Formula: 21.7285745471395*(x_1*x_4 - x_2*x_3)**3
Computed loss in Python: 2.000126025169295e-06
PySRRegressor.equations_ = [
	  pick      score                                           equation      loss  complexity
	0         0.000000                                      1.0714004e-24  1.000000           1
	1         0.780003                                 cube(5.0544116e29)  0.458405           2
	2        10.262700                           cube(cube(5.0544116e29))  0.000016           3
	3         0.082098                     cube(cube(cube(5.0544116e29)))  0.000015           4
	4  >>>>   0.375597  square(cube(square(square(cube(cube(1.3362112e...  0.000002           9
	5         0.016263  square(cube(3.4662442e29) / cube(x_4 / cube(-1...  0.000002          16
]
Formula: 1.54533717529716e+8235*x_3
/home/ac/Projects/PKU/HallSR/.venv/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:86: RuntimeWarning: invalid value encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
Computed loss in Python: nan

AllanChain avatar Aug 16 '25 03:08 AllanChain

@AllanChain could you set precision=64 and try again? The backend defaults to 32-bit for speed

MilesCranmer avatar Aug 16 '25 15:08 MilesCranmer

Thanks for the timely reply. Using precision=64 fixes the issue. I didn't consider the precision problem because, even if it were a precision issue, the loss shouldn't become extremely small—values like Inf or NaN would be more expected.

AllanChain avatar Aug 16 '25 15:08 AllanChain

I suspect it’s related to not handling the second argument of eval_tree_array in your custom loss function. The second argument tells you if the evaluation succeeded or not - if not, then the first argument is basically random, and should not be used. This is used to deal with stuff like Inf and NaN in a safe way

MilesCranmer avatar Aug 16 '25 15:08 MilesCranmer

Thank you! It's my bad not looking into the documentation carefully, and I'm sorry for bringing in an unrelated issue.

AllanChain avatar Aug 16 '25 15:08 AllanChain

No worries! And btw be sure to look through the discussions page as there are many examples of loss functions which might be helpful

MilesCranmer avatar Aug 16 '25 18:08 MilesCranmer