
Symbolic Formula predicts worse than model itself

Open seyidcemkarakas opened this issue 9 months ago • 1 comments

Hi,

I have been experimenting with KAN and implemented a regression model; my output layer has 1 node.

After training I got pretty good R2 and MAE on my dataset (train, val, and test). I then wanted to get a symbolic formula, which I extracted following => https://kindxiaoming.github.io/pykan/Examples/Example_3_classfication.html

After that I wrote a function that takes the symbolic formula from KAN and evaluates it on the given inputs. The function is below:

def kan_symbolic_formula_prediction(formula, X):
    batch = X.shape[0]
    predictions = []  # empty list for collecting predictions

    for i in range(batch):
        # Substitute one row of X into the symbolic formula
        expression = formula
        for j in range(X.shape[1]):
            expression = expression.subs(f'x_{j+1}', X[i, j])

        # Numerically evaluate the formula
        predicted = float(expression.evalf())

        predictions.append(predicted)

    return predictions
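For larger batches, the per-row `subs` loop gets slow. A vectorized alternative (just a sketch, assuming `formula` is a single sympy expression over symbols `x_1 … x_d`) compiles the expression once with `sympy.lambdify` and evaluates the whole array at once:

```python
import numpy as np
import sympy as sp

def kan_symbolic_formula_prediction_vectorized(formula, X):
    """Evaluate a sympy expression on every row of X (shape (n, d)) at once."""
    symbols = [sp.Symbol(f'x_{j + 1}') for j in range(X.shape[1])]
    # modules='numpy' maps sympy functions (Abs, exp, ...) onto numpy ufuncs
    f = sp.lambdify(symbols, formula, modules='numpy')
    out = f(*[X[:, j] for j in range(X.shape[1])])
    # A constant expression returns a scalar; broadcast it to one value per row
    return np.broadcast_to(np.asarray(out, dtype=float), (X.shape[0],))
```

This should match the loop version up to floating-point differences (`evalf` uses arbitrary precision, lambdify uses float64).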

Then I generate predictions using the formula like this:

# Get results using symbolic formula
preds_from_kan_formula = kan_symbolic_formula_prediction(formula, X_train.to_numpy())

Here are the metrics for model.forward() and for the symbolic formula, respectively:

print("MAE from formula on train data", mean_absolute_error(train_labels.numpy(), preds_from_kan_formula))
print("R2 from formula on train data", r2_score(train_labels.numpy(), preds_from_kan_formula))

>>> MAE from formula on train data 0.15750130512335783
>>> R2 from formula on train data -0.47894340657227064

print("MAE from model.forward() on train data", mean_absolute_error(train_labels.numpy(), train_preds.numpy()))
print("R2 from model.forward() on train data", r2_score(train_labels.numpy(), train_preds.numpy()))

>>> MAE from model.forward() on train data 0.04164282
>>> R2 from model.forward() on train data 0.8345471009872176
As you can see, the symbolic formula performs badly. What do you think about this?

Here is my full code => https://www.kaggle.com/code/seyidcemkarakas/kan-regression-graduate-admissions

seyidcemkarakas avatar May 16 '24 15:05 seyidcemkarakas

Hi, since only a limited library of symbolic formulas is provided, it could be that the real symbolic formula is not supported in the library, or even the formula is not symbolic at all. It might be helpful to stare at the learned KAN plot a bit by calling model.plot(), trying to get a sense of what's going on. For example, are there any activation functions that look particularly suspicious?

KindXiaoming avatar May 17 '24 02:05 KindXiaoming

Hi @KindXiaoming

It has been a long time since I opened this issue. I have rerun my code and am back with an update.

Here is the story:

I have a regression case.

My df contains 7 inputs and 1 output. To keep things simple, I took only the first 2 columns as features.

target_column_name = "Chance of Admit "

X = df[list(df.columns.drop([target_column_name]+["Serial No."]))[0:2]]
y = df[target_column_name] 

# Split whole data to train and remainings
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=0)

# Split remainings data to val and test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# Convert data to torch tensors
train_input = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
train_label = torch.tensor(y_train.to_numpy()[:, None], dtype=torch.float32)
val_input = torch.tensor(X_val.to_numpy(), dtype=torch.float32)
val_label = torch.tensor(y_val.to_numpy()[:, None], dtype=torch.float32)
test_input = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
test_label = torch.tensor(y_test.to_numpy()[:, None], dtype=torch.float32)

dataset = {
    'train_input': train_input,
    'train_label': train_label,
    'val_input': val_input,
    'val_label': val_label,
    'test_input': test_input,
    'test_label': test_label
}

and I have trained my KAN model =>

# Create KAN
model = KAN(width=[len(X.columns),3,1], grid=5, k=11)

# Train KAN
results = model.train({'train_input': train_input, 'train_label': train_label, 'test_input': val_input, 'test_label': val_label},
                      opt="LBFGS", steps=50, loss_fn=torch.nn.MSELoss()) 

After training, I simply evaluated my model's performance like this =>

# Predictions of train val and test datasets
test_preds = model.forward(test_input).detach()
test_labels = test_label

train_preds = model.forward(train_input).detach()
train_labels = train_label

val_preds = model.forward(val_input).detach()
val_labels = val_label


# Evaluate metrics
print("Train R2 Score:", r2_score(train_labels.numpy(), train_preds.numpy()))
print("Train MAE:", mean_absolute_error(train_labels.numpy(), train_preds.numpy()))

print("Val R2 Score:", r2_score(val_labels.numpy(), val_preds.numpy()))
print("Val MAE:", mean_absolute_error(val_labels.numpy(), val_preds.numpy()))

print("Test R2 Score:", r2_score(test_labels.numpy(), test_preds.numpy()))
print("Test MAE:", mean_absolute_error(test_labels.numpy(), test_preds.numpy()))

Here are the performance metrics:

Train R2 Score: 0.7171265138589901
Train MAE: 0.05722126
Val R2 Score: 0.5901170704369715
Val MAE: 0.06389203
Test R2 Score: 0.686370707877747
Test MAE: 0.05353796

Now I want to plot the model:

model.plot(scale=2)

[model.plot() output: diagram of the learned KAN and its activation functions]

As you can see, all of the activation functions look OK; nothing is suspicious.

Then I tried symbolic regression using only x and abs:

lib = ['x','abs']
model.auto_symbolic(lib=lib)
#lib = ['x','x^2','x^3','x^4','exp','log','sqrt','tanh','sin','abs']
#model.auto_symbolic()

It ran successfully. Here is my formula =>

model.symbolic_formula()
([0.02*Abs(0.56*x_2 - 8.33) - 2.09], [x_1, x_2])

Firstly: why do I see only x_2 in this formula, even though x_1's activation functions look dark (not pale)? As far as I know, a pale activation function means the corresponding feature is not important to the model. That is my first question.

Secondly: I created this function to evaluate the given formula:

def kan_symbolic_formula_prediction(formula, X):
    batch = X.shape[0]
    predictions = []  # Empty list for keeping predictions

    for i in range(batch):
        # Substitute one row of X into the symbolic formula
        expression = formula
        for j in range(X.shape[1]):
            expression = expression.subs(f'x_{j+1}', X[i, j])
        
        # Get output of formula
        predicted = float(expression.evalf())
        
        predictions.append(predicted)
    
    return predictions

I tested it manually to check that the function works correctly:

# Manual prediction (optional)
manual_single_inputs = [1, 1]
kan_symbolic_formula_prediction(formula, pd.DataFrame([manual_single_inputs]).to_numpy())

>>> [-1.9345999999999999]

The function works correctly => checked.
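As a side note, the printed formula can be re-typed directly in sympy to double-check the manual value (a sketch; the expression below is copied from the `model.symbolic_formula()` output above):

```python
import sympy as sp

x1, x2 = sp.symbols('x_1 x_2')
# Re-typed from the printed output of model.symbolic_formula()
formula = 0.02 * sp.Abs(0.56 * x2 - 8.33) - 2.09

# Same manual input [1, 1] as above
value = float(formula.subs({x1: 1, x2: 1}).evalf())
print(value)  # ≈ -1.9346, matching the manual check
```

Notably, the formula outputs values around -2 here, while the regression targets appear to lie in a much narrower range, so a large MAE would be expected from the output scale alone.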

And here is the rest of the story => the symbolic formula still does not match the main model:

# Get results using symbolic formula
preds_from_kan_formula = kan_symbolic_formula_prediction(formula, X_train.to_numpy())

print("MAE from formula on train data",mean_absolute_error(train_labels.numpy(),preds_from_kan_formula))
print("R2 from formula on train data",r2_score(train_labels.numpy(), preds_from_kan_formula))

>>> MAE from formula on train data 1.7811659989688393
>>> R2 from formula on train data -155.2165170931922

print("MAE from model.forward() on train data",mean_absolute_error(train_labels.numpy(), train_preds.numpy()))
print("R2 from model.forward() on train data",r2_score(train_labels.numpy(), train_preds.numpy()))

>>> MAE from model.forward() on train data 0.05722126
>>> R2 from model.forward() on train data 0.7171265138589901

What are your thoughts?

seyidcemkarakas avatar Jul 21 '24 15:07 seyidcemkarakas

@KindXiaoming Hi Xiaoming. Sorry for this very low-level question: I was facing much the same problem, but my input dimension is 20 (so I set the intermediate layer width to 41). When I plot the model, the activation functions are not shown. Is there any way to view the whole plot instead of going into the figures folder and matching the indices one by one?

austinmyc avatar Sep 22 '24 14:09 austinmyc

Symbolic Formula Representation for Binary Classification Task:

I attempted to obtain a symbolic formula representation in the context of Explainable AI (XAI). The dataset details are as follows:

Task: binary classification
Data points: 2,095
Features: 21 (including class labels)
Imbalance handling: SMOTE was applied to address class imbalance.

The KAN model showed good performance in its predictions. However, when extracting the symbolic formula using model.symbolic_formula, I observed the following:

Accuracy of the formula: the formula achieved an accuracy of 0.50 on the training set and 0.84 on the test set. Upon analysis, I noticed that the formula only predicts class 0: the 50% training accuracy corresponds to predicting class 0 for half of the balanced data (due to SMOTE), and the 84% test accuracy matches the test set containing 84% class-0 samples.

Behavior of formula 2: the values from formula 2 were consistently negative across all datasets. Because formula 2 always produces negative values, they are always smaller than the formula 1 values, leading to predictions favoring class 0.

Any suggestions to improve the accuracy of the symbolic formula would be appreciated. Thanks in advance.
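One way to debug this is to evaluate both class formulas over the whole dataset and compare their value ranges directly. Below is a sketch; it assumes `model.symbolic_formula()` returned one sympy expression per output class and that class predictions are taken as the argmax over the two formula values (`symbolic_class_predictions` is a hypothetical helper, not a pykan API):

```python
import numpy as np
import sympy as sp

def symbolic_class_predictions(formulas, X):
    """Evaluate one symbolic formula per class and take the argmax per row.

    formulas: list of sympy expressions in symbols x_1 ... x_d
    X: (n, d) numpy array
    Returns (predicted class per row, (n, n_classes) score matrix).
    """
    symbols = [sp.Symbol(f'x_{j + 1}') for j in range(X.shape[1])]
    fs = [sp.lambdify(symbols, f, modules='numpy') for f in formulas]
    cols = [X[:, j] for j in range(X.shape[1])]
    # broadcast_to handles constant formulas, which evaluate to a scalar
    scores = np.column_stack(
        [np.broadcast_to(np.asarray(fn(*cols), dtype=float), (X.shape[0],)) for fn in fs]
    )
    return scores.argmax(axis=1), scores
```

If the second score column is negative everywhere while the first is not, every row gets class 0, which matches the behavior described above; plotting the two score distributions should make the gap visible.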

SaranDS avatar Nov 20 '24 09:11 SaranDS