AutoMLSearch execution leads to Segmentation Fault

Open akramIOT opened this issue 2 years ago • 5 comments


PROBLEM:

AutoMLSearch execution leads to a segmentation fault:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Code Sample, a copy-pastable example to reproduce your bug.

Environment:

(serverless-machine-learning) akram@ISHERIFF-M-RBNA models % uname -a
Darwin ISHERIFF-M-RBNA 21.5.0 Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64 x86_64
(serverless-machine-learning) akram@ISHERIFF-M-RBNA models % python3 -V
Python 3.9.7

# Your code here

## Evaluate different models using the AutoML framework "EvalML" in this module.

print("\nImporting to Auto-ML based Training ...##")

import evalml  # AutoML library; required only for automatic data cleaning and preprocessing without manual steps
from PreProcess_Data import Xtrain,Xtest,Ytrain,Ytest
from evalml import AutoMLSearch
evalml.problem_types.ProblemTypes.all_problem_types

from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

X_train, X_test, y_train, y_test = Xtrain,Xtest,Ytrain,Ytest

print("\n\n\tRunning Auto ML based  training\n")

automl = AutoMLSearch(X_train=Xtrain, y_train=Ytrain, problem_type='binary')
print(automl.search())

automl.rankings
print(automl.best_pipeline)

best_pipeline=automl.best_pipeline

print(best_pipeline)

#GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean',
#                            'categorical_fill_value': None, 'numeric_fill_value': None}, 'Logistic Regression Classifier'
#:{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'},})

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

### Evaluate on hold out of the data samples
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])

automl_auc.rankings
automl_auc.describe_pipeline(automl_auc.rankings.iloc[0]["id"])

best_pipeline_auc = automl_auc.best_pipeline
# get the score on holdout data
best_pipeline_auc.score(X_test, y_test,  objectives=["auc"])

## Pickling the trained model
best_pipeline.save("AutomML_Eval_model.pkl")

check_model=automl.load('model.pkl')
check_model.predict_proba(X_test).to_dataframe()

I also debugged it with pdb, breakpoints, and print statements.

================================================================ OUTPUT:

/Users/akram/opt/anaconda3/envs/serverless-machine-learning/bin/python /Users/akram/AKRAM_CODE_FOLDER/ML/Washington_ML/serverless-machine-learning/ML_Proj_Template/ml1/models/Auto_Eval_Training.py

Importing to Auto-ML based Training ...##
::Reading of Input Data is Sucessfull::

      MI_dir_L5_weight  MI_dir_L5_mean  ...  HpHp_L0.01_covariance  HpHp_L0.01_pcc
0             1.000000       60.000000  ...           0.000000e+00    0.000000e+00
1             1.000000       60.000000  ...           0.000000e+00    0.000000e+00
2             1.000000       60.000000  ...           0.000000e+00    0.000000e+00
3             1.000000      590.000000  ...           0.000000e+00    0.000000e+00
4             1.927179      590.000000  ...           0.000000e+00    0.000000e+00
...                ...             ...  ...                    ...             ...
9994          1.000000      330.000000  ...           4.240000e-29    0.000000e+00
9995          1.998594      330.000000  ...          -1.110000e-28   -3.820000e-18
9996          1.000000       60.000016  ...           1.240000e-28    1.110000e-16
9997          1.000000      330.000000  ...           2.530000e-29    1.740000e-18
9998          1.999917      330.000000  ...          -6.640000e-29   -4.560000e-18

[9999 rows x 115 columns]
   MI_dir_L5_weight  MI_dir_L5_mean  ...  HpHp_L0.01_covariance  HpHp_L0.01_pcc
0          1.000000            60.0  ...                    0.0             0.0
1          1.000000            60.0  ...                    0.0             0.0
2          1.000000            60.0  ...                    0.0             0.0
3          1.000000           590.0  ...                    0.0             0.0
4          1.927179           590.0  ...                    0.0             0.0

[5 rows x 115 columns]
The shape of Input dataset is : (9999, 115)
The shape of Input malicious dataset is : (9999, 115)
Clean/ Benign Traffic is
0       1
1       1
2       1
3       1
4       1
       ..
9994    1
9995    1
9996    1
9997    1
9998    1
Name: Out, Length: 9999, dtype: int64
Malicious Traffic is
0       0
1       0
2       0
3       0
4       0
       ..
9994    0
9995    0
9996    0
9997    0
9998    0
Name: Out, Length: 9999, dtype: int64
Concatenated Data Shape is (19998, 116)
combined1 shape is (19998, 116)
After remove: (19998, 114)

The OUTPUT is : [0 1 1 ... 0 0 1]

OUTPUT SHAPE : (19998,)

Running Auto ML based  training

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
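
Exit code 139 follows the common shell convention of 128 plus the signal number, so it corresponds to signal 11 (SIGSEGV), as the message says. A minimal sketch of that arithmetic follows; `describe_exit_code` is a hypothetical helper for illustration and is not part of evalml:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell-style exit code into a readable description."""
    # By convention, shells report "killed by signal N" as exit code 128 + N.
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

A crash reported this way comes from native code (a compiled extension or library) rather than from a Python exception, which is why no Python traceback appears in the output above.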

Jul 27 '22 03:07 akramIOT

To confirm, you're running on a Mac (non-M1), correct?

It seems that the code snippet that you provided is incomplete. With a few minor corrections, I can run it with no errors.

How is automl_auc assigned?
What is model.pkl?
Is plt called at some point?
Have you tried running it with a different dataset?

Aug 01 '22 14:08 cp2boston

Yes, I am on a Mac.

Hardware Overview:

Model Name: MacBook Pro
Model Identifier: MacBookPro16,1
Processor Name: 6-Core Intel Core i7
Processor Speed: 2.6 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 12 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
System Firmware Version: 1731.140.2.0.0 (iBridge: 19.16.16064.0.0,0)
OS Loader Version: 540.120.3~19
Serial Number (system): C02FRBNAMD6M
Hardware UUID: B5D170ED-BA36-541D-81D0-2CB5FD5B0A39
Provisioning UDID: B5D170ED-BA36-541D-81D0-2CB5FD5B0A39
Activation Lock Status: Disabled

By using the confusion matrix API call as below, on code lines #220 - 222:

confusion_matrix = metrics.confusion_matrix(max_test, max_predictions)
plt.figure(figsize=(16, 14))
sns.heatmap(confusion_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");

The trained model is saved (pickled) using the model.pkl API call, and later this same model is loaded into memory to make the predictions.
Yes, plt is called on code line #203 to plot the results.
Yes, I tried a different dummy dataset from the SKlearn.load_dataset API, but I get the same error.

Aug 01 '22 21:08 akramIOT

Thanks for the info! A few things:

Have you tried running your code with the plot operations commented out?
I am not seeing any code that has the line numbers you're referencing. There are only about 50 lines in the snippet.
Could you enable faulthandler in your module? It might provide more insight into the code that is triggering the segfault.
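
A minimal sketch of enabling faulthandler, assuming it is called at the very top of the module before evalml is imported. `enable_crash_traces` and the optional log path are hypothetical names for illustration; only the standard-library `faulthandler` calls are real:

```python
import faulthandler
import sys


def enable_crash_traces(log_path=None):
    """Enable faulthandler and return the stream that will receive tracebacks."""
    # faulthandler writes through a raw file descriptor, so the target must be
    # a real file that stays open for the life of the process.
    stream = open(log_path, "w") if log_path else sys.__stderr__
    # all_threads=True dumps the Python stack of every thread on a fatal signal
    # such as SIGSEGV, not only the thread that crashed.
    faulthandler.enable(file=stream, all_threads=True)
    return stream
```

Calling `enable_crash_traces()` before `import evalml` should make a crash print the Python-level stack that was active when the signal arrived. The same handler can also be turned on without code changes by running `python -X faulthandler Auto_Eval_Training.py` or by setting the `PYTHONFAULTHANDLER` environment variable.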

Aug 02 '22 12:08 cp2boston

Yes, I tried running with the plot operations commented out, but I still see the same issue.
Please find the code below.

# Your code here

## Evaluating Different Models by using the Auto-ML framework ""EVALML"" in this module.

print("\nImporting to Auto-ML based Training ...##")

import evalml ## AutoML technique to be used here This package is required only if you are doing automatic Data cleaning and Pre-processing without any Manual steps.
from PreProcess_Data import Xtrain,Xtest,Ytrain,Ytest
from evalml import AutoMLSearch
evalml.problem_types.ProblemTypes.all_problem_types

from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

X_train, X_test, y_train, y_test = Xtrain,Xtest,Ytrain,Ytest

print("\n\n\tRunning Auto ML based training\n")

automl = AutoMLSearch(X_train=Xtrain, y_train=Ytrain, problem_type='binary')
print(automl.search())

automl.rankings
print(automl.best_pipeline)

best_pipeline=automl.best_pipeline

print(best_pipeline)

#GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean',
#                            'categorical_fill_value': None, 'numeric_fill_value': None}, 'Logistic Regression Classifier'
#:{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'},})

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

### Evaluate on hold out of the data samples
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])

automl_auc.rankings
automl_auc.describe_pipeline(automl_auc.rankings.iloc[0]["id"])

best_pipeline_auc = automl_auc.best_pipeline
# get the score on holdout data
best_pipeline_auc.score(X_test, y_test, objectives=["auc"])

## Pickling the trained model
best_pipeline.save("AutomML_Eval_model.pkl")

check_model=automl.load('model.pkl')
check_model.predict_proba(X_test).to_dataframe()

Aug 02 '22 15:08 akramIOT

That looks like the same code as your original snippet. It has no line 202 or lines 220 - 222.

Aug 02 '22 15:08 cp2boston
