FileNotFoundError for bps_prediction.joblib when opening pickled model

Open muraiki opened this issue 2 years ago • 10 comments

I trained a model on one computer and pickled it with joblib.dump. On another computer, I loaded the model with joblib.load and then got a FileNotFoundError, because the model tries to open bps_prediction.joblib from its path on the original computer, which differs on the new computer.

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/var/folders/kn/bmgjf0611zsc41h256xmsz8c_nfjy5/T/ipykernel_15641/741748264.py in <module>
----> 1 clf.decision_function(X_positive)

~/.pyenv/versions/3.7.10/envs/$VIRTUALENV_NAME/lib/python3.7/site-packages/pyod/models/suod.py in decision_function(self, X)
    258 
    259         # initialize the output score
--> 260         predicted_scores = self.model_.decision_function(X)
    261 
    262         # standardize the score and combine

~/.pyenv/versions/3.7.10/envs/$VIRTUALENV_NAME/lib/python3.7/site-packages/suod/models/base.py in decision_function(self, X)
    452         if self.bps_flag:
    453             # load the pre-trained cost predictor to forecast the train cost
--> 454             cost_predictor = joblib.load(self.cost_forecast_loc_pred_)
    455 
    456             time_cost_pred = cost_forecast_meta(cost_predictor, X,

~/.pyenv/versions/3.7.10/envs/$VIRTUALENV_NAME/lib/python3.7/site-packages/joblib/numpy_pickle.py in load(filename, mmap_mode)
    575             obj = _unpickle(fobj)
    576     else:
--> 577         with open(filename, 'rb') as f:
    578             with _read_fileobject(f, filename, mmap_mode) as fobj:
    579                 if isinstance(fobj, str):

FileNotFoundError: [Errno 2] No such file or directory: '/home/local/$FOO/$USERNAME/.pyenv/versions/$VIRTUALENV_NAME/lib/python3.7/site-packages/suod/models/saved_models/bps_prediction.joblib'

I've omitted some of the exact values from the training system. $FOO is a directory that contains my home directory which is $USERNAME. $VIRTUALENV_NAME is the name of the virtual environment I created using pyenv virtualenv 3.7.10 $VIRTUALENV_NAME.

It looks like when the model is trained, the path to the pre-trained cost predictor is saved in the model object itself, which prevents the model from being used on a computer where that path is different.
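A quick way to confirm this is to print the path the loaded object carries and check whether it exists on the current machine. The sketch below is only a diagnostic under assumptions: the attribute names model_ and cost_forecast_loc_pred_ are taken from the traceback above, the helper name report_cost_predictor_paths is hypothetical, and a stand-in object replaces the real model so the snippet runs without pyod installed.

```python
import os
from types import SimpleNamespace


def report_cost_predictor_paths(clf, attr="cost_forecast_loc_pred_"):
    """Return (owner, path, exists) for each place the path attribute is found."""
    found = []
    # Look on the object itself and, if present, on the wrapped model in `model_`
    # (both names come from the traceback, not from documented API).
    for label, obj in (("clf", clf), ("clf.model_", getattr(clf, "model_", None))):
        path = getattr(obj, attr, None) if obj is not None else None
        if isinstance(path, str):
            found.append((label, path, os.path.exists(path)))
    return found


# Stand-in for a loaded model; with the real thing, pass the result of joblib.load.
fake = SimpleNamespace(model_=SimpleNamespace(
    cost_forecast_loc_pred_="/path/from/training/machine/bps_prediction.joblib"))

result = report_cost_predictor_paths(fake)
for owner, path, exists in result:
    print(owner, path, "exists" if exists else "MISSING")
```

With a real model, a MISSING entry would point at the attribute that still holds the training machine's path.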

I tried manually setting clf.cost_forecast_loc_pred to the correct path to bps_prediction.joblib, but still got the same error. I don't have access to create a symlink to point to the original location. How can I get the object to load bps_prediction.joblib from the correct path?

muraiki avatar Oct 04 '21 14:10 muraiki

Noted. We have not considered the use case of saving a SUOD model. This may be a bit involved, since bps_prediction.joblib should be part of the suod package. Would you mind sharing a minimal example with a synthetic dataset for reproduction purposes?

yzhao062 avatar Oct 05 '21 03:10 yzhao062

Thank you for your quick response! Here's an example.

First, run this on one computer:

from pyod.models.suod import SUOD
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.copod import COPOD
from pyod.utils.utility import standardizer
from pyod.utils.data import generate_data
import joblib

contamination = 0.1
n_train = 200
n_test = 100

X_train, y_train, X_test, y_test = \
    generate_data(n_train=n_train,
                  n_test=n_test,
                  contamination=contamination,
                  random_state=42)
X_train, X_test = standardizer(X_train, X_test)

detector_list = [
    LOF(contamination=contamination, n_neighbors=10),
    LOF(contamination=contamination, n_neighbors=20),
    COPOD(contamination=contamination),
    IForest(contamination=contamination, n_estimators=100, max_samples=0.1),
    IForest(contamination=contamination, n_estimators=100, max_samples=0.1, max_features=0.5)
]

clf = SUOD(
    base_estimators=detector_list,
    contamination=contamination,
    n_jobs=1,
    combination='average',
    verbose=1
)

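clf_name = 'SUOD'

# Note: the definitions below overwrite detector_list and clf from above;
# only this second SUOD instance is fitted and saved.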
detector_list = [LOF(n_neighbors=15), LOF(n_neighbors=20),
                 LOF(n_neighbors=25), LOF(n_neighbors=35),
                 COPOD(), IForest(n_estimators=100),
                 IForest(n_estimators=200)]

clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)

clf.fit(X_train)

joblib.dump(clf, 'model.pkl.bz2')

Then copy model.pkl.bz2 to another computer where the path to the virtualenv containing pyod/suod differs:

import joblib
from pyod.utils.utility import standardizer
from pyod.utils.data import generate_data

contamination = 0.1
n_train = 200
n_test = 100

clf = joblib.load('model.pkl.bz2')

X_train, y_train, X_test, y_test = \
    generate_data(n_train=n_train,
                  n_test=n_test,
                  contamination=contamination,
                  random_state=42)
X_train, X_test = standardizer(X_train, X_test)

clf.predict(X_test)

This problem will likely also occur if another user on the same computer that generated the model tries to load and predict with the model, assuming the second user lacks permission to access the virtualenv contained in the original user's home directory.

muraiki avatar Oct 05 '21 15:10 muraiki

Hello!

I faced the same issue. You can either specify the location of bps_prediction.joblib (I read that you already tried that and it didn't work), or save the model together with bps_prediction.joblib in a folder where the other user has the needed permissions, and then set cost_forecast_loc_pred to that new location. It worked for me.

lecorveclucas avatar Oct 07 '21 09:10 lecorveclucas

Thanks @lecorveclucas! Unfortunately, I'm operating in environments where I generally have limited permissions, so this isn't always an option for me. For instance, a process might build a model expecting a certain directory structure for the virtual environment, and then when the model is run on another system, I might not be able to recreate that structure.

muraiki avatar Oct 07 '21 14:10 muraiki

Alright, but there is something I don't get: you can dump the trained model somewhere and then reuse it, but you can't put bps_prediction.joblib in the same folder? Sorry, I may have misunderstood your answer. I don't see why another user could load the trained model but not a bps_prediction.joblib sitting in the same folder.

lecorveclucas avatar Oct 07 '21 14:10 lecorveclucas

@lecorveclucas : The problem is that the model internally stores where it expects to find bps_prediction.joblib: it doesn't look for it in the current directory. It tries to find it in the folder containing the installed SUOD library, which can vary by user and machine.

According to the linked code I should be able to just modify .cost_forecast_loc_pred_. The type of the object I modified is pyod.models.suod.SUOD, so I think I'm setting this correctly... I'm at a loss as to what is going on.

muraiki avatar Oct 08 '21 16:10 muraiki

This is the code that sets the default self.cost_forecast_loc_pred_. this_directory ends up as the location of the suod folder, so the default path looks like: $HOME/.pyenv/versions/$VIRTUALENV_NAME/lib/python3.7/site-packages/suod/models/saved_models/bps_prediction.joblib
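To see where that default would land on the current machine, the same kind of path can be derived from wherever the package is installed locally. This is a rough sketch, not suod's own code: the models/saved_models layout is read off the traceback path above, and the helper name default_bps_path is hypothetical.

```python
import importlib.util
import os


def default_bps_path(package="suod"):
    """Path where the bundled predictor would sit for the local install, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        # Package not installed (or has no file location) in this environment.
        return None
    package_dir = os.path.dirname(spec.origin)
    # Sub-path assumed from the traceback: suod/models/saved_models/bps_prediction.joblib
    return os.path.join(package_dir, "models", "saved_models", "bps_prediction.joblib")


print(default_bps_path())
```

Comparing this value with the path stored in the pickled model shows directly whether the two environments disagree.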

muraiki avatar Oct 08 '21 16:10 muraiki

Indeed, I had the same issue with this_directory. I considered changing it (to something like this_directory = os.getcwd()), because it is used ONLY to build the paths for cost_forecast_loc_fit and cost_forecast_loc_pred. But since I can save bps_prediction.joblib in a folder with the trained model (and then specify the new location via cost_forecast_loc_pred), I left this_directory alone. Did you try changing it?

Sorry if this is a stupid question, but did you consider using Docker to get the same environment on both of your machines? It could greatly simplify using and reusing your models :)

lecorveclucas avatar Oct 08 '21 16:10 lecorveclucas

@lecorveclucas : No, I didn't alter the source to change this_directory. But I did get the following to work:

  1. In the code on the system that trains the model, I copied bps_prediction.joblib to the working directory of the scripts and added the following to the call to SUOD(): cost_forecast_loc_pred='./bps_prediction.joblib'.
  2. When I loaded the pickled model on another system, the string representation of the SUOD object had: cost_forecast_loc_pred='./bps_prediction.joblib'. Deleting this file causes a failure, which confirms that the copy in the current working directory is the one being used.

After doing the above, I'm able to run decision_function()! In the end, it looks like I need to set cost_forecast_loc_pred at model training time to an easily accessible path, as something prevents the model object from recognizing that this has been changed once the model has been trained.
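The copy in step 1 can be scripted. This is a minimal sketch, assuming the source file is readable on the training machine; the helper name stage_cost_predictor is hypothetical, and a placeholder file stands in for the real bps_prediction.joblib so the snippet runs on its own.

```python
import os
import shutil
import tempfile


def stage_cost_predictor(src_path, work_dir="."):
    """Copy the predictor file into work_dir and return a path relative to it."""
    dest = os.path.join(work_dir, os.path.basename(src_path))
    shutil.copyfile(src_path, dest)
    return "./" + os.path.basename(src_path)


# Demonstration with a placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "site-packages-stand-in", "bps_prediction.joblib")
    os.makedirs(os.path.dirname(src))
    with open(src, "wb") as f:
        f.write(b"placeholder")
    work = os.path.join(tmp, "work")
    os.makedirs(work)

    rel = stage_cost_predictor(src, work)
    staged_ok = os.path.exists(os.path.join(work, "bps_prediction.joblib"))
    print(rel, staged_ok)
```

The returned relative path is what would be passed as cost_forecast_loc_pred= to SUOD() at training time, as in step 1. The file then has to travel with the pickled model and sit in the working directory of whatever process loads it.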

Docker would address the environment problem, but it'd also present other tasks in terms of getting approval and managing the security of the images. What I've described is working for me, but I'd like to figure out why I can't change cost_forecast_loc_pred after the fact... I need to dig into this more, but perhaps the original value or the pre-trained model is being cached somewhere else in the object.
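One possibility that fits the traceback: pyod's decision_function delegates to self.model_.decision_function, and the failing joblib.load reads self.cost_forecast_loc_pred_ inside suod/models/base.py. That would mean the path used at prediction time lives on the inner clf.model_ object, so changing an attribute on the outer clf would have no effect. The sketch below sets the attribute wherever it already exists. It is untested against a real pickled SUOD model, the helper name repoint_cost_predictor is hypothetical, and stand-in objects are used so it runs without pyod.

```python
from types import SimpleNamespace


def repoint_cost_predictor(clf, new_path, attr="cost_forecast_loc_pred_"):
    """Set the path attribute wherever it already exists; return the owners updated."""
    updated = []
    # `model_` and the attribute name are taken from the traceback, not from docs.
    for label, obj in (("clf", clf), ("clf.model_", getattr(clf, "model_", None))):
        if obj is not None and hasattr(obj, attr):
            setattr(obj, attr, new_path)
            updated.append(label)
    return updated


# Stand-in mimicking a wrapper whose inner model holds the stale path.
inner = SimpleNamespace(cost_forecast_loc_pred_="/old/machine/bps_prediction.joblib")
outer = SimpleNamespace(model_=inner)

updated = repoint_cost_predictor(outer, "./bps_prediction.joblib")
print(updated, inner.cost_forecast_loc_pred_)
```

If this guess is right, calling the helper on a loaded model before decision_function() or predict() would redirect the load. If the error persists, the path is being held somewhere else.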

muraiki avatar Oct 11 '21 18:10 muraiki

@muraiki, I have just found your response to this problem. I have tried the first part at training time, but I am not sure how to proceed with what you described in the second point. I am following the instructions for loading a model given here: https://suod.readthedocs.io/en/latest/model_persistence.html. Thanks in advance.

jccguma avatar Nov 02 '23 16:11 jccguma