
Runner for parallel sklearn benchmark

Open petrhrobar opened this issue 2 years ago • 3 comments

Hey Vincent,

I ran into an interesting case with your Runner object for running functions in parallel.

Let's take an example of an experiment you showed here (https://www.youtube.com/watch?v=qcrR-Hd0LhI&ab_channel=PyData).

Is it possible to run this benchmark in parallel using the Runner object from the memo library? In the documentation, all parameters are passed to the function via a grid. However, it does not make sense to put X and y into the grid.
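(If I understand grid correctly, it expands each keyword list into one settings dictionary per combination, so a list like X would get split across runs rather than passed whole; a small illustration, assuming memo's grid as used below:

from memo import grid

# grid expands each keyword list into every combination,
# producing one settings dict per experiment run.
settings = grid(model=["lr", "xgb"], train_size=[1, 2])
# -> [{"model": "lr", "train_size": 1}, {"model": "lr", "train_size": 2},
#     {"model": "xgb", "train_size": 1}, {"model": "xgb", "train_size": 2}]
)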

If we run it like this - e.g.:

X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]

y = np.array([1, 1, 1, 0, 0, 0])

settings = grid(
    model=['lr', 'random_forest', 'ada', 'xgb'],
    emb=['bp', 'ft', 'spacy', 'cv-ngram'],
    train_size=np.arange(1, 4, 1),
    test_size=[1]
)

%%time
Runner(backend="threading", n_jobs=4).run(experiment, settings)

we are going to have a problem, because we are not passing it X and y.

I have also tried:


from functools import partial

partial_version = partial(experiment, X=X, y=y)

Runner(backend="threading", n_jobs=4).run(partial_version, settings)

This works, however it logs the full X and y text into the logging JSON file, which is not very efficient:

{"X":["i really like this post","thanks for that comment","i enjoy this friendly forum","this is a bad post","i dislike this article","this is not well written"],"y":[1,1,1,0,0,0],"model":"xgb","emb":"cv-ngram","train_size":1,"test_size":1,"accuracy_test":0.0,"accuracy_train":1.0,"pred_time":0.06557869911193848,"time_taken":0.68}
{"X":["i really like this post","thanks for that comment","i enjoy this friendly forum","this is a bad post","i dislike this article","this is not well written"],"y":[1,1,1,0,0,0],"model":"xgb","emb":"spacy","train_size":1,"test_size":1,"accuracy_test":0.0,"accuracy_train":1.0,"pred_time":0.15421748161315918,"time_taken":0.79}
{"X":["i really like this post","thanks for that comment","i enjoy this friendly forum","this is a bad post","i dislike this article","this is not well written"],"y":[1,1,1,0,0,0],"model":"xgb","emb":"spacy","train_size":3,"test_size":1,"accuracy_test":0.0,"accuracy_train":0.6666666666666666,"pred_time":0.11142969131469727,"time_taken":0.83}


Is there any other workaround?

petrhrobar avatar Feb 21 '22 13:02 petrhrobar

The Runner was added a few months after that video, so I'm not 100% sure it's completely compatible.

There are two comments on doing these grids for sklearn stuff though.

  1. The reason why I'm using memo and not GridSearchCV in that video is that the embeddings don't pickle nicely. That's why I needed an alternative format.
  2. For some sklearn stuff you won't want to use the Runner. Some models accept an n_jobs=-1 parameter, which will take all the CPUs on your machine. That means you might incur a penalty for using the Runner on top; see the sketch after this list.
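For that second point, a minimal sketch of the trade-off (RandomForestClassifier here is just a stand-in for any estimator with an n_jobs parameter):

from sklearn.ensemble import RandomForestClassifier

# Option A: let the estimator parallelize internally and skip the Runner,
# running the settings one after another.
rf_parallel = RandomForestClassifier(n_estimators=500, n_jobs=-1)

# Option B: pin each estimator to a single core so the Runner's
# threads don't oversubscribe the CPUs.
rf_pinned = RandomForestClassifier(n_estimators=500, n_jobs=1)
# Runner(backend="threading", n_jobs=4).run(experiment, settings)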

Could you share the entire script that you ran though? The talk has been a while, so it'd help to see the full code that you tried running.

koaning avatar Feb 21 '22 15:02 koaning

Hey Vincent,

Sorry for the delay and for not providing an example in the first place.

Here is a simple example for my use case with small data.

I found a workaround for my question, which is to explicitly force the output dictionary to return X and y as None.

import time

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, FeatureUnion

from tokenwiser.textprep import Cleaner

from memo import memfile, time_taken, grid, Runner


def generate_model(emb="cv-ngram", model="lr"):
    # NOTE: only "lr" is wired up in this trimmed-down example;
    # emb is unused here because both vectorizers go into the union.
    models = {
        "lr": LogisticRegression(solver="liblinear", class_weight="balanced"),
    }

    union = FeatureUnion(
        [
            ("cv", CountVectorizer()),
            ("cv-ngram", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
        ]
    )
    mod = make_pipeline(Cleaner(), union, models[model])
    return mod



@memfile('/SFS/user/ry/hrobar/msd_projects/nlp/scripts/models_results.jsonl')
@time_taken()
def experiment(
    X: list,
    y: list,
    model,
    train_size: int = 3,
    test_size: int = 2,
):
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=test_size,
        # stratify=y,
        random_state=10,
    )

    X_train, y_train = list(X_train[:train_size]), list(y_train[:train_size])
    X_test, y_test = list(X_test), list(y_test)

    mod = generate_model(model=model)
    mod.fit(X_train, y_train)

    y_train_pred = mod.predict(X_train)
    tic = time.time()
    y_test_pred = mod.predict(X_test)
    toc = time.time()
    return {
        "X": None,  # nice workaround: keeps the text out of the log and plays well with the Runner
        "y": None,  # nice workaround: keeps the labels out of the log
        "accuracy_test": np.mean(y_test == y_test_pred),
        "accuracy_train": np.mean(y_train == y_train_pred),
        "test_size": len(y_test),
        "train_size": len(y_train),
        "pred_time": toc - tic,
    }



X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written"
]

y = np.array([1, 0, 1, 0, 1, 0])

settings = grid(
    model=['lr'],
    train_size=np.arange(2, 4, 1),
    test_size=[2]
)



from functools import partial
partial_version = partial(experiment, X=X, y=y)
Runner(backend="threading", n_jobs=12).run(partial_version, settings)

petrhrobar avatar Feb 22 '22 08:02 petrhrobar

Another alternative is to keep X and y around as global variables, or to pass the name of the file that needs to be loaded (that's what I usually do).
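A minimal sketch of the file-name idea, assuming a hypothetical load_data helper and a data_path key in the grid (neither is part of memo itself):

from memo import memfile, grid, Runner

def load_data(path):
    # Hypothetical helper: one text example per line in the file.
    with open(path) as f:
        return [line.strip() for line in f]

@memfile("results.jsonl")  # hypothetical log path
def experiment(data_path, model="lr", train_size=3, test_size=2):
    X = load_data(data_path)  # only the path string ends up in the log
    # ... fit and evaluate exactly as in the script above ...
    return {"n_examples": len(X)}

settings = grid(
    data_path=["data/comments.txt"],  # hypothetical file
    model=["lr"],
    train_size=[2, 3],
    test_size=[2],
)
Runner(backend="threading", n_jobs=4).run(experiment, settings)

This way the grid only carries a short string, and each worker loads the data itself.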

koaning avatar Feb 22 '22 09:02 koaning