Runner for parallel sklearn benchmark
Hey Vincent,
I ran into an interesting case with your Runner object for running functions in parallel.
Let's take an example of an experiment you showed here (https://www.youtube.com/watch?v=qcrR-Hd0LhI&ab_channel=PyData).
Is it possible to run this benchmark in parallel using the Runner object from the memo library? In the documentation you pass all parameters via a grid, but it does not make sense to put X and y into the grid.
If we run it like this:
X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written",
]
y = np.array([1, 1, 1, 0, 0, 0])

settings = grid(
    model=['lr', 'random_forest', 'ada', 'xgb'],
    emb=['bp', 'ft', 'spacy', 'cv-ngram'],
    train_size=np.arange(1, 4, 1),
    test_size=[1],
)

%%time
Runner(backend="threading", n_jobs=4).run(experiment, settings)
we are going to have a problem, since we are not passing it X and y.
I have also tried:
from functools import partial
partial_version = partial(experiment, X=X, y=y)
Runner(backend="threading", n_jobs=4).run(partial_version, settings)
This works; however, it logs the full text of X and y into the JSON log file, which is not very efficient:
{"X":["i really like this post","thanks for that comment","i enjoy this friendly forum","this is a bad post","i dislike this article","this is not well written"],"y":[1,1,1,0,0,0],"model":"xgb","emb":"cv-ngram","train_size":1,"test_size":1,"accuracy_test":0.0,"accuracy_train":1.0,"pred_time":0.06557869911193848,"time_taken":0.68}
{"X":["i really like this post","thanks for that comment","i enjoy this friendly forum","this is a bad post","i dislike this article","this is not well written"],"y":[1,1,1,0,0,0],"model":"xgb","emb":"spacy","train_size":1,"test_size":1,"accuracy_test":0.0,"accuracy_train":1.0,"pred_time":0.15421748161315918,"time_taken":0.79}
{"X":["i really like this post","thanks for that comment","i enjoy this friendly forum","this is a bad post","i dislike this article","this is not well written"],"y":[1,1,1,0,0,0],"model":"xgb","emb":"spacy","train_size":3,"test_size":1,"accuracy_test":0.0,"accuracy_train":0.6666666666666666,"pred_time":0.11142969131469727,"time_taken":0.83}
Is there any other workaround?
The Runner got added a few months later, so I'm not 100% sure if it's completely compatible. There are two comments on doing these grids for sklearn stuff though:

- The reason why I'm using memo and not GridSearchCV in that video is that the embeddings don't pickle nicely. That's why I needed an alternative format.
- For some sklearn stuff you won't want to use the Runner. Some models allow an n_jobs=-1 parameter, which will take all the CPUs on your machine (see the sketch after this list). That means that you might incur a penalty for using the Runner.
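To make that second point concrete, here's a minimal sketch (RandomForestClassifier and the synthetic data are just for illustration, not from the talk): a model that already parallelises internally keeps every core busy on its own, so stacking the Runner's thread pool on top can oversubscribe the machine.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)

# The forest itself fans out over all CPUs via n_jobs=-1.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# A plain sequential loop already saturates the machine;
# wrapping these fits in Runner(backend="threading", n_jobs=4)
# would layer 4 threads on top of the model's own worker pool
# and can end up slower, not faster.
for seed in range(4):
    clf.set_params(random_state=seed)
    clf.fit(X, y)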
Could you share the entire script that you ran though? The talk was a while ago, so it'd help to see the full code that you tried running.
Hey Vincent,
Sorry for the delay and for not providing an example in the first place.
Here is a simple example for my use case with small data.
I found a workaround for my question: explicitly set X and y to None in the output dictionary.
import time

import numpy as np
from memo import memfile, time_taken, grid, Runner
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, FeatureUnion
from tokenwiser.textprep import Cleaner
def generate_model(emb="cv-ngram", model="lr"):
    # Note: `emb` is currently unused; the union below always applies both vectorizers.
    models = {
        "lr": LogisticRegression(solver="liblinear", class_weight="balanced"),
    }
    union = FeatureUnion(
        [
            ("cv", CountVectorizer()),
            ("cv-ngram", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
        ]
    )
    mod = make_pipeline(Cleaner(), union, models[model])
    return mod
@memfile('/SFS/user/ry/hrobar/msd_projects/nlp/scripts/models_results.jsonl')
@time_taken()
def experiment(
    X: list,
    y: list,
    model,
    train_size: int = 3,
    test_size: int = 2,
):
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=test_size,
        # stratify=y,
        random_state=10,
    )
    X_train, y_train = list(X_train[:train_size]), list(y_train[:train_size])
    X_test, y_test = list(X_test), list(y_test)
    mod = generate_model(model=model)
    mod.fit(X_train, y_train)
    y_train_pred = mod.predict(X_train)
    tic = time.time()
    y_test_pred = mod.predict(X_test)
    toc = time.time()
    return {
        "X": None,  # This is a nice workaround for logging and the Runner
        "y": None,  # This is a nice workaround for logging and the Runner
        "accuracy_test": np.mean(y_test == y_test_pred),
        "accuracy_train": np.mean(y_train == y_train_pred),
        "test_size": len(y_test),
        "train_size": len(y_train),
        "pred_time": toc - tic,
    }
X = [
    "i really like this post",
    "thanks for that comment",
    "i enjoy this friendly forum",
    "this is a bad post",
    "i dislike this article",
    "this is not well written",
]
y = np.array([1, 0, 1, 0, 1, 0])

settings = grid(
    model=['lr'],
    train_size=np.arange(2, 4, 1),
    test_size=[2],
)

from functools import partial

partial_version = partial(experiment, X=X, y=y)
Runner(backend="threading", n_jobs=12).run(partial_version, settings)
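With X and y forced to None in the returned dictionary, each row in models_results.jsonl now carries null placeholders instead of the full text; roughly (metric values elided):

{"X":null,"y":null,"model":"lr","train_size":2,"test_size":2,"accuracy_test":...,"accuracy_train":...,"pred_time":...,"time_taken":...}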
Another alternative is to keep X and y around as global variables, or to pass the name of the file that needs to be loaded (that's what I usually do).
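Here's a minimal sketch of that file-passing pattern (the read_data helper, the CSV layout, and the data/comments.csv path are assumptions for illustration, not part of memo): the path travels through the grid, the data is loaded inside the experiment, and only the short path string ends up in the log.

import numpy as np
import pandas as pd
from memo import memfile, grid, Runner
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def read_data(path):
    # Hypothetical helper: assumes a CSV with "text" and "label" columns.
    df = pd.read_csv(path)
    return list(df["text"]), df["label"].to_numpy()

@memfile("models_results.jsonl")
def experiment(data_path):
    X, y = read_data(data_path)  # the data is loaded here, not passed around
    mod = make_pipeline(CountVectorizer(), LogisticRegression())
    mod.fit(X, y)
    # memo logs `data_path` (a short string) instead of the full text of X
    return {"accuracy_train": float(np.mean(mod.predict(X) == y))}

settings = grid(data_path=["data/comments.csv"])
Runner(backend="threading", n_jobs=4).run(experiment, settings)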