Using Scikit-learn APIs directly

Open sheikhartin opened this issue 3 years ago • 11 comments
Yesterday (August 5 at 7 PM, Iran/Tehran time), I had a short conversation with @sonichi about this. In general, it is better to make such features easier for users to reach... Maybe you (FLAML maintainers) don't have any contest, but you should have features so that more developers will use your product...

Anyway, these are the scikit-learn files I imported from:

  • Pre-processing: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/__init__.py
  • Model selection: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/model_selection/__init__.py
  • Metrics: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/__init__.py

I don't think that there is a need to write a test, but we should be careful that a method (class and/or function) is not deprecated...

A quick example that shows how to use these APIs:

import pandas as pd
from flaml.utils import preprocessing
from flaml.utils import model_selection
from flaml.utils import metrics
from flaml import AutoML

# Loading data (nothing changed)
df = pd.read_csv('<a_random_dataset_that_needs_preprocessing.csv>')
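X = df[['field_no1', 'field_no2', 'field_no3', 'field_no4']].copy()  # copy so the assignment below does not write to a view of df
y = df['field_no5']

# Preprocessing: encode the categorical feature and the target with separate encoders
X['field_no3'] = preprocessing.LabelEncoder().fit_transform(X['field_no3'])
y = preprocessing.LabelEncoder().fit_transform(y)

# Separating the train and test data (note the return order: X_train, X_test, y_train, y_test)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=.2)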

# Training phase (nothing changed)
automl = AutoML()
automl.fit(X_train, y_train, task='classification')

# Measuring accuracy
y_pred = automl.predict(X_test)
print(metrics.classification_report(y_test, y_pred))

Or:

from flaml.utils import (
    LabelEncoder,
    train_test_split,
    classification_report,
)
from flaml import AutoML

Or even:

from flaml import (
    LabelEncoder,
    train_test_split,
    classification_report,
    AutoML,
)

sheikhartin avatar Aug 06 '22 04:08 sheikhartin

CLA assistant check
All CLA requirements met.

ghost avatar Aug 06 '22 04:08 ghost

Dude, contributing to Microsoft projects is really painful! :))

sheikhartin avatar Aug 06 '22 05:08 sheikhartin

> Dude, contributing to Microsoft projects is really painful! :))

The CLA signing is a one-time procedure. You won't need it for every PR.

Some checks failed: https://github.com/microsoft/FLAML/runs/7702402957?check_suite_focus=true#step:5:51

sonichi avatar Aug 06 '22 15:08 sonichi

@sonichi Looks like we need to upgrade to a higher version of Sklearn (1.1.2 based on PyPI). Do you know any other solution, except removing the items that give errors?!

sheikhartin avatar Aug 06 '22 15:08 sheikhartin

@sonichi I don't have enough experience in such situations, can you help?

sheikhartin avatar Aug 06 '22 15:08 sheikhartin

> @sonichi Looks like we need to upgrade to a higher version of Sklearn (1.1.2 based on PyPI). Do you know any other solution, except removing the items that give errors?!

I would suggest using a version check of sklearn to decide what to import. Only import what is available. Do not increase the sklearn version in setup.py.
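A minimal sketch of what such a version check could look like, offered only as an illustration and not as code from the PR. The helper names `version_tuple`, `installed_version`, and `import_available` are hypothetical, as is the name `SomeNewerEncoder` and the `(1, 1)` threshold; only the standard library calls (`importlib.import_module`, `importlib.metadata.version`) are real.

```python
import re
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version


def version_tuple(raw):
    """Turn a version string such as '1.1.2' into a tuple of ints, e.g. (1, 1, 2)."""
    parts = []
    for piece in raw.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:  # stop at non-numeric pieces such as 'dev0'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(dist_name):
    """Return the installed version of a distribution as a tuple, or None if absent."""
    try:
        return version_tuple(version(dist_name))
    except PackageNotFoundError:
        return None


def import_available(module_name, names):
    """Return {name: object} for only those names the module actually exposes."""
    try:
        module = import_module(module_name)
    except ImportError:
        return {}
    return {name: getattr(module, name) for name in names if hasattr(module, name)}


# Hypothetical usage: add newer names only when the installed sklearn is new enough.
wanted = ["LabelEncoder", "OneHotEncoder"]
sk_version = installed_version("scikit-learn")
if sk_version is not None and sk_version >= (1, 1):
    wanted.append("SomeNewerEncoder")  # hypothetical name, for illustration only
exported = import_available("sklearn.preprocessing", wanted)
print(sorted(exported))
```

Because `import_available` silently skips names that are missing, the re-export keeps working on older scikit-learn releases without raising the minimum version in setup.py.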

sonichi avatar Aug 06 '22 17:08 sonichi

@sheikhartin I figure you and @sonichi already discussed this in your conversation, but any chance you could briefly explain here the benefit of adding these? Assuming a user installed flaml (and should therefore also have a compliant version of sklearn), any utilities provided by sklearn are already available to the user for import and usage.

ZviBaratz avatar Aug 07 '22 07:08 ZviBaratz

Very good question @ZviBaratz! But you should ask it like this: why did PyCaret 🥕 become so famous and popular? In short, because it literally automates the process!

I just wanted to tell the FLAML developers that the game is not over (if it is important to you), and simplicity does not necessarily mean weakness! I hope you understand that my purpose in making this contribution was not to get the logo of the Microsoft organization on my profile; I was thinking about the growth of this project...

Anyway, you can easily block this PR!

sheikhartin avatar Aug 07 '22 09:08 sheikhartin

> IMHO this addition does not improve existing functionality and mostly clutters up the module's namespace.

I'm sorry @ZviBaratz, I may have explained this a little badly. I didn't mean to offend you or other contributors! 🙏 You are right, the functionality does not improve. Maybe this PR was a mistake on my part, as I thought it would help attract more people...

sheikhartin avatar Aug 07 '22 11:08 sheikhartin

@sheikhartin no offence taken, and I hope it's clear I wasn't trying to be offensive either. I think the discussion regarding the differences between pycaret and flaml is somewhat out of scope, but in the context of this PR my opinion is that enabling users to import unmodified sklearn code from flaml will not constitute an attractive feature or promote the usage of flaml over other AutoML solutions in any way. In any case, the thought and the effort are appreciated 🙏

ZviBaratz avatar Aug 07 '22 12:08 ZviBaratz

Late to the convo, but I agree with @ZviBaratz here that this doesn't improve any existing functionality. The imported utils are not used anywhere throughout the project, so why is there a need to import them? I think it would be best for us to import only what we need.

For example, will there be a use for GridSearchCV, RandomizedSearchCV, ParameterGrid, ParameterSampler?

The idea of having a utils.py file might be worth something, but it doesn't seem necessary right now. Not sure.

int-chaos avatar Aug 08 '22 04:08 int-chaos