
Pipeline proposal

Open xdssio opened this issue 3 years ago • 1 comments

This is an implementation of a new Pipeline which wraps a few standard solution needs together with the vaex state.

General idea:
Any transformation you apply to the dataframe, as long as you start from the "raw" data you will use in production, is saved, so you can use the same infrastructure to solve all of these problems.

  • Keep an example of the data.
  • fit, transform, fit_transform for the sklearn API.
  • An inference function that outputs what you would need on a server.
    • figure out and handle missing values, missing columns, and extra columns.
    • Never filter the data.
  • Read many inputs such as bytes, JSON, dict, list, and so on.
  • save/load
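The column-handling behaviour described in the list above could be sketched roughly as follows. This is a minimal pure-Python sketch; the helper name align_record and the stored example record are hypothetical illustrations, not part of vaex or of this PR:

```python
def align_record(record, example, fill_values):
    """Align an incoming record to the columns of a stored example record.

    - Extra columns (not seen at fit time) are dropped.
    - Missing columns and missing values are filled from per-column fill values.
    - The data is never filtered: every input row yields an output row.
    """
    aligned = {}
    for column in example:
        if column in record and record[column] is not None:
            aligned[column] = record[column]
        else:
            aligned[column] = fill_values.get(column)  # impute the missing value
    return aligned


# The pipeline remembers one example row and per-column fill values.
example = {"sepal_length": 5.1, "petal_length": 1.4}
fill_values = {"sepal_length": 5.8, "petal_length": 3.7}

# Incoming data with a missing column and an extra, unknown column:
print(align_record({"sepal_length": 6.0, "species": "setosa"}, example, fill_values))
# {'sepal_length': 6.0, 'petal_length': 3.7}
```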

An Imputer transformer, which is used by default in the pipeline but can be used in many ways by providing a strategy.
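To illustrate what providing a strategy might mean, here is a minimal pure-Python sketch; the name compute_fill_values and the strategy keywords are assumptions for illustration, not the PR's actual Imputer API:

```python
from collections import Counter


def compute_fill_values(columns, strategy="mean"):
    """Compute one fill value per column according to a strategy.

    columns: dict mapping column name -> list of values (None = missing).
    strategy: 'mean' for numeric columns, 'common' for the most frequent value.
    """
    fill = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if strategy == "mean":
            fill[name] = sum(present) / len(present)
        elif strategy == "common":
            fill[name] = Counter(present).most_common(1)[0][0]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return fill


print(compute_fill_values({"petal_width": [0.2, None, 1.3, 0.1]}, strategy="mean"))
print(compute_fill_values({"color": ["red", "red", None, "blue"]}, strategy="common"))
```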

Added to the Dataframe:

  • to_records implementation for easy testing (the use of records is very common in the industry)
    • [{key:value,key:value,...},...]
  • countna: counts all missing values in the dataframe
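The semantics of these two helpers could be sketched like this (a pure-Python sketch over a dict-of-lists; the real implementations operate on a vaex DataFrame):

```python
def to_records(columns):
    """Convert a column-oriented mapping into a list of records:
    [{key: value, key: value, ...}, ...]
    """
    names = list(columns)
    n = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(n)]


def countna(columns):
    """Count all missing (None) values across the whole dataframe."""
    return sum(value is None for values in columns.values() for value in values)


data = {"sepal_length": [5.1, None], "petal_length": [1.4, 3.7]}
print(to_records(data))  # one dict per row
print(countna(data))     # 1
```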

Missing

  • predict to complete the sklearn API.
  • partial_fit if possible.
  • show: a way to view all the steps.
  • An API to manipulate steps.
  • A way to "restart" a state in case your "raw" data in production is different from the raw data you started with.

Examples:

train, test = vaex.ml.datasets.load_iris().split_random(0.8)
train['average_length'] = (train['sepal_length'] + train['petal_length']) / 2
booster = LightGBMModel(features=['average_length', 'petal_width'], target='class_')
booster.fit(train)
train = booster.transform(train)

pipeline = Pipeline.from_dataframe(train)
assert "lightgbm_prediction" in pipeline.inference(test)

With fit:

def fit(df):
  df['average_length'] = (df['sepal_length'] + df['petal_length']) / 2
  booster = LightGBMModel(features=['average_length', 'petal_width'], target='class_')
  booster.fit(df)
  df = booster.transform(df)
  return df

train, test = vaex.ml.datasets.load_iris().split_random(0.8)

pipeline = Pipeline.from_dataframe(train, fit=fit)
pipeline.fit(train)
pipeline.inference(test) # predictions

xdssio avatar Apr 29 '21 09:04 xdssio

Ok, I've taken over the development here (temporarily?) since I like this and want to push it forward! It is a fairly big PR so I'm doing this in steps.

So I have refactored the Imputer and improved the test a bit. It was already in good shape. The (rough) changelog:

  • Added docstrings to all methods, and a class docstring with an example
  • Fixed missing imports
  • Changed some variable and function/method names to make it a bit more understandable and easier to support going forward
  • Removed some of the methods that were too low level, making it easier (I hope) to follow the code.
  • Added an additional test for the state transfer, currently failing (see explanation below).
  • Improved the tests -> testing against fixed values (just in case).

There is one problem, as follows: when doing .transform(df_test), if df_test is missing a column that needs to be imputed by the Imputer, the current behaviour is that that column will be initialized with a constant value (the value being the fill value for that column).

This is currently not possible when doing state_transfer. I wonder if a fix here is possible; it would be quite nice to have this feature (somewhat specifically tied to the Imputer). I wonder if @maartenbreddels can think of something :)
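To make the described behaviour concrete, here is a pure-Python sketch of what .transform does when a fitted column is absent from df_test (hypothetical names and a dict-of-lists stand-in for the dataframe; not the actual vaex code):

```python
def transform(df_test, fitted_columns, fill_values):
    """If a column the Imputer was fitted on is absent from df_test,
    initialize it as a constant column of that column's fill value.
    """
    n_rows = len(next(iter(df_test.values())))
    out = dict(df_test)
    for column in fitted_columns:
        if column not in out:
            out[column] = [fill_values[column]] * n_rows  # constant column
    return out


df_test = {"sepal_length": [5.0, 6.1]}  # 'petal_width' is missing
out = transform(
    df_test,
    fitted_columns=["sepal_length", "petal_width"],
    fill_values={"petal_width": 1.2},
)
print(out["petal_width"])  # [1.2, 1.2]
```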

Also, inspired by @xdssio, I moved the __repr__ method to the base Transformer class; it now shows the name of the Transformer class and its arguments with their values.

JovanVeljanoski avatar Jul 29 '21 14:07 JovanVeljanoski