NimbusML CV creates incorrect split of user defined transforms.

Open • pieths opened this issue 5 years ago • 1 comment

When split_start='after_transforms' is passed to CV.fit(), the user-defined transforms are not split correctly. See the graph created by the fit() call in the code below.

It seems that if a user-defined transform has presteps, the split is placed at the wrong location. This might also affect splitting the transforms when split_start is given as an integer value.

from nimbusml import DataSchema, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.ensemble import LightGbmRegressor
from nimbusml.model_selection import CV
from nimbusml.preprocessing.missing_values import Indicator, Handler

path = get_dataset("airquality").as_filepath()
schema = DataSchema.read_schema(path)
data = FileDataStream(path, schema)

pipeline_steps = [
    Indicator() << {
        'Ozone_ind': 'Ozone',
        'Solar_R_ind': 'Solar_R'},
    Handler(
        replace_with='Mean') << {
        'Solar_R': 'Solar_R',
        'Ozone': 'Ozone'},
    LightGbmRegressor(
        feature=['Ozone',
                 'Solar_R',
                 'Ozone_ind',
                 'Solar_R_ind',
                 'Temp'],
        label='Wind')]

cv_results = CV(pipeline_steps).fit(data, split_start='after_transforms')

Jan 11 '20 00:01 pieths

Commit d5c7c828ef820d681e2cf5e38568177200cb3b3c resolves the issue with split_start='after_transforms', but it does not fix the case where the user specifies an integer index as the split_start value.

When a transform has presteps, the integer index the user specified no longer corresponds to the index of that transform in the pipeline.
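To make the index mismatch concrete, here is a toy sketch in plain Python. It does not use NimbusML and is not how the library is implemented. The helper names (`expand_steps`, `node_index_for_user_index`) and the prestep labels are hypothetical, chosen only to show how an index into the user's step list can drift once presteps are inserted.

```python
from typing import Dict, List


def expand_steps(user_steps: List[str], presteps: Dict[str, List[str]]) -> List[str]:
    """Flatten the user's steps, placing each step's presteps right before it.

    Hypothetical model of pipeline expansion, for illustration only.
    """
    nodes: List[str] = []
    for step in user_steps:
        nodes.extend(presteps.get(step, []))
        nodes.append(step)
    return nodes


def node_index_for_user_index(
    user_steps: List[str], presteps: Dict[str, List[str]], user_index: int
) -> int:
    """Map an index into user_steps to the first expanded node of that step."""
    if user_index < 0:
        user_index += len(user_steps)
    if not 0 <= user_index < len(user_steps):
        raise IndexError("split index out of range")
    # Every earlier step contributes itself plus all of its presteps.
    return sum(len(presteps.get(s, [])) + 1 for s in user_steps[:user_index])


user_steps = ["Indicator", "Handler", "LightGbmRegressor"]
# Assumed for illustration: each transform gets one prestep.
presteps = {
    "Indicator": ["Indicator.prestep"],
    "Handler": ["Handler.prestep"],
}

nodes = expand_steps(user_steps, presteps)
requested_split = 1  # the user means "split before Handler"

# Using the user's index directly on the expanded list lands on the wrong node.
naive_node = nodes[requested_split]

# Translating the index first lands on Handler's own prestep.
corrected_index = node_index_for_user_index(user_steps, presteps, requested_split)
corrected_node = nodes[corrected_index]

print(nodes)
print(naive_node, corrected_index, corrected_node)
```

In this model, applying the requested index directly to the expanded list splits between `Indicator` and its prestep, while the translated index keeps each transform together with its presteps. Whether presteps should travel with their transform across the split is a design choice assumed here, not something confirmed by the commit.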

Jan 21 '20 21:01 pieths
