NimbusML
CV creates incorrect split of user defined transforms.
When specifying split_start='after_transforms' in CV.fit(), the user defined transforms are not split up correctly. See the graph created by the fit() call in the code below.
It seems that if a user defined transform has presteps, the split location ends up in the wrong place. This might also affect splitting the transforms when an integer value is given.
from nimbusml import DataSchema, FileDataStream
from nimbusml.datasets import get_dataset
from nimbusml.ensemble import LightGbmRegressor
from nimbusml.model_selection import CV
from nimbusml.preprocessing.missing_values import Indicator, Handler

path = get_dataset("airquality").as_filepath()
schema = DataSchema.read_schema(path)
data = FileDataStream(path, schema)

pipeline_steps = [
    Indicator() << {
        'Ozone_ind': 'Ozone',
        'Solar_R_ind': 'Solar_R'},
    Handler(
        replace_with='Mean') << {
        'Solar_R': 'Solar_R',
        'Ozone': 'Ozone'},
    LightGbmRegressor(
        feature=['Ozone',
                 'Solar_R',
                 'Ozone_ind',
                 'Solar_R_ind',
                 'Temp'],
        label='Wind')]

cv_results = CV(pipeline_steps).fit(data, split_start='after_transforms')
Commit d5c7c828ef820d681e2cf5e38568177200cb3b3c resolves the issue with split_start='after_transforms', but it does not fix the issue when the user specifies an integer index as the split_start value.
When a transform has presteps, the integer index the user specified no longer corresponds to the index of that transform in the pipeline.
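To illustrate the index mismatch, here is a minimal pure-Python sketch. It is not NimbusML's internal code: the expand() and expanded_split_index() helpers, the node names, and the prestep counts are all hypothetical, and it assumes that an integer split_start of 1 is meant to split just before the second user defined step.

```python
# Hypothetical model of the problem, not NimbusML internals.
# Each user defined step is (name, number_of_presteps); the counts are assumed.
from typing import List, Tuple

Step = Tuple[str, int]


def expand(user_steps: List[Step]) -> List[str]:
    """Flatten user steps into the node list that would actually run."""
    nodes = []
    for name, n_presteps in user_steps:
        # Presteps are inserted immediately before the step that owns them.
        nodes.extend(f"{name}.prestep{i}" for i in range(n_presteps))
        nodes.append(name)
    return nodes


def expanded_split_index(user_steps: List[Step], split_start: int) -> int:
    """Map an index over user steps to an index over the expanded nodes."""
    # Every step before the split contributes itself plus its presteps.
    return sum(n_presteps + 1 for _, n_presteps in user_steps[:split_start])


user_steps = [("Indicator", 1), ("Handler", 1), ("LightGbmRegressor", 0)]
nodes = expand(user_steps)

naive = 1  # the user's index applied directly to the expanded list
mapped = expanded_split_index(user_steps, 1)

print(nodes[naive])   # lands inside Indicator's own nodes
print(nodes[mapped])  # lands at the start of Handler's nodes
```

In this model, applying the user's index directly splits between Indicator's prestep and Indicator itself, while the mapped index splits before Handler's prestep, which is presumably what the user intended.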