Allow duplicate objects in Pipeline and ColumnTransformer
Currently, neither a Pipeline nor a ColumnTransformer may contain two different steps with the same type of transformer. I think this should be allowed.
Consider a scenario where I have a dataset with numeric and categorical values (e.g. features 0 and 1, respectively) and wish to impute each with a different strategy. I would use the following code (with openml at the head of the develop branch):
import openml
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# Assume a dataset with feature 0 being numeric, and feature 1 being nominal
pipeline = Pipeline(
    [('preprocessing', ColumnTransformer(
        [('impute_numeric', SimpleImputer(strategy='mean'), [0]),
         ('impute_categorical', SimpleImputer(strategy='median'), [1])])),
     ('classifier', DecisionTreeClassifier())])
openml.flows.sklearn_to_flow(pipeline)
I would expect this to work, but it raises the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
rval = _serialize_model(o)
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 404, in _serialize_model
_extract_information_from_model(model)
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 512, in _extract_information_from_model
rval = sklearn_to_flow(v, model)
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
rval = [sklearn_to_flow(element, parent_model) for element in o]
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
rval = [sklearn_to_flow(element, parent_model) for element in o]
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in sklearn_to_flow
rval = [sklearn_to_flow(element, parent_model) for element in o]
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 50, in <listcomp>
rval = [sklearn_to_flow(element, parent_model) for element in o]
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 47, in sklearn_to_flow
rval = _serialize_model(o)
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 408, in _serialize_model
_check_multiple_occurence_of_component_in_flow(model, subcomponents)
File "D:\repositories\openml-python\openml\flows\sklearn_converter.py", line 490, in _check_multiple_occurence_of_component_in_flow
'trying to serialize %s.' % (visitee.name, model))
ValueError: Found a second occurence of component sklearn.impute.SimpleImputer when trying to serialize ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
transformer_weights=None,
transformers=[('impute_numeric', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
verbose=0), [0]), ('impute_categorical', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
strategy='median', verbose=0), [1])]).
Similarly, an error is raised if a pipeline contains two steps of the same type.
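For anyone who wants to catch this before calling the converter, here is a minimal sketch of a pre-check that looks for repeated component types in a list of named steps. The helper name `find_duplicate_component_types` and the stand-in classes are hypothetical and are not part of openml-python or scikit-learn; the actual check in `sklearn_converter.py` may work differently, and this sketch only inspects one level of nesting.

```python
from collections import Counter


def find_duplicate_component_types(steps):
    """Return the qualified class names that occur more than once in `steps`.

    `steps` is a list of tuples whose first element is the step name and whose
    second element is the component, as in Pipeline steps or
    ColumnTransformer transformers. Hypothetical helper, not an openml API.
    """
    counts = Counter(
        f"{type(step[1]).__module__}.{type(step[1]).__qualname__}"
        for step in steps
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Stand-in classes so the sketch runs without scikit-learn installed.
class SimpleImputer:
    def __init__(self, strategy="mean"):
        self.strategy = strategy


class DecisionTreeClassifier:
    pass


transformers = [
    ("impute_numeric", SimpleImputer(strategy="mean"), [0]),
    ("impute_categorical", SimpleImputer(strategy="median"), [1]),
]

# Both entries are SimpleImputer instances, so one duplicated type is reported.
duplicates = find_duplicate_component_types(transformers)
print(duplicates)
```

With the real classes, the same function should work on `pipeline.steps` or on the `transformers` list of a ColumnTransformer, since both store the name first and the component second.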
What is the reason this error is raised? Is it simply not yet supported? Or should I be ordering my workflow differently, and if so, how?
This is a limitation of the OpenML flow definition, which dates from the early days of OpenML (2012). There is currently no uniform way to specify which instance of a subflow a hyperparameter setting in a run belongs to, so having multiple instances of the same subflow in a complex flow does not allow for reproducible research.
Improving this on the server side has been on the agenda, but no one has started programming or testing alternatives.
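To illustrate the ambiguity, here is a toy model of the problem. It is not OpenML's actual data model; the dictionaries and function names are hypothetical. When hyperparameter settings are keyed by component class, two instances of the same class collide, whereas keying by step name keeps them apart.

```python
# Toy illustration only: plain dicts stand in for the components of a flow.
components = [
    {"step": "impute_numeric",
     "cls": "sklearn.impute.SimpleImputer",
     "params": {"strategy": "mean"}},
    {"step": "impute_categorical",
     "cls": "sklearn.impute.SimpleImputer",
     "params": {"strategy": "median"}},
]


def settings_by_class(components):
    """Key each setting by (class name, parameter name)."""
    settings = {}
    for component in components:
        for name, value in component["params"].items():
            # A second instance of the same class overwrites the first.
            settings[(component["cls"], name)] = value
    return settings


def settings_by_step(components):
    """Key each setting by (step name, parameter name)."""
    settings = {}
    for component in components:
        for name, value in component["params"].items():
            settings[(component["step"], name)] = value
    return settings


by_class = settings_by_class(components)
by_step = settings_by_step(components)

print(len(by_class))  # 1: the 'mean' setting was overwritten by 'median'
print(len(by_step))   # 2: both settings survive
```

In the class-keyed mapping there is no way to tell from a recorded run which imputer used which strategy, which is the reproducibility problem described above.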
Thanks, that clarifies a lot. Does it make sense to leave this issue open if it will go unresolved? Or should I close it, since 'we' on the package side cannot fix this until the definitions are updated?
I think closing and referencing the corresponding issue on the OpenML issue tracker is the way to go here: https://github.com/openml/OpenML/issues/340
Reopening to show that this is a known issue.
Marked it as wontfix because we won't (can't) fix this until we rework the flow definition.