NiaAML
Support for regression tasks and feature selection
Thank you very much for your hard work on creating such a good Python package as NiaAML. Could you please consider supporting regression tasks and feature selection in NiaAML?
I use remote sensing data (satellite imagery) to retrieve a biophysical parameter (blue carbon ;-)) through machine learning regression, and I need to select the most contributing features from a suite of input features.
Many thanks, Thang
@hanamthang: @LaurenzBeck is now working on regression.
Some things I identified to implement the new feature:
✅ Checklist
- [x] understand core APIs and concepts for the classification tasks
- [x] understand data handling in NiaAML
- [x] 🔍 research on regression tasks + decide on libraries to use
- [x] 📊 implement data handling for continuous targets
- [x] 🧑‍💻 implement new regression components
- [x] 🧪 test implementation
- [x] 📝 document implementation
I just had my first read through the README, the documentation and the tests.
🔥 Problem
Imho, the codebase has rather high coupling to the classification task specifics and not the highest cohesion. There are two options for implementing the regression feature, and I want to get feedback on which path to take. For context, this is the current classification-only API:
```python
pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    classifiers=['AdaBoost', 'Bagging', 'MultiLayerPerceptron', 'RandomForest', 'ExtremelyRandomizedTrees', 'LinearSVC'],
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
```
🔧 Options
- Add new abstractions and hierarchies to better differentiate between different tasks semantically.
  This option entails adding classes like `Estimator` or `Predictor` and specific child classes like `Classifier` and `Regressor`. We would also need new hierarchies for the tasks and metrics (a rough sketch follows the example below).
  - ✅ more "correct"; we do not have to write something like `classifier="SVMRegressor"`
  - ❌ given the high coupling and low cohesion, adding the additional abstractions presents a lot of additional effort (for just being more correct)
  - ❌ this presents a BREAKING change, which requires a major release and extensive documentation, and even then users might be frustrated if their pipelines fail just because they updated their package

  The new API could then look like this:
```python
pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    estimators=['SVMRegressor', 'LinearRegressor'],
    task="regression",
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
```
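For illustration, a minimal sketch of what option 1's hierarchy could look like. All class and method names here are hypothetical and are not NiaAML's actual API:

```python
from abc import ABC, abstractmethod

class Estimator(ABC):
    """Hypothetical common base class for all predictive components."""

    @abstractmethod
    def set_parameters(self, **kwargs):
        """Set the hyperparameters of the wrapped model."""

    @abstractmethod
    def fit(self, x, y):
        """Fit the wrapped model to the training data."""

    @abstractmethod
    def predict(self, x):
        """Predict targets for the given samples."""

class Classifier(Estimator):
    """Estimators with discrete targets (today's components)."""

class Regressor(Estimator):
    """Estimators with continuous targets (the new components)."""
```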
- Add the new components in a less invasive way alongside the existing components.
  This option is way less invasive, as it merely adds components; we do not need to change the API in major ways (a rough component sketch follows the example below).
  - ✅ potentially doable as a minor release, so users can update NiaAML safely without breaking their classification pipelines
  - ✅ documentation does not have to be changed dramatically
  - ✅ easier to test and make sure that the new components/implementations don't interfere with the existing implementations
  - ❌ in some places, we have to accept semantic inconsistencies like:
```python
pipeline_optimizer = PipelineOptimizer(
    data=data_reader,
    classifiers=['SVMRegressor', 'LinearRegressor'],
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile', 'ParticleSwarmOptimization', 'VarianceThreshold'],
    feature_transform_algorithms=['Normalizer', 'StandardScaler']
)
```
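To make option 2 concrete, here is a rough sketch of how a regressor could be slotted in as a "classifier" component, assuming the existing components wrap scikit-learn models behind `set_parameters`/`fit`/`predict` methods. The class below is hypothetical; the real NiaAML base class and parameter declaration may differ:

```python
from sklearn.svm import SVR

class SVMRegressor:
    """Hypothetical component mirroring the classifier-component pattern."""

    def __init__(self):
        self._model = SVR()

    def set_parameters(self, **kwargs):
        # Forward tuned hyperparameters (e.g. C, epsilon) to scikit-learn.
        self._model.set_params(**kwargs)

    def fit(self, x, y):
        self._model.fit(x, y)

    def predict(self, x):
        return self._model.predict(x)
```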
⚡️ Rationale
Given that regression was not the main task when the package was designed, that the semantic inconsistencies of sticking to the classifier wording are not that critical, and that the scope of my project is quite limited, I have a slight preference for option 2.
What do you say @firefly-cpp ?
Thanks. I totally support the second option.
The coupling to the classification specifics is higher than I originally thought. I also have to adapt the feature selection and pipeline optimization parts, since those currently assume that fitness functions deliver values between 0 and 1 (see the sketch below for one way to get regression metrics into that range). This will take me some time, sorry for the slow progress on this ticket.
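One way to satisfy that assumption, sketched here purely as an illustration (the squashing function is my own choice, not anything NiaAML prescribes): map an unbounded error metric such as MSE into (0, 1] so the existing optimizers can consume it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def bounded_regression_fitness(y_true, y_pred):
    """Map MSE from [0, inf) into (0, 1], where 1.0 means a perfect fit.

    Any monotone map into [0, 1] would satisfy the optimizers' assumption;
    this one is just a simple example.
    """
    mse = mean_squared_error(y_true, y_pred)
    return 1.0 / (1.0 + mse)

# A perfect prediction scores 1.0; larger errors approach 0.0.
print(bounded_regression_fitness(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 1.0
```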
I need some help in understanding https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L468 - https://github.com/firefly-cpp/NiaAML/blob/master/niaaml/pipeline.py#L491
```python
for i in params_all:
    args = dict()
    for key in i[0]:
        if i[0][key] is not None:
            if isinstance(i[0][key].value, MinMax):
                # Continuous parameter: scale the [0, 1] solution-vector
                # component into the parameter's MinMax range.
                val = (
                    solution_vector[solution_index] * i[0][key].value.max
                    + i[0][key].value.min
                )
                if (
                    i[0][key].param_type is np.intc
                    or i[0][key].param_type is int
                    or i[0][key].param_type is np.uintc
                    or i[0][key].param_type is np.uint
                ):
                    # Integer parameter: floor the scaled value and clamp
                    # it below the exclusive maximum.
                    val = i[0][key].param_type(np.floor(val))
                    if val >= i[0][key].value.max:
                        val = i[0][key].value.max - 1
                args[key] = val
            else:
                # Categorical parameter: bin the [0, 1] component to pick
                # one of the allowed values.
                args[key] = i[0][key].value[
                    get_bin_index(
                        solution_vector[solution_index], len(i[0][key].value)
                    )
                ]
            solution_index += 1
    if i[1] is not None:
        i[1].set_parameters(**args)
```
-> this seems to be some custom (unfortunately undocumented) preprocessing of the parameter configurations.
I do understand the need to call `component.set_parameters(**args)` in the framework, but I do not understand:
- the connection/need/purpose of the `solution_vector`
- the logic behind the preprocessing for the `MinMax` case
- the need for the `get_bin_index` function
Could you help me with some clarifications, @firefly-cpp? 🙏
My first intuition was to remove the params preprocessing part, but that is dangerous as long as I do not understand those parts...
I found the explanation here: https://niaaml.readthedocs.io/en/latest/getting_started.html#optimization-process-and-parameter-tuning
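As I read the linked section, each component of the solution vector is a float in [0, 1] that the optimizer tunes, and the snippet above maps it either onto a continuous `MinMax` range or onto an index into a list of categorical values. A standalone illustration (the `bin_index` helper below imitates, and is not, NiaAML's `get_bin_index`):

```python
def bin_index(x, number_of_bins):
    # Imitation of get_bin_index: split [0, 1] into equal-width bins and
    # return the bin that x falls into (clamped to the last bin).
    return min(int(x * number_of_bins), number_of_bins - 1)

# Continuous hyperparameter with range MinMax(min=0.0, max=10.0):
x = 0.42              # one solution-vector component, always in [0, 1]
val = x * 10.0 + 0.0  # -> 4.2, scaled into the parameter's range

# Categorical hyperparameter with three allowed kernels:
kernels = ["linear", "poly", "rbf"]
kernel = kernels[bin_index(x, len(kernels))]  # 0.42 -> bin 1 -> "poly"
```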
Thanks, @LaurenzBeck, for all the hard work.
As I stated before, documentation is not in the best shape, and thus, it should be modified and updated as soon as possible.