NiaAML
Data squashing
Data squashing would probably be a useful addition to the pipeline as a preprocessing method.
It is already implemented here: https://github.com/firefly-cpp/arm-preprocessing
What does the squashing operation do? I found that arm-preprocessing just calls https://github.com/firefly-cpp/NiaARM/blob/main/niaarm/preprocessing.py#L34
Can this be implemented as a FeatureTransformAlgorithm?
"Data squashing is a preprocessing method that enables construction of smaller datasets from the original ones and provides approximately the same results of data analysis as the original."
I just revisited the ticket.
Based on my understanding of the method, it fits into neither the feature_selection_algorithms nor the feature_transform_algorithms category.
I think a cleaner option would be to introduce a sample_selection or dataset_pruning component class with possible implementations:
- full / None -> use the whole dataset
- random(fraction) -> use a random fraction of the data
- squashing(threshold) -> your proposed method
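To make the idea concrete, here is a rough sketch of how the three options above could share one interface. Everything in it is an assumption for illustration: the names SampleSelection, Full, RandomFraction and GreedySquashing are hypothetical and not part of NiaAML. The squashing variant is a simplified greedy stand-in of my own, not the NiaARM implementation: it keeps one existing row per group of similar rows, using 1 / (1 + Euclidean distance) as an assumed similarity measure, whereas a real squashing method may construct new representative rows instead.

```python
from abc import ABC, abstractmethod

import numpy as np


class SampleSelection(ABC):
    """Hypothetical component class: decides which rows of the dataset to keep."""

    @abstractmethod
    def select(self, x: np.ndarray) -> np.ndarray:
        """Return the indices of the rows to keep."""


class Full(SampleSelection):
    """Use the whole dataset."""

    def select(self, x):
        return np.arange(len(x))


class RandomFraction(SampleSelection):
    """Use a random fraction of the rows."""

    def __init__(self, fraction, seed=None):
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fraction must be in (0, 1]")
        self.fraction = fraction
        self.rng = np.random.default_rng(seed)

    def select(self, x):
        # Keep at least one row, sampled without replacement.
        n_keep = max(1, int(round(self.fraction * len(x))))
        return np.sort(self.rng.choice(len(x), size=n_keep, replace=False))


class GreedySquashing(SampleSelection):
    """Simplified stand-in for squashing (NOT the NiaARM algorithm).

    A row is dropped when its similarity to an already kept row
    reaches the threshold; otherwise it is kept.
    """

    def __init__(self, threshold):
        self.threshold = threshold

    def select(self, x):
        kept = []
        for i, row in enumerate(x):
            if kept:
                # Assumed similarity measure: 1 / (1 + Euclidean distance).
                dist = np.linalg.norm(x[kept] - row, axis=1)
                if (1.0 / (1.0 + dist)).max() >= self.threshold:
                    continue
            kept.append(i)
        return np.array(kept, dtype=int)


x = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.05]])
full_idx = Full().select(x)
random_idx = RandomFraction(0.5, seed=0).select(x)
squashed_idx = GreedySquashing(0.8).select(x)
```

Returning row indices rather than a new array would let the pipeline apply the same selection to the labels, but that only works for methods that keep existing rows; a squashing method that builds aggregated rows would need a different return type.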
Optionally, one could also repurpose feature_transform_algorithms into a general preprocessing component class.
Either way, given that most users probably work with rather small datasets (larger ones are the exception in my experience) and the current run-times are acceptable, I think my time on this project is better spent on the other tickets.