Data squashing

Open firefly-cpp opened this issue 1 year ago • 3 comments

Data squashing would also be worth adding to the pipeline as a preprocessing method; it would probably be useful.

It is already implemented here: https://github.com/firefly-cpp/arm-preprocessing

firefly-cpp avatar Apr 06 '24 19:04 firefly-cpp

What does the squashing operation do? I found that arm-preprocessing just calls https://github.com/firefly-cpp/NiaARM/blob/main/niaarm/preprocessing.py#L34. Can this be implemented as a FeatureTransformAlgorithm?

LaurenzBeck avatar Apr 23 '24 10:04 LaurenzBeck

"Data squashing is a preprocessing method that enables construction of smaller datasets from the original ones and provides approximately the same results of data analysis as the original."

firefly-cpp avatar Apr 23 '24 11:04 firefly-cpp

I just revisited the ticket.

Based on my understanding of the method, it fits neither the feature_selection_algorithms nor the feature_transform_algorithms category. I think a cleaner option would be to introduce a sample_selection or dataset_pruning component class with possible implementations:

  • full / None -> use the whole dataset
  • random(fraction) -> use a random fraction of the data
  • squashing(threshold) -> your proposed method
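
A minimal sketch of what such a component class could look like, to make the three options concrete. Everything here is hypothetical: the `SampleSelection` base class, the `select(X, y)` method, and the class names are not part of NiaAML. The `Squashing` class is a simplified greedy stand-in (merge same-label rows whose similarity `1 / (1 + euclidean_distance)` to a group's first member reaches the threshold, then average each group), not the NiaARM implementation linked above.

```python
import math
import random
from abc import ABC, abstractmethod


class SampleSelection(ABC):
    """Hypothetical component: maps (X, y) to a possibly smaller (X, y)."""

    @abstractmethod
    def select(self, X, y):
        ...


class Full(SampleSelection):
    """Use the whole dataset unchanged."""

    def select(self, X, y):
        return X, y


class RandomFraction(SampleSelection):
    """Keep a random fraction of the rows, preserving their original order."""

    def __init__(self, fraction, seed=None):
        if not 0.0 < fraction <= 1.0:
            raise ValueError("fraction must be in (0, 1]")
        self.fraction = fraction
        self.rng = random.Random(seed)

    def select(self, X, y):
        n = max(1, int(round(self.fraction * len(X))))
        idx = sorted(self.rng.sample(range(len(X)), n))
        return [X[i] for i in idx], [y[i] for i in idx]


class Squashing(SampleSelection):
    """Simplified greedy squashing (an assumption, not NiaARM's algorithm)."""

    def __init__(self, threshold):
        self.threshold = threshold

    def select(self, X, y):
        groups = []  # each entry: (label, [row indices in this group])
        for i, row in enumerate(X):
            for label, members in groups:
                if label != y[i]:
                    continue  # never merge rows with different labels
                similarity = 1.0 / (1.0 + math.dist(row, X[members[0]]))
                if similarity >= self.threshold:
                    members.append(i)
                    break
            else:
                groups.append((y[i], [i]))  # no similar group: start a new one
        # Replace every group by the column-wise mean of its rows.
        X_out = [
            [sum(col) / len(col) for col in zip(*(X[i] for i in members))]
            for _, members in groups
        ]
        y_out = [label for label, _ in groups]
        return X_out, y_out
```

Usage would then be a matter of picking one implementation per pipeline, e.g. `Squashing(0.5).select(X, y)` before the feature selection and transform steps. Keeping labels separate during merging is a design choice of this sketch so that squashed rows stay usable for classification.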

Optionally, one could also repurpose feature_transform_algorithms into a general preprocessing component class.

Either way, given that most users probably work with rather small datasets (larger ones are the exception in my experience) and the current run-times are acceptable, I think my time on this project is better spent on the other tickets.

LaurenzBeck avatar Jun 11 '24 10:06 LaurenzBeck