Ambrosia
Implementation of basic PySpark data preprocessing methods
For preprocessing pandas data and speeding up experiments, we have the Preprocessor class and a number of single-purpose base classes in the preprocessing module.
These methods should also be implemented for Spark dataframes, following the same paradigm we already use for the Designer and the Splitter.
At the moment, implementing the following methods is essential:
- Aggregation
- Outliers removal (robust)
- CUPED
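As a reference point for the CUPED item, here is a minimal pure-Python sketch of the standard CUPED adjustment, y - theta * (x - mean(x)) with theta = cov(x, y) / var(x). The function name `cuped_transform` and its arguments are hypothetical and are not part of the Ambrosia API. A PySpark version would presumably compute the same statistics with aggregate functions such as `covar_samp` and `var_samp` from `pyspark.sql.functions` and then apply the adjustment as a column expression.

```python
from statistics import mean


def cuped_transform(target, covariate):
    """Return CUPED-adjusted target values.

    Hypothetical helper for illustration only, not Ambrosia's implementation.
    target    -- metric values observed during the experiment (y)
    covariate -- pre-experiment values of a correlated metric (x)
    """
    if len(target) != len(covariate) or len(target) < 2:
        raise ValueError("target and covariate need equal length >= 2")

    n = len(target)
    x_mean = mean(covariate)
    y_mean = mean(target)

    # Sample covariance of (x, y) and sample variance of x.
    cov_xy = sum(
        (x - x_mean) * (y - y_mean) for x, y in zip(covariate, target)
    ) / (n - 1)
    var_x = sum((x - x_mean) ** 2 for x in covariate) / (n - 1)

    # A constant covariate carries no information; leave the target as is.
    if var_x == 0:
        return list(target)

    theta = cov_xy / var_x
    # Subtract the part of y explained by x; the overall mean is preserved.
    return [y - theta * (x - x_mean) for x, y in zip(covariate, target)]


adjusted = cuped_transform([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

A small function like this could also act as a check for a Spark implementation: both should give the same adjusted values on a small sample.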
The architecture of the preprocessing classes added in #22 does not yet take into account the possibility of a PySpark implementation.