[FAQ] Should I use mambular.preprocessing.Preprocessor, and how?
Context There are currently no examples in the documentation showing how to use mambular.preprocessing.Preprocessor. Apparently the Preprocessor is applied inside the fit() function.
Describe the task you are trying to achieve. Manually set the preprocessing method for each column.
Describe the solution you'd like A minimal example.
Thanks for raising this. Below are some examples, but we will add better documentation in the next release. Please leave this issue open until then:
Simple example
Generally, the preprocessor follows the conventions of the sklearn preprocessing modules, i.e. it provides the methods fit and fit_transform, with a few minor exceptions.
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data  # Feature-only pd.DataFrame (keeps column names); .frame would also contain the target column
y = california_housing.target
from mambular.preprocessing import Preprocessor
prepro = Preprocessor(numerical_preprocessing="ple", n_bins=16, task="regression")
preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings
Note that, unlike the output of standard sklearn preprocessors, preprocessed_data is a dictionary. Its keys are built from the data type and the column name, e.g. "num_Latitude".
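To get a quick overview of such a dictionary, the keys can be split back into data type and column name. The following is only a sketch: the helper group_keys_by_dtype is hypothetical (not part of mambular), the example dictionary is a hand-made stand-in for the real output, and the "cat" prefix and the column name "some_column" are assumptions made for illustration.

```python
from collections import defaultdict


def group_keys_by_dtype(preprocessed):
    """Group the keys of a preprocessed-data dictionary by their data-type prefix."""
    groups = defaultdict(list)
    for key in preprocessed:
        # Split on the first underscore only, because column names
        # may themselves contain underscores.
        d_type, _, column = key.partition("_")
        groups[d_type].append(column)
    return dict(groups)


# Hand-made stand-in for the output of fit_transform (real values are arrays).
# "cat_some_column" is a hypothetical key used only for illustration.
example = {"num_Latitude": None, "num_MedInc": None, "cat_some_column": None}
print(group_keys_by_dtype(example))
```

With the real output you would pass preprocessed_data instead of example.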
Example with individually preprocessed columns
prepro = Preprocessor(
    numerical_preprocessing="minmax",  # default for numerical columns
    feature_preprocessing={"Longitude": "one-hot", "Latitude": "ple"},  # per-column overrides
    n_bins=16,
    task="regression",
)
preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings
assert preprocessed_data["num_Latitude"].shape == (X.shape[0], 16) # assert that Latitude was preprocessed using PLE
Note that column names are currently case-sensitive, so make sure to pass them exactly as they appear in the DataFrame.
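Because a misspelled or wrongly capitalized key is easy to overlook, it can help to check the keys of feature_preprocessing against the column names before fitting. This is a minimal sketch using only the standard library; find_unknown_columns is a hypothetical helper, not a mambular function.

```python
def find_unknown_columns(columns, feature_preprocessing):
    """Return keys that match no column exactly, mapped to a case-insensitive match (or None)."""
    known = set(columns)
    lowercase_lookup = {name.lower(): name for name in columns}
    problems = {}
    for name in feature_preprocessing:
        if name not in known:
            # Suggest the column that differs only in capitalization, if any.
            problems[name] = lowercase_lookup.get(name.lower())
    return problems


columns = ["MedInc", "Latitude", "Longitude"]  # e.g. list(X.columns)
print(find_unknown_columns(columns, {"longitude": "one-hot", "Latitude": "ple"}))
# {'longitude': 'Longitude'}
```

An empty result means every key matches a column name exactly.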
Get information
With PLE, the actual number of bins can be smaller than n_bins when the decision tree finds fewer bins, so it can be useful to inspect the resulting shapes and the chosen preprocessing steps. To get the information that is displayed when calling model.fit(), you can run the following:
prepro.get_feature_info()
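To see why the encoded width depends on the bins that were actually found, here is a simplified, clipped variant of piecewise linear encoding written with NumPy. It is not mambular's implementation: the bin edges are given by hand instead of being learned by a decision tree, and ple_encode is a hypothetical helper.

```python
import numpy as np


def ple_encode(values, edges):
    """Simplified piecewise linear encoding: one output column per bin.

    `edges` must be strictly increasing; the output has len(edges) - 1 columns.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    edges = np.asarray(edges, dtype=float)
    lower, upper = edges[:-1], edges[1:]
    # Bins fully below a value become 1, bins fully above become 0,
    # and the bin containing the value holds its relative position.
    return np.clip((x - lower) / (upper - lower), 0.0, 1.0)


values = [0.0, 1.5, 3.0]
encoded_3_bins = ple_encode(values, edges=[0.0, 1.0, 2.0, 3.0])
encoded_2_bins = ple_encode(values, edges=[0.0, 1.5, 3.0])
print(encoded_3_bins.shape)  # (3, 3)
print(encoded_2_bins.shape)  # (3, 2)
```

The same values produce a different number of columns depending on how many edges exist, which is why checking the reported shapes is safer than assuming n_bins columns.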
If you have other suggestions/ideas for improvements, feel free to comment/raise another issue.
To clarify: if you fit a model, you do not need to call the preprocessor manually; it is handled inside build_model(). The arguments from the examples above can be passed when initializing a model, e.g.:
from mambular.models import MambularRegressor
model = MambularRegressor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude": "one-hot", "Latitude": "ple"},
    n_bins=16,
)
model.fit(X, y)
Here the preprocessing is applied automatically, and there is no need to call the preprocessor explicitly.