
[FAQ]Should I use mambular.preprocessing.Preprocessor and how?

Open YuanfengZhang opened this issue 10 months ago • 2 comments

Context: There are currently no examples in the documentation showing how to use mambular.preprocessing.Preprocessor. Apparently the Preprocessor is applied in the fit() function.

Describe the task you are trying to achieve: Manually set the preprocessing method for each column.

Describe the solution you'd like: A minimal example.

YuanfengZhang, Mar 02 '25 08:03

Thanks for raising this. Below are some examples, but we will add better documentation in the next release. Please leave this issue open until then:

Simple example

Generally, the preprocessor follows the conventions of the sklearn preprocessing modules, i.e. it provides the methods fit and fit_transform, with a few minor exceptions.

from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data # Features only (.frame would also contain the target); pass a pd.DataFrame for column names
y = california_housing.target

from mambular.preprocessing import Preprocessor
prepro = Preprocessor(numerical_preprocessing="ple", n_bins=16, task="regression")

preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings

Note that, unlike the output of a standard sklearn preprocessor, preprocessed_data is a dictionary. Its keys are built as "d_type" + "column_name", e.g. "num_Latitude".
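To make the key layout concrete, here is a minimal sketch that groups such a dictionary by data type. The dictionary is a hand-written stand-in with made-up values, not real Preprocessor output. The helper group_by_dtype is hypothetical, and both the "cat" prefix and the single underscore joining prefix and column name are assumptions based on the "num_Latitude" pattern.

```python
from collections import defaultdict


def group_by_dtype(preprocessed):
    """Group keys of the form '<d_type>_<column_name>' by their d_type prefix."""
    groups = defaultdict(list)
    for key in preprocessed:
        # Split on the first underscore only, so column names that
        # themselves contain underscores stay intact.
        d_type, _, column = key.partition("_")
        groups[d_type].append(column)
    return dict(groups)


# Stand-in for the dictionary returned by fit_transform (values are made up).
example = {
    "num_Latitude": [[0.2], [0.7]],
    "num_Longitude": [[0.1], [0.9]],
    "cat_some_category": [[1], [0]],  # hypothetical categorical column
}

print(group_by_dtype(example))
```

With real output you would pass preprocessed_data instead of the stand-in dictionary.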

Example with individually preprocessed columns

prepro = Preprocessor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude":"one-hot", "Latitude":"ple"}, 
    n_bins=16, 
    task="regression",
    )

preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings

assert preprocessed_data["num_Latitude"].shape == (X.shape[0], 16) # assert that Latitude was preprocessed using PLE

Note that, in the current form, column names are case-sensitive, so make sure the keys in feature_preprocessing match the column names exactly.
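One way to guard against mismatched names is to check the keys before constructing the preprocessor. The sketch below is a plain-Python helper and not part of mambular; the name check_feature_keys and the choice of error type are assumptions made for illustration.

```python
def check_feature_keys(feature_preprocessing, columns):
    """Raise ValueError if a key does not exactly match a column name."""
    columns = list(columns)
    # Map lower-cased names back to the real ones to offer a hint.
    lowered = {c.lower(): c for c in columns}
    for key in feature_preprocessing:
        if key in columns:
            continue
        hint = lowered.get(key.lower())
        suffix = f" (did you mean {hint!r}?)" if hint else ""
        raise ValueError(f"unknown column {key!r}{suffix}")


columns = ["MedInc", "HouseAge", "Latitude", "Longitude"]

# Exact matches pass silently.
check_feature_keys({"Longitude": "one-hot", "Latitude": "ple"}, columns)

# A lower-case key is rejected, with a hint pointing at the real name.
try:
    check_feature_keys({"latitude": "ple"}, columns)
except ValueError as err:
    print(err)
```

With a DataFrame, X.columns can be passed as the second argument.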

Get information

With PLE, the number of bins actually used can be smaller than n_bins, because the decision tree that determines the bins may find fewer of them. Getting information about the resulting shapes and the chosen preprocessing steps can therefore be useful. To get the information that is displayed when calling model.fit(), you can run the following:


prepro.get_feature_info()
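As background for why the encoded width depends on the bins that were found: a piecewise linear encoding produces one output column per bin. The following is a minimal sketch of the textbook form of the encoding, clipped to [0, 1]. It is not mambular's implementation, and the function name ple_encode and the hand-picked bin edges are assumptions for illustration.

```python
def ple_encode(x, edges):
    """Piecewise linear encoding of a scalar x for sorted bin edges.

    With edges b_0 < b_1 < ... < b_T the result has T entries: bins
    entirely below x are 1, bins entirely above x are 0, and the bin
    containing x holds the fraction of that bin lying below x.
    """
    encoding = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        frac = (x - lo) / (hi - lo)
        encoding.append(min(1.0, max(0.0, frac)))  # clip to [0, 1]
    return encoding


# Three bins give three output columns.
print(ple_encode(15, [0, 10, 20, 30]))

# Fewer bin edges give a narrower encoding for the same value.
print(ple_encode(15, [0, 15, 30]))
```

If the tree finds fewer bins than requested, the encoded feature has correspondingly fewer columns, which is what the feature information reports.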

If you have other suggestions/ideas for improvements, feel free to comment/raise another issue.

AnFreTh, Mar 02 '25 09:03

And to clarify: if you fit any model, you do not need to call the preprocessor manually; it is handled inside the build_model() functionality. The arguments from the examples above can be passed when initializing a model, e.g.:

from mambular.models import MambularRegressor

model = MambularRegressor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude":"one-hot", "Latitude":"ple"}, 
    n_bins=16
)

model.fit(X, y)

Here the preprocessing is applied automatically and there is no need to call the preprocessor explicitly.

AnFreTh, Mar 02 '25 09:03