Port pre-processing algorithms to sklearn-compatible API

Open hoffmansc opened this issue 5 years ago • 10 comments

  • [ ] DisparateImpactRemover
  • [x] LearnedFairRepresentation
  • [ ] OptimizedPreprocessing
  • [X] Reweighing

hoffmansc avatar Feb 24 '20 16:02 hoffmansc

Will this make it easier/unnecessary to convert back and forth between Pandas DataFrames and Binary Label Datasets, for example? I've been having issues with Reweighing, as AIF360 tends to only work with numerical data but does not provide instructions for dummification while storing the metadata mappings to de-dummify later on.

InterferencePattern avatar Mar 06 '20 23:03 InterferencePattern

Yes, this will allow DataFrames to be used directly with the algorithms. Reweighing is already implemented so you can try it out if you're comfortable using the master branch from GitHub. It should be released in the latest stable version soon as well.

Do you mind explaining exactly what issues you were facing? Was it with convert_to_dataframe()?

hoffmansc avatar Mar 07 '20 16:03 hoffmansc

Hi @hoffmansc, yes, I've been having trouble understanding how to use convert_to_dataframe() after creating my own BinaryLabelDataset. Perhaps it's my own fault, but I can't find the documentation that describes how to dummify the data in a way that retains the mappings so it can be reversed after using a PreProcessing tool such as Reweighing.

InterferencePattern avatar Mar 09 '20 16:03 InterferencePattern

@jimbudarz, if you encode your categorical data with pd.get_dummies(), or use StandardDataset, you will end up with feature_names that look like, e.g., [..., native-country=United-States, native-country=Vietnam, native-country=Yugoslavia, ...]. Then, if you do convert_to_dataframe(de_dummy_code=True), you will get a DataFrame that looks something like:

  ... native-country
0 ... United-States
1 ... United-States
2 ... Vietnam
...

with the columns magically mapped from one-hot to categories.
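
To make the idea concrete, here is a minimal pure-Python sketch of what reversing one-hot columns by separator amounts to. It is only an illustration of the naming convention, not AIF360's actual implementation; the helper name `de_dummy` and the sample rows are made up for this example.

```python
def de_dummy(rows, columns, sep="="):
    """Collapse one-hot columns named '<feature><sep><category>' into a
    single categorical entry per feature (hypothetical helper, illustration only)."""
    decoded = []
    for row in rows:
        record = {}
        for name, value in zip(columns, row):
            if sep in name:
                # Split only on the first separator so categories may contain it.
                feature, category = name.split(sep, 1)
                if value == 1:
                    record[feature] = category
            else:
                # Columns without the separator are passed through unchanged.
                record[name] = value
        decoded.append(record)
    return decoded


columns = ["age", "native-country=United-States", "native-country=Vietnam"]
rows = [[39, 1, 0], [50, 1, 0], [28, 0, 1]]
decoded = de_dummy(rows, columns)
print(decoded)
```

Note that a column whose name does not contain the separator is left alone, so one-hot columns named with a different separator would not be collapsed by this sketch.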

You can also include maps for the labels and protected attributes manually (since these are encoded differently) by supplying them when creating the BinaryLabelDataset (note: protected_attribute_maps should be in the same order as protected_attribute_names):

metadata = {
    'label_maps': [{1.0: '>50K', 0.0: '<=50K'}],
    'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-white'},
                                 {1.0: 'Male', 0.0: 'Female'}]
}
BinaryLabelDataset(..., metadata=metadata)

otherwise they will just be 0/1, which is probably also fine.

hoffmansc avatar Mar 09 '20 19:03 hoffmansc

Thanks for the help; this led me to a resolution: pandas' get_dummies() uses the separator prefix_sep="_" by default, whereas convert_to_dataframe() uses sep="=" by default, so the two have to be made to match.
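
For anyone hitting the same thing, here is a small sketch of the mismatch using only pandas. The toy DataFrame is made up for illustration; the only point is the column names that get_dummies() produces with and without prefix_sep="=".

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({
    "age": [39, 28],
    "native-country": ["United-States", "Vietnam"],
})

# Default separator is "_", which does not match the "=" convention above.
default_cols = list(pd.get_dummies(df, columns=["native-country"]).columns)

# Passing prefix_sep="=" yields names like "native-country=Vietnam".
matching_cols = list(
    pd.get_dummies(df, columns=["native-country"], prefix_sep="=").columns
)

print(default_cols)
print(matching_cols)
```

The other direction should presumably work too, i.e. keeping the pandas default and passing sep="_" to convert_to_dataframe(), but I have not checked how that behaves when feature names themselves contain underscores.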

It might be helpful to explain what the sep attribute does in the https://aif360.readthedocs.io/en/latest/modules/datasets.html documentation.

InterferencePattern avatar Mar 09 '20 21:03 InterferencePattern

> It might be helpful to explain what the sep attribute does in the https://aif360.readthedocs.io/en/latest/modules/datasets.html documentation.

That's a good point. Would you be willing to write a quick PR to that effect?

hoffmansc avatar Mar 09 '20 22:03 hoffmansc

I've gladly submitted a PR.

It looks like reversing dummy-encoding could soon become a part of pandas itself, which AIF360 may be able to leverage for scikit-learn compatibility: https://github.com/pandas-dev/pandas/pull/31795

InterferencePattern avatar Mar 09 '20 23:03 InterferencePattern

Is convert_to_dataframe() supposed to return the original DataFrame? I am using Reweighing and get back a BinaryLabelDataset, which I would like to convert back to a DataFrame (with the weights applied).

razvanh avatar Jul 25 '21 15:07 razvanh

convert_to_dataframe() seems to return a tuple for me, which doesn't seem right. The docs for the AIF360 Adult dataset state that this method should:

Convert the StructuredDataset to a pandas.DataFrame.

However, it doesn't appear to do so.

from aif360.datasets import AdultDataset
ad = AdultDataset(
    protected_attribute_names=['sex'],
    privileged_classes=[['Male']],
    categorical_features=[],
    features_to_keep=['age', 'education-num']
)
df = ad.convert_to_dataframe()
print(type(df))
# <class 'tuple'>

theBull avatar Mar 06 '22 23:03 theBull

Ah. convert_to_dataframe() returns two values (a tuple), as such:

dataframe, dictionary = dataset.convert_to_dataframe()

print(type(dataframe))
print(type(dictionary))
#<class 'pandas.core.frame.DataFrame'>
#<class 'dict'>

Silly mistake. I forgot that when you return multiple values in Python, it returns them as a tuple. I'm still new to the language. HTH.

theBull avatar Mar 06 '22 23:03 theBull