AIF360
Port pre-processing algorithms to sklearn-compatible API
- [ ] DisparateImpactRemover
- [x] LearnedFairRepresentation
- [ ] OptimizedPreprocessing
- [x] Reweighing
Will this make it easier/unnecessary to convert back and forth between Pandas DataFrames and Binary Label Datasets, for example? I've been having issues with Reweighing, as AIF360 tends to only work with numerical data but does not provide instructions for dummification while storing the metadata mappings to de-dummify later on.
Yes, this will allow DataFrames to be used directly with the algorithms. Reweighing is already implemented, so you can try it out if you're comfortable using the master branch from GitHub. It should be included in the next stable release soon as well.
Do you mind explaining exactly what issues you were facing? Was it with convert_to_dataframe()?
Hi @hoffmansc, yes, I've been having trouble understanding how to use convert_to_dataframe() after creating my own BinaryLabelDataset. Perhaps it's my own fault, but I can't find the documentation that describes how to dummify the data in a way that retains the mappings so it can be reversed after using a PreProcessing tool such as Reweighing.
@jimbudarz, if you encode your categorical data with pd.get_dummies(), or use StandardDataset, you will end up with feature_names that look like, e.g., [..., native-country=United-States, native-country=Vietnam, native-country=Yugoslavia, ...]. Then, if you do convert_to_dataframe(de_dummy_code=True), you will get a DataFrame that looks something like:
... native-country
0 ... United-States
1 ... United-States
2 ... Vietnam
...
with the columns magically mapped from one-hot to categories.
You can also include maps for the labels and protected attributes manually (since these are encoded differently) by supplying them when creating the BinaryLabelDataset (note: protected_attribute_maps should be in the same order as protected_attribute_names):
metadata = {
    'label_maps': [{1.0: '>50K', 0.0: '<=50K'}],
    'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-white'},
                                 {1.0: 'Male', 0.0: 'Female'}]
}
BinaryLabelDataset(..., metadata=metadata)
Otherwise, they will just be 0/1, which is probably also fine.
Thanks for the help; this led me to a resolution: pandas' get_dummies() uses the separator prefix_sep="_" by default, while convert_to_dataframe() uses sep="=" by default.
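The separator mismatch described above can be reproduced with pandas alone. The following is a minimal sketch, assuming a small made-up DataFrame (the column values are illustrative, not taken from the Adult dataset); it only shows how prefix_sep changes the generated column names, not anything AIF360 does internally.

```python
import pandas as pd

# Illustrative data only; not the real Adult dataset.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "native-country": ["United-States", "United-States", "Vietnam"],
})

# With the default prefix_sep="_", dummy columns are named "<column>_<category>".
default_cols = list(pd.get_dummies(df).columns)

# With prefix_sep="=", dummy columns are named "<column>=<category>",
# which is the naming pattern shown earlier in this thread.
encoded = pd.get_dummies(df, prefix_sep="=")
matching_cols = list(encoded.columns)

print(default_cols)
print(matching_cols)
```

Passing prefix_sep="=" at encoding time is one option; the other is to pass a matching sep to convert_to_dataframe().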
It might be helpful to explain what the sep argument does in the documentation: https://aif360.readthedocs.io/en/latest/modules/datasets.html
That's a good point. Would you be willing to write a quick PR to that effect?
I've gladly submitted a PR.
It looks like reversing dummy-encoding could soon become a part of pandas itself, which AIF360 may be able to leverage for scikit-learn compatibility: https://github.com/pandas-dev/pandas/pull/31795
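For reference, pandas 1.5 and later provide pd.from_dummies(), which reverses a one-hot encoding. Below is a minimal round-trip sketch, assuming pandas >= 1.5 and a made-up single-column DataFrame; it is not tied to the pull request linked above or to how AIF360 handles de-dummy coding.

```python
import pandas as pd

# Illustrative data only.
original = pd.DataFrame(
    {"native-country": ["United-States", "United-States", "Vietnam"]}
)

# One-hot encode with "=" as the separator between column name and category.
encoded = pd.get_dummies(original, prefix_sep="=")

# Reverse the encoding; sep must match the separator used when encoding.
decoded = pd.from_dummies(encoded, sep="=")

print(decoded["native-country"].tolist())
```

Note that pd.from_dummies() expects only dummy columns, so any numeric features would need to be set aside first and re-joined afterwards.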
Is convert_to_dataframe() supposed to return the original DataFrame? I am using Reweighing and get back a BinaryLabelDataset, which I would like to convert back to a DataFrame (with the weights applied).
convert_to_dataframe() seems to return a tuple for me, which doesn't seem right. The docs for the AIF360 Adult dataset state that this method should:
Convert the StructuredDataset to a pandas.DataFrame.
However, it doesn't appear to do so.
from aif360.datasets import AdultDataset
ad = AdultDataset(
    protected_attribute_names=['sex'],
    privileged_classes=[['Male']],
    categorical_features=[],
    features_to_keep=['age', 'education-num']
)
df = ad.convert_to_dataframe()
print(type(df))
# <class 'tuple'>
Ah. convert_to_dataframe() returns two values (a tuple), like so:
dataframe, dictionary = dataset.convert_to_dataframe()
print(type(dataframe))
print(type(dictionary))
# <class 'pandas.core.frame.DataFrame'>
# <class 'dict'>
Silly mistake. I forgot that when you return multiple values in Python, they come back as a tuple. I'm still new to the language. HTH.
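For anyone who wants the weights alongside the data, here is a rough sketch of one way to do it. It uses a stand-in function instead of a real dataset so that it runs without AIF360 installed, and it assumes that the returned dictionary has an "instance_weights" entry with one weight per row; check the keys of your own dictionary before relying on that. The function name and values are hypothetical.

```python
import pandas as pd

def convert_to_dataframe_stub():
    """Hypothetical stand-in for dataset.convert_to_dataframe()."""
    df = pd.DataFrame({"age": [25, 38, 52], "sex": [1.0, 0.0, 1.0]})
    # Assumption: the dictionary carries per-row weights under this key.
    attributes = {"instance_weights": [0.8, 1.3, 0.8]}
    return df, attributes

# Unpack the tuple into its two parts.
df, attributes = convert_to_dataframe_stub()

# Attach the weights as an extra column, aligned by row order.
weighted = df.assign(instance_weight=attributes["instance_weights"])

print(weighted)
```

The weights are kept as a separate column rather than multiplied into the features, since most estimators expect them as a sample_weight argument.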