FPGrowth/FPMax and Association Rules with the existence of missing values (#1004)

Open zazass8 opened this issue 1 year ago • 2 comments

Description

This PR extends the FP-Growth and FP-Max algorithms in frequent_patterns so that they can handle missing values in the input dataset. The structure and logic of the algorithms are unchanged; only the support computation differs, in that missing values are ignored rather than counted as absent items. This gives a more realistic indication of how frequently the generated items/itemsets actually occur. The association rules derived from the algorithm's output, and their metrics, are likewise updated to take missing values into account.
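To illustrate what "ignoring" missing values means for support, here is a minimal sketch in plain pandas. It is not the code from this PR: the helper name `support_ignoring_missing` and the toy data are hypothetical, and it assumes a row is left out of the denominator whenever any item of the itemset is NaN in that row.

```python
import numpy as np
import pandas as pd

# Toy transactions: 1 = item present, 0 = absent, NaN = unknown.
df = pd.DataFrame(
    {
        "milk": [1, 1, np.nan, 0],
        "bread": [1, np.nan, 1, 1],
        "eggs": [0, 1, 1, np.nan],
    }
)


def support_ignoring_missing(data, itemset):
    """Hypothetical helper: support of `itemset` over the rows where
    every item of the itemset is observed (no NaN)."""
    cols = data[list(itemset)]
    # Assumption: a row is unusable if any item in the itemset is missing.
    usable = cols[~cols.isna().any(axis=1)]
    if len(usable) == 0:
        return float("nan")
    # Fraction of usable rows that contain all items of the itemset.
    return float((usable == 1).all(axis=1).mean())


# Baseline for comparison: treat NaN as "absent" and divide by all rows.
naive = float((df[["milk", "bread"]].fillna(0) == 1).all(axis=1).mean())
adjusted = support_ignoring_missing(df, ["milk", "bread"])
print(naive, adjusted)
```

In this toy data only two rows have both `milk` and `bread` observed, so the adjusted support is computed over those two rows, while the naive version divides by all four.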

The implementation accepts a pandas.DataFrame with binary values encoded as 0/1 or True/False (as before), and now also as 0/1/NaN or True/False/NaN, where NaN is numpy.nan.
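As a rough illustration of the accepted encodings, the sketch below builds two such frames and checks that every cell is 0/1, True/False, or NaN. The function `is_binary_with_nan` is a hypothetical helper written for this example, not part of mlxtend or of this PR.

```python
import numpy as np
import pandas as pd

# The same kind of data in the two encodings described above.
df_numeric = pd.DataFrame({"a": [1, 0, np.nan], "b": [0, 1, 1]})
df_boolean = pd.DataFrame({"a": [True, False, np.nan], "b": [False, True, True]})
df_invalid = pd.DataFrame({"a": [1, 2, np.nan], "b": [0, 1, 1]})


def is_binary_with_nan(data):
    """Hypothetical check: every cell is 0/1, True/False, or NaN."""
    flat = data.to_numpy(dtype=object).ravel()
    # True == 1 and False == 0 in Python, so (0, 1) covers both encodings.
    return all(pd.isna(v) or v in (0, 1) for v in flat)


print(is_binary_with_nan(df_numeric))
print(is_binary_with_nan(df_boolean))
print(is_binary_with_nan(df_invalid))
```

Note that a column mixing True/False with NaN is stored by pandas with object dtype, which is why the check goes cell by cell instead of relying on the column dtype.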

The paper on which this work is based, which studies association rule mining in the presence of missing values, is attached: ragel1998.pdf

Related issues or pull requests

Please use this link for the corresponding discussion of this new feature in the issue tracker.

Pull Request Checklist

[x] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
[x] Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
[x] Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
[x] Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
[x] Checked for style issues by running flake8 ./mlxtend

Oct 07 '24 10:10 zazass8