FPGrowth/FPMax and Association Rules with the existence of missing values (#1004)
Description
This PR extends the FP-Growth and FP-Max algorithms in frequent_patterns to handle missing values in the input dataset. The structure and logic of the algorithms are unchanged; the only difference is that the support metric is computed by "ignoring" the missing values in the data. This gives a more realistic indication of how frequently the generated items/itemsets occur. The association rules derived from the algorithms' output, and their metrics, are likewise updated to take missing values into account.
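To make the idea of "ignoring" missing values concrete, here is a minimal sketch in plain pandas. It assumes one possible reading: rows in which any item of the itemset is missing are excluded from both the numerator and the denominator. The helper name `support_ignoring_missing` is hypothetical and is not part of mlxtend, and the exact computation in this PR may differ.

```python
import numpy as np
import pandas as pd


def support_ignoring_missing(df, itemset):
    """Support of `itemset`, counted only over rows where every item is known.

    Hypothetical helper for illustration; not the mlxtend implementation.
    """
    cols = df[list(itemset)]
    # A row is usable only if none of the itemset's columns is NaN.
    usable = cols.notna().all(axis=1)
    n_usable = int(usable.sum())
    if n_usable == 0:
        return float("nan")  # no row carries enough information
    # Among usable rows, count those where every item is present.
    present = (cols[usable] == 1).all(axis=1)
    return float(present.sum()) / n_usable


df = pd.DataFrame(
    {
        "milk": [1, 1, np.nan, 0, 1],
        "bread": [1, np.nan, 1, 0, 1],
    }
)

milk_support = support_ignoring_missing(df, ["milk"])            # 3 of 4 known rows
pair_support = support_ignoring_missing(df, ["milk", "bread"])   # 2 of 3 fully known rows
naive_milk_support = float((df["milk"] == 1).sum()) / len(df)    # NaN counted as absent
```

In this toy data, treating NaN as "absent" gives milk a support of 0.6, while excluding the unknown row gives 0.75.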
The input is a pandas.DataFrame of binary values: either 0/1 or True/False (as before), or, when values are missing, 0/1/NaN or True/False/NaN, where NaN is numpy.nan.
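As an illustration of the accepted input shapes, the sketch below builds both encodings and checks them with a small validator. The function `has_only_binary_or_nan` is a hypothetical helper written for this example; it is not the validation code used in mlxtend.

```python
import numpy as np
import pandas as pd


def has_only_binary_or_nan(df):
    """Return True if every cell is 0/1 (or False/True) or numpy.nan.

    Hypothetical check for illustration only.
    """
    try:
        # True/False become 1.0/0.0; numpy.nan stays NaN.
        values = df.to_numpy(dtype=float)
    except (TypeError, ValueError):
        return False  # e.g. strings or other non-numeric cells
    known = ~np.isnan(values)
    return bool(np.isin(values[known], (0.0, 1.0)).all())


# 0/1/NaN encoding
numeric_df = pd.DataFrame({"a": [0, 1, np.nan], "b": [1, 0, 1]})

# True/False/NaN encoding (the column with NaN becomes object dtype)
boolean_df = pd.DataFrame({"a": [True, False, np.nan], "b": [False, True, True]})

# Not accepted: a value other than 0/1/NaN
invalid_df = pd.DataFrame({"a": [0, 2, np.nan]})

numeric_ok = has_only_binary_or_nan(numeric_df)
boolean_ok = has_only_binary_or_nan(boolean_df)
invalid_ok = has_only_binary_or_nan(invalid_df)
```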
Please also find attached the paper on association rule mining with missing values that this work is based on: ragel1998.pdf
Related issues or pull requests
Please use this link for the corresponding discussion of this new feature in the issue tracker.
Pull Request Checklist
- [x] Added a note about the modification or contribution to the `./docs/sources/CHANGELOG.md` file (if applicable)
- [x] Added appropriate unit test functions in the `./mlxtend/*/tests` directories (if applicable)
- [x] Modify documentation in the corresponding Jupyter Notebook under `mlxtend/docs/sources/` (if applicable)
- [x] Ran `PYTHONPATH='.' pytest ./mlxtend -sv` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [x] Checked for style issues by running `flake8 ./mlxtend`