Is there a way to use the TransactionEncoder and FP-Growth with a large CSV?
Per the example:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)  # dense boolean array
df = pd.DataFrame(te_ary, columns=te.columns_)
What's the best way to use chunking so that we can get a transformed DataFrame that can be used for rule mining?
Currently, the implementations of the TransactionEncoder and frequent itemset mining algorithms don't support chunking.
What may help, though, is using a sparse DataFrame for the frequent itemset and rule mining. For example, if you call .transform(X, sparse=True) on the TransactionEncoder, it will return a scipy sparse (csr) matrix, which you can then wrap in a sparse DataFrame.
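E.g., reusing the toy dataset from above (I haven't benchmarked this on a really large dataset, but it mirrors the pattern from our docs):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset, sparse=True)  # scipy csr_matrix
sparse_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# fpgrowth accepts the sparse DataFrame directly
frequent_itemsets = fpgrowth(sparse_df, min_support=0.6, use_colnames=True)

The sparse representation only stores the True entries, so for typical transaction data (many items, few items per transaction) it should be substantially smaller than the dense boolean DataFrame.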
It just occurs to me that something like Dask DataFrames, which have out-of-core support, could also work, but I have not tested this -- currently, we only use pandas DataFrames for testing.
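As an untested sketch of what that could look like (assuming a hypothetical transactions.csv with transaction_id and item columns; Dask would only handle the out-of-core CSV reading and grouping, and the collected transaction lists are then encoded as usual):

import dask.dataframe as dd
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Read the large CSV out of core; only the grouped transaction lists
# (one list of items per transaction) are collected into memory.
ddf = dd.read_csv("transactions.csv")  # hypothetical layout: transaction_id,item
transactions = (
    ddf.groupby("transaction_id")["item"]
       .apply(list, meta=("item", "object"))
       .compute()
       .tolist()
)

# From here on it's the usual mlxtend workflow, with the sparse option
# keeping the one-hot matrix small.
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
frequent_itemsets = fpgrowth(sparse_df, min_support=0.01, use_colnames=True)

Note that the encoded matrix and the mined itemsets still have to fit into memory; Dask would only take care of the CSV reading and grouping step here.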