
Is there a way to use the TransactionEncoder and FP Growth with a large CSV?

jnguyen32 opened this issue on Sep 06 '19 · 2 comments

Per the example:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)  # dense boolean array
df = pd.DataFrame(te_ary, columns=te.columns_)

What's the best way to use chunking so that we can get a transformed dataframe that can be used for rule mining?

jnguyen32 avatar Sep 06 '19 17:09 jnguyen32

Currently, the implementations of the TransactionEncoder and frequent itemset mining algorithms don't support chunking.

What may help, though, is using a sparse representation for frequent itemset and rule mining. For example, if you call .transform(X, sparse=True) on the TransactionEncoder, it will return a scipy sparse matrix, which you can wrap in a sparse pandas DataFrame.
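For illustration, a minimal sketch of that route on the toy dataset above, assuming a pandas version with the sparse accessor (0.25+); min_support=0.6 is an arbitrary choice for this example:

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd

te = TransactionEncoder()
# sparse=True returns a scipy CSR matrix instead of a dense array
te_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# fpgrowth accepts the sparse DataFrame directly
frequent_itemsets = fpgrowth(sparse_df, min_support=0.6, use_colnames=True)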

rasbt avatar Sep 06 '19 20:09 rasbt

It just occurred to me that something like Dask DataFrames, which have out-of-core support, could also work, but I have not tested this -- currently, we only use pandas DataFrames for testing.
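Short of Dask, here is an untested sketch of a manual two-pass chunked encoding that keeps only one batch of raw transactions in memory at a time and combines the sparse=True route above with scipy.sparse.vstack. It assumes a headerless CSV where each row is one comma-separated transaction, and the file name large_transactions.csv is hypothetical:

from itertools import islice
from scipy.sparse import vstack
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

def iter_batches(path, batch_size=100_000):
    # Yield lists of transactions, batch_size rows at a time.
    with open(path) as f:
        while True:
            rows = [line.rstrip("\n").split(",") for line in islice(f, batch_size)]
            if not rows:
                return
            yield rows

path = "large_transactions.csv"  # hypothetical input file

# Pass 1: collect the item vocabulary without loading everything at once.
items = set()
for batch in iter_batches(path):
    for txn in batch:
        items.update(txn)

te = TransactionEncoder()
te.fit([sorted(items)])  # a single "transaction" with every item fixes columns_

# Pass 2: encode each batch to a sparse CSR matrix and stack the pieces.
parts = [te.transform(batch, sparse=True) for batch in iter_batches(path)]
sparse_df = pd.DataFrame.sparse.from_spmatrix(vstack(parts), columns=te.columns_)

The resulting sparse_df could then be fed to fpgrowth as in the previous snippet; the remaining memory cost sits in the stacked sparse matrix and the mining step itself.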

rasbt avatar Sep 06 '19 20:09 rasbt