mlxtend icon indicating copy to clipboard operation
mlxtend copied to clipboard

Apriori - I'm only interested in one product / consequent. Can I speed up the apriori algorithm?

Open soliverc opened this issue 4 years ago • 6 comments

For example, if I'm only interested in items frequently bought with eggs, is there a way to avoid generating itemsets for every other possible combination? This takes a lot of time, and with the generated rule set I end up filtering the huge dataframe to find only rows where consequents == eggs, discarding everything else. Can I speed this process up if I'm only interested in one consequent?

soliverc avatar Jan 22 '20 14:01 soliverc

Sorry, there's currently no option for that. A "must_contain=('some item', 'some other item', ...)" feature could be a useful enhancement though.

rasbt avatar Jan 22 '20 15:01 rasbt

There are two things here. If you are only interested in frequent items containing some item, a "must_contain" type feature would be useful, and fairly straightforward to implement.

If you are interested in filtering on the consequent in rules, then that is more challenging. For example, say you are only interested in rules with consequent==eggs. You cannot mine only frequent patterns containing "eggs" because the rules rely on supports of other items. And, to get those, you need patterns which may not contain "eggs".

harenbergsd avatar May 20 '20 03:05 harenbergsd

I agree with you @harenbergsd . In addition, while adding a check for whether frequent itemsets contain a particular item may look useful in practice, I am not sure if that would necessarily speed things up (compared to pruning the resulting data frame). I think the major thing that it would accomplish is reducing the memory footprint of the resulting dataframe, because it really is just a check of whether an already computed itemset is to be added to the dataframe or not.

rasbt avatar May 20 '20 21:05 rasbt

I found this blog post which I believe talks about this sort of thing: https://wimleers.com/article/fp-growth-powered-association-rule-mining-with-support-for-constraints

Like you, I do wonder how much perf improvement you could get by doing this sort of thing. It's certainly an interesting idea. You have to imagine, if you are only interested in a few consequents, you end up doing a lot of extra work and tossing stuff a way. But, it's tricky to know what is the extra work. Nice research project, depending on existing literature in this area :)

harenbergsd avatar May 23 '20 15:05 harenbergsd

The performance of MLXtend "apriori" is pretty poor. It does not implement the optimizations of the original Apriori algorithm such as the prefix join to avoid generating unnecessary candidates or the hash tree; but it is a quite naive version instead that generates way too many candidates. See #644 - the runtime of MLXtend was 2 minutes, of pyfim just 150 ms; so MLXtend was about 750x slower. Hence if performance is a problem for you, consider using the well-known much faster "fim" module by Christian Borgelt: http://www.borgelt.net/pyfim.html - or improve the Apriori of MLXtend and first add the missing nice optimizations that are part of the Apriori algorithm.

kno10 avatar May 26 '20 11:05 kno10

Yes, https://github.com/rasbt/mlxtend/pull/646 should address that.

rasbt avatar May 26 '20 15:05 rasbt