
Apriori is not Apriori. It is missing all the nice optimizations

Open kno10 opened this issue 6 years ago • 4 comments

The current apriori implementation is really really slow.

It does not correctly implement the apriori-gen and pruning steps of Apriori, so it is not really Apriori as published, but a naive approximation of it. The current code uses (previous itemsets) x (items) to generate candidate tuples, which produces many false candidates that the original Apriori would avoid generating in the first place (and hence would never need to count later); see the sketch below. For good performance, it is crucial to generate only very few candidates.
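To make the difference concrete, here is a minimal sketch of apriori-gen (join + prune) as described in the original paper; the function name and the tuples-of-sorted-items representation are only illustrative, not mlxtend's API:

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """apriori-gen from Agrawal & Srikant (1994): join frequent
    (k-1)-itemsets that share their first k-2 items, then prune
    every candidate that has an infrequent (k-1)-subset.

    frequent_prev: iterable of sorted tuples, all of length k-1.
    """
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    candidates = []
    # Join step: a and b agree on everything but the last item,
    # producing one sorted k-candidate per matching pair.
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] != b[:-1]:
                break  # sorted order: no later b can match either
            candidates.append(a + (b[-1],))
    # Prune step: every (k-1)-subset must itself be frequent.
    return [c for c in candidates
            if all(s in prev_set for s in combinations(c, len(c) - 1))]
```

For example, from the frequent 2-itemsets (1,2), (1,3), (1,4), (2,3), the join step produces (1,2,3), (1,2,4), (1,3,4); pruning drops the latter two because (2,4) and (3,4) are not frequent, so only one candidate ever reaches the counting phase.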

As such, this implementation benchmarks particularly poorly on a data set with 33841 transactions and 1587 items at minsupp=0.005 (note that apyori also does not correctly implement Apriori). I originally wanted to use minsupp=0.001, but eventually interrupted mlxtend because it was too slow, so these results are with a 5x higher minsupp. Experiments were run on Google Colab, so no guarantees on the reliability of these measurements, but they repeatedly gave similar results with a similar data set before.

| Implementation | Run time |
| --- | --- |
| mlxtend | 1 min 55 s |
| apyori | 4 min 2 s |
| efficient-apriori | 2 min 56 s |
| ELKI (Java) | 7 s |
| pyFIM Apriori | 149 ms |

This is not just because pyFIM is a C library, but also because it really implements the Apriori algorithm with some of its clever optimizations (cf. apriori-gen in the original paper, i.e., join + prune, as well as optimizations to the counting in each iteration). The ELKI implementation does not implement the hash tree of the original Apriori, so the 7 seconds could likely be improved further... And yes, I know that FP-growth is much better; pyFIM has a very nice implementation of it, too, which took just 90 ms on this data.

For example, at length 3 there are 2802 frequent itemsets, containing 719 different items. The mlxtend code will generate 2802*719=2014638 candidates from these. The real Apriori algorithm joins only 40095 candidates, which are then pruned to just 5737 candidates before going back to the data. So in this step the mlxtend code has to check about 350x as many candidates as the original Apriori algorithm (see the arithmetic below). This will, of course, have a major impact on runtime.
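Spelled out with the numbers above:

```python
naive  = 2802 * 719    # (previous itemsets) x (items) = 2014638 candidates
joined = 40095         # after apriori-gen's join step
pruned = 5737          # after subset pruning, before scanning the data
print(naive // pruned) # 351 -> roughly 350x more candidates in mlxtend
```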

kno10 · Dec 13 '19 22:12

Yes, it is currently a naive implementation and could be optimized.

rasbt · Dec 13 '19 23:12

@kno10 Are your benchmark scripts and data sets available somewhere?

dbarbier · Dec 15 '19 09:12

A collection of standard benchmark data sets for FIM is here: http://fimi.uantwerpen.be/data/ An effort to benchmark these implementations should likely begin with this widely used collection.

kno10 · Dec 15 '19 11:12

That's a nice collection, thanks for sharing! I'm not sure what the sharing policies are exactly, but it would be interesting to incorporate one or two of the smaller data sets into the unit tests as well. Also, to aid the potential improvement of the existing implementations, a developer script could be added to mlxtend.frequent_patterns for running a set of benchmarks based on these datasets. This could even be a function in mlxtend.frequent_patterns that users call with different datasets fetched from the website above, to assess the speed of these implementations on their local machine (rough sketch below).
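A minimal sketch of what such a helper could look like, assuming the hypothetical name benchmark_fimi and the FIMI file format (one transaction per line, whitespace-separated item IDs); only TransactionEncoder and apriori are existing mlxtend APIs here:

```python
import time
import urllib.request

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori


def benchmark_fimi(url, min_support=0.005):
    """Hypothetical helper: fetch a FIMI dataset and time mlxtend's apriori."""
    raw = urllib.request.urlopen(url).read().decode("ascii")
    transactions = [line.split() for line in raw.splitlines() if line]
    # One-hot encode the transactions into the boolean DataFrame
    # that mlxtend's apriori expects.
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)
    start = time.perf_counter()
    itemsets = apriori(onehot, min_support=min_support, use_colnames=True)
    return itemsets, time.perf_counter() - start


# e.g., with one of the FIMI datasets:
# itemsets, secs = benchmark_fimi("http://fimi.uantwerpen.be/data/retail.dat")
```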

rasbt · Dec 15 '19 17:12