[WIP] Implement apriori-gen as in original paper
Description
Implement apriori-gen as in the original Apriori paper. This is a draft PR for discussion; several different changes are proposed and they should be benchmarked. (A rough sketch of the join and prune steps is shown after this list.)
- First commit implements the join step
- Second commit implements the prune step
- Third commit enforces `low_memory=True` processing; thanks to the previous optimizations, it is now as fast as `low_memory=False` and requires less memory; frequent itemsets are stored as lists of tuples instead of NumPy arrays
- Fourth commit replaces the trie implementation with a set
- Fifth commit replaces the `_support` function, which was really slow on some test cases
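For reference, here is a rough sketch of what the join and prune steps of apriori-gen (Agrawal & Srikant, 1994) do on itemsets stored as sorted tuples. This is only an illustration of the algorithm, not the code in this PR; the function name and the toy input are made up.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets (sorted tuples)."""
    frequent = set(frequent_k)
    candidates = []
    for a in frequent_k:
        for b in frequent_k:
            # Join step: merge two itemsets that share their first k-1 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune step: keep c only if every k-subset of c is frequent.
                if all(s in frequent for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates


# Example: frequent 2-itemsets -> candidate 3-itemsets
print(apriori_gen([(1, 2), (1, 3), (2, 3), (2, 4)]))  # [(1, 2, 3)]
```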
Related issues or pull requests
Reported in #644.
Pull Request Checklist
- [ ] Added a note about the modification or contribution to the `./docs/sources/CHANGELOG.md` file (if applicable)
- [ ] Added appropriate unit test functions in the `./mlxtend/*/tests` directories (if applicable)
- [ ] Modify documentation in the corresponding Jupyter Notebook under `mlxtend/docs/sources/` (if applicable)
- [ ] Ran `PYTHONPATH='.' pytest ./mlxtend -sv` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [ ] Checked for style issues by running `flake8 ./mlxtend`
Checklist empty for now since this is a draft pull request.
Grrr, the UI for draft PRs is terrible; I forgot to change the PR type to draft :-/
Hello @dbarbier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2020-01-06 13:19:09 UTC
Here are some benchmark results; the data must first be downloaded from http://fimi.uantwerpen.be/data/ and put inside a `data` subdirectory. Some runs take a very long time, so I decided to set a timeout of 120 s; in that case NA is reported in the tables below, along with the itemset size that was being processed. The script below only works on Linux; I am not sure how to implement the timeout on Windows.
Benchmark script
```python
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd
import numpy as np
import gzip
import os
from time import time
import signal
from contextlib import contextmanager


@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)


def raise_timeout(signum, frame):
    raise TimeoutError


files = [
    # "chess.dat.gz",
    "connect.dat.gz",
    "mushroom.dat.gz",
    "pumsb.dat.gz",
    "pumsb_star.dat.gz",
    # These 3 files are too large:
    # "T10I4D100K.dat.gz",
    # "T40I10D100K.dat.gz",
    # "kosarak.dat.gz",
]

# Modify these 2 variables
sparse = False
low_memory = True

for filename in files:
    with gzip.open(os.path.join("data", filename)) if filename.endswith(
        ".gz"
    ) else open(os.path.join("data", filename)) as f:
        data = f.readlines()
    dataset = [list(map(int, line.split())) for line in data]
    items = np.unique([item for itemset in dataset for item in itemset])
    print(f"{filename} contains {len(dataset)} transactions and {len(items)} items")
    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
    if sparse:
        try:
            df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
        except AttributeError:
            # pandas < 0.25
            df = pd.SparseDataFrame(te_ary, columns=te.columns_,
                                    default_fill_value=False)
    else:
        df = pd.DataFrame(te_ary, columns=te.columns_)
    df.columns = ["c" + str(i) for i in df.columns]
    for min_support in [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005]:
        tick = time()
        with timeout(120):
            print(apriori(df, min_support=min_support, verbose=1,
                          use_colnames=False, low_memory=low_memory))
        print(f"\nmin_support={min_support} time: {time() - tick}\n")
```
Some commit columns carry either an asterisk (*) or a plus sign (+), as well as (F) or (T):
- An asterisk means the code was modified to operate on smaller matrix blocks, because there would otherwise be memory errors on my machine. All such commits use the same block size, so their relative speed is still relevant.
- A plus sign means that 1 should be added to the number between parentheses; the reason is that commit 4c82dcf displays the number of combinations after processing them instead of before, so the last number is missing when the timeout is reached.
- F or T refers to the `low_memory` argument (either `False` or `True`).
With dense DataFrame:
connect.dat.gz 67557 transactions and 129 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | NA(5) | NA(4) | NA(5) | NA(5) | NA(5) | NA(6) | NA(4) | NA(7) | NA(7) | NA+(6) | NA+(6) |
0.3 | NA(5) | NA(4) | NA(4) | NA(5) | NA(5) | NA(6) | NA(4) | NA(6) | NA(6) | NA+(5) | NA+(5) |
0.1 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(5) | NA(4) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.05 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.03 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.01 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.005 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
mushroom.dat.gz 8124 transactions and 119 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.02 |
0.3 | 1.24 | 0.64 | 0.38 | 0.35 | 0.37 | 0.13 | 0.89 | 0.11 | 0.11 | 0.13 | 0.10 |
0.1 | NA(7) | NA(12) | 107 | 103 | 109 | 35 | 89 | 32 | 33 | 38 | 25 |
0.05 | NA(6) | NA(7) | NA(7) | NA(7) | NA(7) | NA(7) | NA(7) | NA(9) | NA(9) | NA+(8) | NA+(9) |
0.03 | NA(5) | NA(6) | NA(6) | NA(6) | NA(6) | NA(7) | NA(7) | NA(7) | NA(7) | NA+(7) | NA+(7) |
0.01 | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(6) | NA(6) | NA(6) | NA(6) | NA+(5) | NA+(6) |
0.005 | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA+(5) | NA+(5) |
pumsb.dat.gz 49046 transactions and 2113 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | NA(5) | NA(3) | NA(5) | NA(5) | NA(5) | NA(6) | NA(3) | NA(6) | NA(6) | NA+(6) | NA+(6) |
0.3 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(5) | NA(2) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.1 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.05 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.03 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.01 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.005 | NA(3) | NA(2) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
pumsb_star.dat.gz 49046 transactions and 2088 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 3.76 | 9.40 | 0.76 | 0.73 | 0.59 | 0.46 | 53 | 0.34 | 0.37 | 0.34 | 0.35 |
0.3 | NA(5) | NA(4) | NA(7) | NA(7) | NA(7) | 66 | NA(3) | 35 | 36 | 41 | 31 |
0.1 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(5) | NA(2) | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.05 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(2) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.03 | NA(3) | NA(3) | NA(4) | NA(3) | NA(4) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.01 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.005 | NA(3) | NA(2) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
With sparse DataFrame:
connect.dat.gz 67557 transactions and 129 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.3 | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.1 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.05 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.03 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.01 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.005 | NA(3) | NA(3) | NA+(3) | NA+(3) |
mushroom.dat.gz 8124 transactions and 119 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | 0.18 | 0.18 | 0.07 | 0.06 |
0.3 | 3.07 | 3.19 | 0.92 | 0.84 |
0.1 | NA(6) | NA(6) | NA+(7) | NA+(8) |
0.05 | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.03 | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.01 | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.005 | NA(4) | NA(4) | NA+(4) | NA+(4) |
pumsb.dat.gz 49046 transactions and 2113 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.3 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.1 | NA(3) | NA(3) | NA+(2) | NA+(2) |
0.05 | NA(3) | NA(3) | NA+(2) | NA+(2) |
0.03 | NA(3) | NA(2) | NA+(2) | NA+(2) |
0.01 | NA(2) | NA(2) | NA+(2) | NA+(2) |
0.005 | NA(2) | NA(2) | NA+(2) | NA+(2) |
pumsb_star.dat.gz 49046 transactions and 2088 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | 2.97 | 3.08 | 1.16 | 1.16 |
0.3 | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.1 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.05 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.03 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.01 | NA(3) | NA(3) | NA+(2) | NA+(2) |
0.005 | NA(2) | NA(2) | NA+(2) | NA+(2) |
More tests should be run, maybe with kosarak.dat; once the best option is decided, I will remove/squash/rearrange commits and update docstrings. IMHO the best option is the current head.
Okay, I ran some tests with kosarak.dat and `sparse=False`. For small inputs there is a noticeable overhead, which is caused by `np.asfortranarray`. These tests were run with a DataFrame in row-major storage format (`order='C'` in NumPy). In the tables below, an asterisk means that the run was performed with a DataFrame already in column-major storage format, so that `np.asfortranarray` becomes a no-op; this is achieved by adding
`df = pd.DataFrame({col: df[col] for col in df.columns})`
(F) and (T) still refer to the `low_memory=False` and `low_memory=True` arguments, respectively.
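For illustration, here is a minimal sketch of how one might check a dense DataFrame's underlying storage order and rebuild it column by column. This is not part of the PR; the input frame is hypothetical, and whether the flags print `True` depends on the pandas version and on how the frame was constructed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((1000, 50), dtype=bool))  # hypothetical row-major input

# Inspect the memory layout of the underlying values.
print(df.values.flags['C_CONTIGUOUS'], df.values.flags['F_CONTIGUOUS'])

# Rebuilding from a dict of columns typically yields a column-major block,
# so np.asfortranarray(df.values) becomes (close to) a no-op.
df_col = pd.DataFrame({col: df[col] for col in df.columns})
print(df_col.values.flags['F_CONTIGUOUS'])
```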
First 1k lines of kosarak.dat.gz: 1000 transactions and 3259 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
0.3 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
0.1 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
0.05 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
0.03 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
0.01 | 0.11 | 0.04 | 0.06 | 0.10 | 0.07 | 0.07 |
0.005 | 3.58 | 0.39 | 1.48 | 0.68 | 0.83 | 0.89 |
First 10k lines of kosarak.dat.gz: 10000 transactions and 10094 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | 0.10 | 0.15 | 0.10 | 0.22 | 0.33 | 0.09 |
0.3 | 0.10 | 0.17 | 0.10 | 0.47 | 0.33 | 0.09 |
0.1 | 0.10 | 0.19 | 0.10 | 0.70 | 0.33 | 0.09 |
0.05 | 0.11 | 0.20 | 0.11 | 1.24 | 0.33 | 0.09 |
0.03 | 0.13 | 0.22 | 0.11 | 1.77 | 0.33 | 0.09 |
0.01 | 0.69 | 0.30 | 0.24 | 4.93 | 0.37 | 0.13 |
0.005 | 8.43 | 0.54 | 2.16 | 11.4 | 0.61 | 0.40 |
First 50k lines of kosarak.dat.gz: 50000 transactions and 18936 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | 0.78 | 1.00 | 1.17 | 1.90 | 2.70 | 0.86 |
0.3 | 0.78 | 1.20 | 0.74 | 4.23 | 2.71 | 0.86 |
0.1 | 0.78 | 1.37 | 1.17 | 6.35 | 2.71 | 0.86 |
0.05 | 0.82 | 1.70 | 0.75 | 11.7 | 2.71 | 0.87 |
0.03 | 0.90 | 1.92 | 1.21 | 16.9 | 2.71 | 0.87 |
0.01 | 5.02 | 2.95 | 1.51 | 46.4 | 2.77 | 0.92 |
0.005 | 51.3 | 4.96 | 12.8 | 107 | 3.12 | 1.30 |
First 100k lines of kosarak.dat.gz: 100000 transactions and 23496 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | - | 2.83 | - | 4.79 | 7.51 | 2.89 |
0.3 | - | 3.02 | - | 10.7 | 6.43 | 2.89 |
0.1 | - | 3.41 | - | 16.1 | 6.43 | 2.90 |
0.05 | - | 4.44 | - | 29.9 | 6.43 | 2.90 |
0.03 | - | 4.75 | - | 43.0 | 6.43 | 2.90 |
0.01 | - | 7.46 | - | 116 | 6.50 | 2.97 |
0.005 | - | 13.8 | - | NA(2) | 7.03 | 3.52 |
With all these results, it is not clear to me whether 1fabe25 is such a good idea; an alternative is to make users aware of the importance of storage order, so that they can check which layout is best adapted to their use case.
Sorry, I have been busy with grading and was then traveling over Xmas. The work in this PR is amazing, thanks a lot! Regarding the row vs column format, as you suggest, maybe it would be better to document this prominently in the docstring and have users decide which format they provide via the DataFrame.
I just saw the unit tests failing. I don't know why this happens on AppVeyor, but it seems that switching the installations from pip to conda helped. Maybe something got changed in the backend.
Regarding Travis CI, some discrepancies occurred after things got switched to scikit-learn 0.22 in the "latest" version. I addressed this now as well; this was done in #652. I can do a rebase of this PR if you like, or you can do it yourself if you prefer -- I wanted to ask before I started messing around here ;)
I just rebased but won't be able to take care of failures for the next 24 hours, feel free to push fixes.
All good. Based on the error logs, these are "natural" ones due to WIP and not issues with the CI.
There were indeed some bugs; because of these fixes, timings may be slightly different, so I will rerun the benchmarks in a few days.
What do you think about adding a/the benchmark script(s) to `mlxtend/frequent_patterns/tests/benchmark.py` (not `test_benchmark.py`, so it doesn't get executed by default when running pytest)?
It may be useful to have this as a reference in case of future modifications to the codebase.
*sry, not sure why this PR was closed. I must have hit some keyboard combination -- never happened before. This was certainly not intentional.
I rearranged the commits, they look good now IMHO. About the benchmark script, I do not know how to do that; there are many parameters: data files, sparse=True/False, column_major=True/False, and the list of min_support values (which may depend on the data file). Anyway, it has been committed.
Should the data files be copied into `mlxtend/data/data` with a Python function to load them? Here are their sizes:
chess.dat.gz 14K
connect.dat.gz 362K
kosarak-1k.dat.gz 14K
kosarak-10k.dat.gz 126K
kosarak-50k.dat.gz 616K
kosarak-100k.dat.gz 1.3M
kosarak-200k.dat.gz 2.5M
kosarak.dat.gz 13M
mushroom.dat.gz 34K
pumsb.dat.gz 1.3M
pumsb_star.dat.gz 1.2M
T10I4D100K.dat.gz 1.4M
T40I10D100K.dat.gz 4.8M
Here are more benchmarks. In these tables, `s=` refers to the `sparse` variable, `c=` to `col_major`, and `T`/`F` stand for `True`/`False`.
T10I4D100K.dat.gz 100000 transactions and 870 items, low_memory=True
min_support | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
---|---|---|---|---|---|---|
master (s=F,c=T) | 0.26 | 0.92 | 4.0 | 7.2 | 14.5 | 37.5 |
master (s=T,c=T) | 0.06 | 0.24 | 3.9 | 8.9 | 29.3 | NA(3) |
eb80667(s=F,c=T) | 0.01 | 0.05 | 2.0 | 4.8 | 7.3 | 11.8 |
eb80667(s=T,c=T) | 0.05 | 0.44 | 11.3 | 23.5 | 33.5 | 57.2 |
T40I10D100K.dat.gz 100000 transactions and 942 items, low_memory=True
Note that the timeout was increased to 300 s when running commit eb80667, which explains values above 120 s.
min_support | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
---|---|---|---|---|---|---|---|
master (s=F,c=T) | 2.0 | 7.2 | 15.0 | NA(2) | NA(2) | NA(2) | NA(2) |
master (s=T,c=T) | 0.67 | 5.4 | 13.3 | NA(2) | NA(2) | NA(2) | NA(2) |
eb80667(s=F,c=T) | 0.01 | 1.3 | 3.5 | 20.7 | 176 | NA(6) | NA(2) |
eb80667(s=T,c=T) | 1.32 | 14.0 | 32.9 | 232 | NA(2) | NA(2) | NA(2) |
kosarak-*k.dat.gz, low_memory=True
min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
---|---|---|---|---|---|---|---|---|---|
kosarak-1k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.005 | 0.006 | 0.007 | 0.008 | 0.01 | 0.27 | 0.31 | NA(11) | NA(2) |
eb80667(s=F,c=F) | 0.009 | 0.009 | 0.01 | 0.01 | 0.01 | 0.05 | 0.60 | NA(9) | NA(2) |
eb80667(s=T,c=T) | 0.08 | 0.08 | 0.09 | 0.09 | 0.09 | 0.14 | 0.92 | NA(8) | NA(2) |
kosarak-10k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.04 | 0.14 | 0.58 | NA(5) |
eb80667(s=F,c=F) | 0.26 | 0.26 | 0.26 | 0.26 | 0.27 | 0.36 | 1.02 | 4.4 | NA(4) |
eb80667(s=T,c=T) | 0.26 | 0.26 | 0.26 | 0.27 | 0.28 | 0.36 | 0.80 | 2.4 | NA(4) |
kosarak-50k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.12 | 0.33 | 1.06 | 61.1 |
eb80667(s=F,c=F) | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.6 | 6.7 | 20.1 | NA(1) |
eb80667(s=T,c=T) | 0.49 | 0.49 | 0.50 | 0.51 | 0.53 | 0.78 | 2.1 | 5.9 | NA(5) |
kosarak-100k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 | 0.24 | 0.60 | 1.85 | 67.9 |
eb80667(s=F,c=F) | 4.9 | 4.9 | 4.9 | 5.0 | 5.1 | 6.4 | 15.7 | - | - |
eb80667(s=T,c=T) | 0.62 | 0.62 | 0.62 | 0.65 | 0.70 | 1.15 | 3.6 | 10.2 | NA(9) |
kosarak-200k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.60 | 1.3 | 3.8 | 123 |
eb80667(s=T,c=T) | 0.80 | 0.80 | 0.80 | 0.85 | 0.92 | 1.83 | 6.3 | 18.8 | NA(5) |
Some remarks:
- Processing times are much higher with sparse DataFrames. I did not really investigate this issue; memory usage for the dense array is now very low (except for the array itself), so using a sparse DataFrame may get deprecated. Also note that TransactionEncoder uses a CSR format, which is logical, but it has to be converted to CSC for apriori, and this conversion takes time and memory (a small sketch of this conversion follows this list).
- The implementation does not use hash trees. The reason is that the current API expects a DataFrame, and with hash trees it seems more logical to work directly on a list of transactions.
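As a stand-alone illustration of the CSR-to-CSC point above (not the PR's code; `te_csr` is a hypothetical stand-in for the sparse output of TransactionEncoder):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-in for TransactionEncoder(sparse=True) output, CSR format.
te_csr = sp.random(1000, 50, density=0.1, format="csr", dtype=np.float64)

# apriori counts supports column by column, so CSC is the natural layout;
# the conversion copies the index and data arrays, which costs time and memory.
te_csc = te_csr.tocsc()
support = np.asarray(te_csc.sum(axis=0)).ravel() / te_csc.shape[0]
print(support[:5])
```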
Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submission deadlines in mid-January.
In any case, I am really thankful for all the good work you put into this. This is really awesome. And I can take care of the automatic data downloads from here.
Regarding

> but it has to be converted to CSC for apriori, and this conversion takes time and memory.

Maybe that's something we could add to the apriori docs, i.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then.)
> Processing times are much higher with sparse DataFrames. I did not really investigate this issue; memory usage for the dense array is now very low (except for the array itself), so using a sparse DataFrame may get deprecated.

I am not an expert at this by any means, but I think they are usually only memory efficient but not necessarily efficient when it comes to processing times.
> Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submission deadlines in mid-January.

No worries, this issue is not trivial and requires careful thinking; please take your time.
Other apriori implementations take a list of transactions as input; several optimizations can then be performed:
- use a sparse representation of this list of transactions (the Apriori paper suggests hash trees, some other authors prefer prefix trees), which speeds up itemset counts (supports)
- remove transactions during processing when they can no longer contain frequent itemsets, for instance when their number of items is less than `next_max_itemset` (see the sketch after this list)
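To make the second optimization concrete, a tiny sketch with a hypothetical `transactions` list (this kind of pruning is not something the current DataFrame-based API performs):

```python
# Transactions shorter than the next candidate size can never contain a
# frequent itemset of that size, so they can be dropped from the scan.
next_max_itemset = 4
transactions = [(1, 2), (1, 2, 3, 5), (2, 3, 4, 6, 7)]
transactions = [t for t in transactions if len(t) >= next_max_itemset]
print(transactions)  # [(1, 2, 3, 5), (2, 3, 4, 6, 7)]
```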
Here it is very likely that the user loaded her dataset as a list of transactions and called `TransactionEncoder` to convert it into a pandas DataFrame in order to call the `apriori` function, so it does not make sense to me to let `apriori` internally convert its argument back into a list of transactions. IMHO the first design decision to consider is: do you want to keep the current input as a pandas DataFrame, or can it be changed to a list of transactions?
If the former, the optimizations mentioned above cannot be performed and this PR is almost done. If the latter, a lot more work is needed.
> [...] but it has to be converted to CSC for apriori, and this conversion takes time and memory.
>
> Maybe that's something we could add to the apriori docs. I.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then.) [...]
Sorry, I do not understand your point; a dense pandas DataFrame can use either row-major or column-major storage, and it looks like this depends on which DataFrame constructor was called (2-d array vs dict). We could indeed add a cell showing how to convert the input DataFrame to speed up the apriori function.
But as far as I can tell, this is different for a sparse DataFrame: the user has no control over the internal storage, and all we can do is call `df.to_coo().tocsc()`. If the input argument could be passed as a NumPy array (or a SciPy sparse matrix), that would be different.
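For completeness, a sketch of that conversion with the sparse accessor available in pandas >= 0.25 (older pandas used `SparseDataFrame.to_coo()` as written above); the input matrix here is only an example, not TransactionEncoder output:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Build a DataFrame with sparse columns from an example CSR matrix.
sp_df = pd.DataFrame.sparse.from_spmatrix(csr_matrix(np.eye(4, dtype=bool)))

# The user cannot choose the internal layout, so getting CSC always costs a copy.
csc = sp_df.sparse.to_coo().tocsc()
print(csc.shape)
```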
Sorry, still haven't had time to look into this more. Along with the new semester (lots of teaching) & 2 paper deadlines in January, there wasn't time for much else recently. I am currently making a 0.17.1 bugfix release with the recent changes -- someone from industry contacted me about this, because due to company firewalls, several people can only install it from PyPI (not from GitHub directly). Will revisit this PR soon though -- thanks for all the work on it so far!