[WIP] Implement apriori-gen as in original paper
Description
Implement apriori-gen as in the original Apriori paper. This is a draft PR for discussion; several different changes are proposed and they should be benchmarked. (A rough sketch of the join and prune steps is shown after this list.)
- First commit implements the join step
- Second commit implements the prune step
- Third commit enforces `low_memory=True` processing; thanks to the previous optimizations, it is now as fast as `low_memory=False` and requires less memory; frequent itemsets are stored as lists of tuples instead of NumPy arrays
- Fourth commit replaces the trie implementation with a set
- Fifth commit replaces the `_support` function, which was really slow on some test cases
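For reference, here is a rough sketch of what the join and prune steps of apriori-gen (Agrawal & Srikant, 1994) do on itemsets stored as sorted tuples. This is only an illustration of the algorithm, not the code in this PR; the function name and the toy input are made up.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets (sorted tuples)."""
    frequent = set(frequent_k)
    candidates = []
    for a in frequent_k:
        for b in frequent_k:
            # Join step: merge two itemsets that share their first k-1 items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune step: keep c only if every k-subset of c is frequent.
                if all(s in frequent for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates


# Example: frequent 2-itemsets -> candidate 3-itemsets
print(apriori_gen([(1, 2), (1, 3), (2, 3), (2, 4)]))  # [(1, 2, 3)]
```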
Related issues or pull requests
Reported in #644.
Pull Request Checklist
- [ ] Added a note about the modification or contribution to the `./docs/sources/CHANGELOG.md` file (if applicable)
- [ ] Added appropriate unit test functions in the `./mlxtend/*/tests` directories (if applicable)
- [ ] Modify documentation in the corresponding Jupyter Notebook under `mlxtend/docs/sources/` (if applicable)
- [ ] Ran `PYTHONPATH='.' pytest ./mlxtend -sv` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [ ] Checked for style issues by running `flake8 ./mlxtend`
Checklist empty for now since this is a draft pull request.
Grrr, the UI for draft PRs is terrible; I forgot to change the PR type to draft :-/
Hello @dbarbier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2020-01-06 13:19:09 UTC
Here are some benchmark results; the data must first be downloaded from http://fimi.uantwerpen.be/data/ and put inside a `data` subdirectory. Some runs take a very long time, so I decided to set a timeout of 120 s; in that case NA is reported in the tables below, along with the itemset size that was being processed. The script below only works on Linux; I am not sure how to implement the timeout on Windows.
Benchmark script
```python
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd
import numpy as np
import gzip
import os
from time import time
import signal
from contextlib import contextmanager


@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)


def raise_timeout(signum, frame):
    raise TimeoutError


files = [
    # "chess.dat.gz",
    "connect.dat.gz",
    "mushroom.dat.gz",
    "pumsb.dat.gz",
    "pumsb_star.dat.gz",
    # These 3 files are too large:
    # "T10I4D100K.dat.gz",
    # "T40I10D100K.dat.gz",
    # "kosarak.dat.gz",
]

# Modify these 2 variables
sparse = False
low_memory = True

for filename in files:
    with gzip.open(os.path.join("data", filename)) if filename.endswith(
        ".gz"
    ) else open(os.path.join("data", filename)) as f:
        data = f.readlines()
    dataset = [list(map(int, line.split())) for line in data]
    items = np.unique([item for itemset in dataset for item in itemset])
    print(f"{filename} contains {len(dataset)} transactions and {len(items)} items")
    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
    if sparse:
        try:
            df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
        except AttributeError:
            # pandas < 0.25
            df = pd.SparseDataFrame(te_ary, columns=te.columns_,
                                    default_fill_value=False)
    else:
        df = pd.DataFrame(te_ary, columns=te.columns_)
    df.columns = ["c" + str(i) for i in df.columns]
    for min_support in [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005]:
        tick = time()
        with timeout(120):
            print(apriori(df, min_support=min_support, verbose=1,
                          use_colnames=False, low_memory=low_memory))
        print(f"\nmin_support={min_support} time: {time() - tick}\n")
```
Some commit columns carry either an asterisk (*) or a plus sign (+), as well as (F) or (T):
- An asterisk means the code was modified to operate on smaller matrix blocks, because there would otherwise be memory errors on my machine. All such commits use the same block size, so their relative speed is still relevant.
- A plus sign means that 1 should be added to the number between parentheses; the reason is that commit 4c82dcf displays the number of combinations after processing them instead of before, so the last number is missing when the timeout is reached.
- F or T refers to the `low_memory` argument (either `False` or `True`).
With dense DataFrame:
connect.dat.gz 67557 transactions and 129 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | NA(5) | NA(4) | NA(5) | NA(5) | NA(5) | NA(6) | NA(4) | NA(7) | NA(7) | NA+(6) | NA+(6) |
0.3 | NA(5) | NA(4) | NA(4) | NA(5) | NA(5) | NA(6) | NA(4) | NA(6) | NA(6) | NA+(5) | NA+(5) |
0.1 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(5) | NA(4) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.05 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.03 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.01 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.005 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
mushroom.dat.gz 8124 transactions and 119 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.02 |
0.3 | 1.24 | 0.64 | 0.38 | 0.35 | 0.37 | 0.13 | 0.89 | 0.11 | 0.11 | 0.13 | 0.10 |
0.1 | NA(7) | NA(12) | 107 | 103 | 109 | 35 | 89 | 32 | 33 | 38 | 25 |
0.05 | NA(6) | NA(7) | NA(7) | NA(7) | NA(7) | NA(7) | NA(7) | NA(9) | NA(9) | NA+(8) | NA+(9) |
0.03 | NA(5) | NA(6) | NA(6) | NA(6) | NA(6) | NA(7) | NA(7) | NA(7) | NA(7) | NA+(7) | NA+(7) |
0.01 | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(6) | NA(6) | NA(6) | NA(6) | NA+(5) | NA+(6) |
0.005 | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA+(5) | NA+(5) |
pumsb.dat.gz 49046 transactions and 2113 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | NA(5) | NA(3) | NA(5) | NA(5) | NA(5) | NA(6) | NA(3) | NA(6) | NA(6) | NA+(6) | NA+(6) |
0.3 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(5) | NA(2) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.1 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.05 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.03 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.01 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.005 | NA(3) | NA(2) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
pumsb_star.dat.gz 49046 transactions and 2088 items
min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 3.76 | 9.40 | 0.76 | 0.73 | 0.59 | 0.46 | 53 | 0.34 | 0.37 | 0.34 | 0.35 |
0.3 | NA(5) | NA(4) | NA(7) | NA(7) | NA(7) | 66 | NA(3) | 35 | 36 | 41 | 31 |
0.1 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(5) | NA(2) | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.05 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(2) | NA(5) | NA(5) | NA+(4) | NA+(4) |
0.03 | NA(3) | NA(3) | NA(4) | NA(3) | NA(4) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.01 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.005 | NA(3) | NA(2) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
With sparse DataFrame:
connect.dat.gz 67557 transactions and 129 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.3 | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.1 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.05 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.03 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.01 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.005 | NA(3) | NA(3) | NA+(3) | NA+(3) |
mushroom.dat.gz 8124 transactions and 119 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | 0.18 | 0.18 | 0.07 | 0.06 |
0.3 | 3.07 | 3.19 | 0.92 | 0.84 |
0.1 | NA(6) | NA(6) | NA+(7) | NA+(8) |
0.05 | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.03 | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.01 | NA(4) | NA(4) | NA+(4) | NA+(4) |
0.005 | NA(4) | NA(4) | NA+(4) | NA+(4) |
pumsb.dat.gz 49046 transactions and 2113 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | NA(4) | NA(4) | NA+(3) | NA+(3) |
0.3 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.1 | NA(3) | NA(3) | NA+(2) | NA+(2) |
0.05 | NA(3) | NA(3) | NA+(2) | NA+(2) |
0.03 | NA(3) | NA(2) | NA+(2) | NA+(2) |
0.01 | NA(2) | NA(2) | NA+(2) | NA+(2) |
0.005 | NA(2) | NA(2) | NA+(2) | NA+(2) |
pumsb_star.dat.gz 49046 transactions and 2088 items
min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
---|---|---|---|---|
0.5 | 2.97 | 3.08 | 1.16 | 1.16 |
0.3 | NA(5) | NA(5) | NA+(5) | NA+(5) |
0.1 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.05 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.03 | NA(3) | NA(3) | NA+(3) | NA+(3) |
0.01 | NA(3) | NA(3) | NA+(2) | NA+(2) |
0.005 | NA(2) | NA(2) | NA+(2) | NA+(2) |
More tests should be run, maybe with kosarak.dat; once the best option is decided, I will remove/squash/rearrange commits and update docstrings. IMHO the best option is the current head.
Okay, I ran some tests with kosarak.dat and `sparse=False`. For small inputs there is a noticeable overhead, which is caused by `np.asfortranarray`. These tests were run with a DataFrame in row-major storage format (`order='C'` in NumPy). In the tables below, an asterisk means that the run was performed with a DataFrame already in column-major storage format, so that `np.asfortranarray` becomes a no-op; this is achieved by adding
`df = pd.DataFrame({col: df[col] for col in df.columns})`
(F) and (T) still refer to the `low_memory=False` and `low_memory=True` arguments, respectively.
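For illustration, here is a minimal sketch of how one might check a dense DataFrame's underlying storage order and rebuild it column by column. This is not part of the PR; the input frame is hypothetical, and whether the flags print `True` depends on the pandas version and on how the frame was constructed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((1000, 50), dtype=bool))  # hypothetical row-major input

# Inspect the memory layout of the underlying values.
print(df.values.flags['C_CONTIGUOUS'], df.values.flags['F_CONTIGUOUS'])

# Rebuilding from a dict of columns typically yields a column-major block,
# so np.asfortranarray(df.values) becomes (close to) a no-op.
df_col = pd.DataFrame({col: df[col] for col in df.columns})
print(df_col.values.flags['F_CONTIGUOUS'])
```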
First 1k lines of kosarak.dat.gz: 1000 transactions and 3259 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
0.3 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
0.1 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
0.05 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
0.03 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
0.01 | 0.11 | 0.04 | 0.06 | 0.10 | 0.07 | 0.07 |
0.005 | 3.58 | 0.39 | 1.48 | 0.68 | 0.83 | 0.89 |
First 10k lines of kosarak.dat.gz: 10000 transactions and 10094 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | 0.10 | 0.15 | 0.10 | 0.22 | 0.33 | 0.09 |
0.3 | 0.10 | 0.17 | 0.10 | 0.47 | 0.33 | 0.09 |
0.1 | 0.10 | 0.19 | 0.10 | 0.70 | 0.33 | 0.09 |
0.05 | 0.11 | 0.20 | 0.11 | 1.24 | 0.33 | 0.09 |
0.03 | 0.13 | 0.22 | 0.11 | 1.77 | 0.33 | 0.09 |
0.01 | 0.69 | 0.30 | 0.24 | 4.93 | 0.37 | 0.13 |
0.005 | 8.43 | 0.54 | 2.16 | 11.4 | 0.61 | 0.40 |
First 50k lines of kosarak.dat.gz: 50000 transactions and 18936 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | 0.78 | 1.00 | 1.17 | 1.90 | 2.70 | 0.86 |
0.3 | 0.78 | 1.20 | 0.74 | 4.23 | 2.71 | 0.86 |
0.1 | 0.78 | 1.37 | 1.17 | 6.35 | 2.71 | 0.86 |
0.05 | 0.82 | 1.70 | 0.75 | 11.7 | 2.71 | 0.87 |
0.03 | 0.90 | 1.92 | 1.21 | 16.9 | 2.71 | 0.87 |
0.01 | 5.02 | 2.95 | 1.51 | 46.4 | 2.77 | 0.92 |
0.005 | 51.3 | 4.96 | 12.8 | 107 | 3.12 | 1.30 |
First 100k lines of kosarak.dat.gz: 100000 transactions and 23496 items, sparse=False
min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
---|---|---|---|---|---|---|
0.5 | - | 2.83 | - | 4.79 | 7.51 | 2.89 |
0.3 | - | 3.02 | - | 10.7 | 6.43 | 2.89 |
0.1 | - | 3.41 | - | 16.1 | 6.43 | 2.90 |
0.05 | - | 4.44 | - | 29.9 | 6.43 | 2.90 |
0.03 | - | 4.75 | - | 43.0 | 6.43 | 2.90 |
0.01 | - | 7.46 | - | 116 | 6.50 | 2.97 |
0.005 | - | 13.8 | - | NA(2) | 7.03 | 3.52 |
With all these results, it is not clear to me whether 1fabe25 is such a good idea; an alternative is to make users aware of the importance of storage order, so that they can check which layout is best adapted to their use case.
Sorry, I have been busy with grading and was then traveling over Xmas. The work in this PR is amazing, thanks a lot! Regarding the row vs column format, as you suggest, maybe it would be better to document this prominently in the docstring and have users decide which format they provide via the DataFrame.
I just saw the unit tests failing. I don't know why this happens on AppVeyor, but it seems that switching the installations from pip to conda helped. Maybe something got changed in the backend.
Regarding Travis CI, some discrepancies occurred after things got switched to scikit-learn 0.22 in the "latest" version. I addressed this now as well; this was done in #652. I can do a rebase of this PR if you like, or you can do it yourself if you prefer -- I wanted to ask before I started messing around here ;)
I just rebased but won't be able to take care of failures for the next 24 hours, feel free to push fixes.
All good. Based on the error logs, these are "natural" ones due to WIP and not issues with the CI.
There were indeed some bugs; because of these fixes, timings may be slightly different, so I will rerun the benchmarks in a few days.
What do you think about adding a/the benchmark script(s) to `mlxtend/frequent_patterns/tests/benchmark.py` (not `test_benchmark.py`, so it doesn't get executed by default when running pytest)?
It may be useful to have this as a reference in case of future modifications to the codebase.
*sry, not sure why this PR was closed. I must have hit some keyboard combination -- never happened before. This was certainly not intentional.
I rearranged the commits, they look good now IMHO. About the benchmark script, I do not know how to do that; there are many parameters: data files, sparse=True/False, column_major=True/False, and the list of min_support values (which may depend on the data file). Anyway, it has been committed.
Should the data files be copied into `mlxtend/data/data` with a Python function to load them? Here are their sizes:
chess.dat.gz 14K
connect.dat.gz 362K
kosarak-1k.dat.gz 14K
kosarak-10k.dat.gz 126K
kosarak-50k.dat.gz 616K
kosarak-100k.dat.gz 1.3M
kosarak-200k.dat.gz 2.5M
kosarak.dat.gz 13M
mushroom.dat.gz 34K
pumsb.dat.gz 1.3M
pumsb_star.dat.gz 1.2M
T10I4D100K.dat.gz 1.4M
T40I10D100K.dat.gz 4.8M
Here are more benchmarks. In these tables, `s=` refers to the `sparse` variable, `c=` to `col_major`, and `T`/`F` stand for `True`/`False`.
T10I4D100K.dat.gz 100000 transactions and 870 items, low_memory=True
min_support | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
---|---|---|---|---|---|---|
master (s=F,c=T) | 0.26 | 0.92 | 4.0 | 7.2 | 14.5 | 37.5 |
master (s=T,c=T) | 0.06 | 0.24 | 3.9 | 8.9 | 29.3 | NA(3) |
eb80667(s=F,c=T) | 0.01 | 0.05 | 2.0 | 4.8 | 7.3 | 11.8 |
eb80667(s=T,c=T) | 0.05 | 0.44 | 11.3 | 23.5 | 33.5 | 57.2 |
T40I10D100K.dat.gz 100000 transactions and 942 items, low_memory=True
Note that the timeout was increased to 300 s when running commit eb80667, which explains values above 120 s.
min_support | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
---|---|---|---|---|---|---|---|
master (s=F,c=T) | 2.0 | 7.2 | 15.0 | NA(2) | NA(2) | NA(2) | NA(2) |
master (s=T,c=T) | 0.67 | 5.4 | 13.3 | NA(2) | NA(2) | NA(2) | NA(2) |
eb80667(s=F,c=T) | 0.01 | 1.3 | 3.5 | 20.7 | 176 | NA(6) | NA(2) |
eb80667(s=T,c=T) | 1.32 | 14.0 | 32.9 | 232 | NA(2) | NA(2) | NA(2) |
kosarak-*k.dat.gz, low_memory=True
min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
---|---|---|---|---|---|---|---|---|---|
kosarak-1k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.005 | 0.006 | 0.007 | 0.008 | 0.01 | 0.27 | 0.31 | NA(11) | NA(2) |
eb80667(s=F,c=F) | 0.009 | 0.009 | 0.01 | 0.01 | 0.01 | 0.05 | 0.60 | NA(9) | NA(2) |
eb80667(s=T,c=T) | 0.08 | 0.08 | 0.09 | 0.09 | 0.09 | 0.14 | 0.92 | NA(8) | NA(2) |
kosarak-10k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.04 | 0.14 | 0.58 | NA(5) |
eb80667(s=F,c=F) | 0.26 | 0.26 | 0.26 | 0.26 | 0.27 | 0.36 | 1.02 | 4.4 | NA(4) |
eb80667(s=T,c=T) | 0.26 | 0.26 | 0.26 | 0.27 | 0.28 | 0.36 | 0.80 | 2.4 | NA(4) |
kosarak-50k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.12 | 0.33 | 1.06 | 61.1 |
eb80667(s=F,c=F) | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.6 | 6.7 | 20.1 | NA(1) |
eb80667(s=T,c=T) | 0.49 | 0.49 | 0.50 | 0.51 | 0.53 | 0.78 | 2.1 | 5.9 | NA(5) |
kosarak-100k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 | 0.24 | 0.60 | 1.85 | 67.9 |
eb80667(s=F,c=F) | 4.9 | 4.9 | 4.9 | 5.0 | 5.1 | 6.4 | 15.7 | - | - |
eb80667(s=T,c=T) | 0.62 | 0.62 | 0.62 | 0.65 | 0.70 | 1.15 | 3.6 | 10.2 | NA(9) |
kosarak-200k.dat.gz | |||||||||
eb80667(s=F,c=T) | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.60 | 1.3 | 3.8 | 123 |
eb80667(s=T,c=T) | 0.80 | 0.80 | 0.80 | 0.85 | 0.92 | 1.83 | 6.3 | 18.8 | NA(5) |
Some remarks:
- Processing times are much higher with sparse DataFrames. I did not really investigate this issue; memory usage for the dense array is now very low (except for the array itself), so using a sparse DataFrame may get deprecated. Also note that TransactionEncoder uses a CSR format, which is logical, but it has to be converted to CSC for apriori, and this conversion takes time and memory (a small sketch of this conversion follows this list).
- The implementation does not use hash trees. The reason is that the current API expects a DataFrame, and with hash trees it seems more logical to work directly on a list of transactions.
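As a stand-alone illustration of the CSR-to-CSC point above (not the PR's code; `te_csr` is a hypothetical stand-in for the sparse output of TransactionEncoder):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical stand-in for TransactionEncoder(sparse=True) output, CSR format.
te_csr = sp.random(1000, 50, density=0.1, format="csr", dtype=np.float64)

# apriori counts supports column by column, so CSC is the natural layout;
# the conversion copies the index and data arrays, which costs time and memory.
te_csc = te_csr.tocsc()
support = np.asarray(te_csc.sum(axis=0)).ravel() / te_csc.shape[0]
print(support[:5])
```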
Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submission deadlines in mid-January.
In any case, I am really thankful for all the good work you put into this. This is really awesome. And I can take care of the automatic data downloads from here.
Regarding

> but it has to be converted to CSC for apriori, and this conversion takes time and memory.

Maybe that's something we could add to the apriori docs, i.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then.)
> Processing times are much higher with sparse DataFrames. I did not really investigate this issue; memory usage for the dense array is now very low (except for the array itself), so using a sparse DataFrame may get deprecated.

I am not an expert at this by any means, but I think they are usually only memory efficient but not necessarily efficient when it comes to processing times.
> Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submission deadlines in mid-January.

No worries, this issue is not trivial and requires careful thinking; please take your time.
Other apriori implementations take a list of transactions as input; several optimizations can then be performed:
- use a sparse representation of this list of transactions (the Apriori paper suggests hash trees, some other authors prefer prefix trees), which speeds up itemset counts (supports)
- remove transactions during processing when they can no longer contain frequent itemsets, for instance when their number of items is less than `next_max_itemset` (see the sketch after this list)
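To make the second optimization concrete, a tiny sketch with a hypothetical `transactions` list (this kind of pruning is not something the current DataFrame-based API performs):

```python
# Transactions shorter than the next candidate size can never contain a
# frequent itemset of that size, so they can be dropped from the scan.
next_max_itemset = 4
transactions = [(1, 2), (1, 2, 3, 5), (2, 3, 4, 6, 7)]
transactions = [t for t in transactions if len(t) >= next_max_itemset]
print(transactions)  # [(1, 2, 3, 5), (2, 3, 4, 6, 7)]
```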
Here it is very likely that the user loaded her dataset as a list of transactions and called `TransactionEncoder` to convert it into a pandas DataFrame in order to call the `apriori` function, so it does not make sense to me to let `apriori` internally convert its argument back into a list of transactions. IMHO the first design decision to consider is: do you want to keep the current input as a pandas DataFrame, or can it be changed to a list of transactions?
If the former, the optimizations mentioned above cannot be performed and this PR is almost done. If the latter, a lot more work is needed.
> [...] but it has to be converted to CSC for apriori, and this conversion takes time and memory.
>
> Maybe that's something we could add to the apriori docs. I.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then.) [...]
Sorry, I do not understand your point; a dense pandas DataFrame can use either row-major or column-major storage, and it looks like this depends on which DataFrame constructor was called (2-d array vs dict). We could indeed add a cell showing how to convert the input DataFrame to speed up the apriori function.
But as far as I can tell, this is different for a sparse DataFrame: the user has no control over the internal storage, and all we can do is call `df.to_coo().tocsc()`. If the input argument could be passed as a NumPy array (or a SciPy sparse matrix), that would be different.
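For completeness, a sketch of that conversion with the sparse accessor available in pandas >= 0.25 (older pandas used `SparseDataFrame.to_coo()` as written above); the input matrix here is only an example, not TransactionEncoder output:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Build a DataFrame with sparse columns from an example CSR matrix.
sp_df = pd.DataFrame.sparse.from_spmatrix(csr_matrix(np.eye(4, dtype=bool)))

# The user cannot choose the internal layout, so getting CSC always costs a copy.
csc = sp_df.sparse.to_coo().tocsc()
print(csc.shape)
```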
Sorry, still haven't had time to look into this more. Along with the new semester (lots of teaching) & 2 paper deadlines in January, there wasn't time for much else recently. I am currently making a 0.17.1 bugfix release with the recent changes -- someone from industry contacted me about this, because due to company firewalls, several people can only install it from PyPI (not from GitHub directly). Will revisit this PR soon though -- thanks for all the work on it so far!