mlxtend
refactor: replace pandas apply with more efficient means (#655)
Description
Goal: Replace pandas' .apply wherever possible.
What I did:
- Checked all functions that use .apply. Although there is no place where np.where/np.select can be leveraged, NumPy's vectorize can speed up these functions greatly.
- Improved the runtime performance of the apriori function and its tests by replacing pandas' .apply with NumPy's vectorize, as the data below show that vectorize is faster than apply for the current use.
- Improved the efficiency of the generate_itemsets function by replacing Python lists with NumPy arrays and iterative division with array division (an illustrative sketch of this pattern follows the list).
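Illustrative sketch of the array-division pattern, using made-up variable names rather than the actual generate_itemsets internals:
import numpy as np
# Hypothetical itemset counts and total number of transactions (for illustration only).
itemset_counts = [3, 5, 4, 3, 2]
num_transactions = 5
# Before: Python list built element by element with iterative division.
support_old = []
for count in itemset_counts:
    support_old.append(count / num_transactions)
# After: keep the counts in a NumPy array and divide the whole array at once.
support_new = np.asarray(itemset_counts, dtype=float) / num_transactions
assert np.allclose(support_old, support_new)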
Related issues or pull requests
Refactor #655
Pull Request Checklist
- [x] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
- [x] Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
- [x] Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
- [x] Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
- [x] Checked for style issues by running flake8 ./mlxtend
Performance comparison
Data Preparation
import pandas as pd
import numpy as np
frozenset_array = np.array([frozenset([3]), frozenset([10]), frozenset([8]), frozenset([6]), frozenset([3,5]),
frozenset([10,5]), frozenset([8,3]), frozenset([8,5]), frozenset([8,3,5]), frozenset([5,6])])
# Optionally scale up the benchmark data: frozenset_array = np.repeat(frozenset_array, 500)
df = pd.DataFrame(frozenset_array, columns=['col1'])
len
# current
%timeit -n 100 np.max(df.col1.apply(len))
322 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(len)(df.col1).max()
73.6 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
set
# current
%timeit -n 100 df['col1'].apply(lambda x: set(x))
229 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(set)(df.col1)
65.9 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
frozenset to str
# current
%timeit -n 100 df['col1'].apply(lambda x: str(x))
243 µs ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(str)(df.col1)
85.3 µs ± 8.76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
colname map
colname_map = {0: 'Apple', 1: 'Corn', 2: 'Dill', 3: 'Eggs', 4: 'Ice cream', 5: 'Kidney Beans', 6: 'Milk', 7: 'Nutmeg', 8: 'Onion', 9: 'Unicorn', 10: 'Yogurt'}
# current
%timeit -n 100 df['col1'].apply(lambda x: frozenset([colname_map[i] for i in x]))
225 µs ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 [frozenset(a) for a in np.vectorize(map)(colname_map.get, df['col1'])]
75.1 µs ± 6.93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
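As an added sanity check (not part of the original benchmark), the two variants can be compared to confirm they produce the same mapped itemsets:
# Illustrative check: both approaches should yield the same frozensets of column names.
mapped_apply = df['col1'].apply(lambda x: frozenset([colname_map[i] for i in x]))
mapped_vectorize = [frozenset(a) for a in np.vectorize(map)(colname_map.get, df['col1'])]
assert list(mapped_apply) == mapped_vectorize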
Cast each element to a stringified frozenset
# copy the data below from tests
cols = ['antecedents', 'consequents', 'antecedent support', 'consequent support', 'support', 'confidence', 'lift', 'leverage', 'conviction']
expect = pd.DataFrame([
[(8,), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
[(6,), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
[(8, 3), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
[(8, 5), (3,), 0.6, 0.8, 0.6, 1.0, 1.25, 0.12, np.inf],
[(8,), (3, 5), 0.6, 0.8, 0.6, 1.0, 1.25, 0.12, np.inf],
[(3,), (5,), 0.8, 1.0, 0.8, 1.0, 1.0, 0.0, np.inf],
[(5,), (3,), 1.0, 0.8, 0.8, 0.8, 1.0, 0.0, 1.0],
[(10,), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
[(8,), (3,), 0.6, 0.8, 0.6, 1.0, 1.25, 0.12, np.inf]],
columns=cols
)
# current
%timeit -n 100 expect['antecedents'].apply(lambda x: str(frozenset(x)))
%timeit -n 100 expect['consequents'].apply(lambda x: str(frozenset(x)))
245 µs ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
214 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(str)(np.vectorize(frozenset)(expect['antecedents']))
%timeit -n 100 np.vectorize(str)(np.vectorize(frozenset)(expect['consequents']))
103 µs ± 6.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
86.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hello @keyanyang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2020-05-01 01:49:14 UTC
Thanks a lot for your PR! Just to get an idea of how much the efficiency has improved, I ran the code against some benchmark datasets from the website http://fimi.uantwerpen.be/data/ that @dbarbier shared via #646 (a related PR that I really want to revisit at some point; sorry for the delay @dbarbier, it has been a very stressful semester).
I ran it as apriori(df, min_support=0.6) on pumsb and as apriori(df, min_support=0.1) on connect. The first number is the original implementation, and the second number is the one submitted in this PR:
- chess: 40.9 sec vs 41.9 sec
- pumsb (first 50 rows and first 50 columns): 28.1 vs 27.4
The code snippets for loading the unzipped datasets are
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
df = pd.read_csv('chess.dat.txt', sep=r'\s+', header=None)
te = TransactionEncoder()
te_ary = te.fit(df.values).transform(df.values)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()
and
df = pd.read_csv('pumsb.dat.txt', sep=r'\s+', header=None)
df = df.iloc[:20, :20]
te = TransactionEncoder()
te_ary = te.fit(df.values).transform(df.values)
df = pd.DataFrame(te_ary, columns=te.columns_)
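For reference, a sketch of how the timing could be reproduced in an IPython session; the exact benchmark invocation is not shown above, so this is an assumption based on the min_support values mentioned in the comment:
from mlxtend.frequent_patterns import apriori
# Time one configuration from the comparison above (illustrative only).
%timeit -n 1 -r 3 apriori(df, min_support=0.6)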
It looks like the runtime wasn't really affected by the change from apply to vectorize + map -- I guess that's because the bottleneck lies more in the combination-generation section.
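One way to check that hypothesis (not done in this thread) would be to profile a single run in IPython, for example:
from mlxtend.frequent_patterns import apriori
# Show where the time is spent; if the guess above is right, most of it should be
# in generating candidate combinations rather than in the apply/vectorize calls.
%prun apriori(df, min_support=0.6)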
I propose to leave this PR open for now and apply these changes to #646 later, which is a big overhaul of the apriori implementation by @dbarbier -- I am hoping to find some hours of uninterrupted time to take a closer look at #646 this summer. In the meantime, if you are interested, it would be great to get some additional feedback (and another pair of eyes) on #646 :)