Kaggle notebooks analysis
This is a first version of the analysis of pandas usage in Kaggle notebooks.
We fetched Python notebooks from Kaggle and ran them with record_api to count the calls to the main objects of the pandas API. A total of 895 notebooks could be analyzed.
A separate column shows the page views of each item in the pandas documentation. The page views are normalized to a maximum of 1,000 (so the page with the most views in the pandas documentation has a value of 1,000 in the column).
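As a rough illustration of this normalization, here is a minimal sketch. The function name, the input format (a dict of raw view counts per page) and the rounding to integers are assumptions made for the example, not necessarily what the actual script does.

```python
def normalize_views(raw_views, scale=1000):
    """Rescale raw page views so the most viewed page gets `scale`.

    `raw_views` maps a page name to its raw view count (hypothetical format).
    Rounding to the nearest integer is an assumption of this sketch.
    """
    top = max(raw_views.values(), default=0)
    if top == 0:
        # No views at all: nothing to rescale.
        return {page: 0 for page in raw_views}
    return {page: round(views * scale / top) for page, views in raw_views.items()}


# Illustrative numbers only, not taken from the real documentation statistics.
example = normalize_views({"groupby": 200, "drop": 50, "attrs": 0})
print(example)
```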
For simplicity, only the attributes of DataFrame, Series and the pandas top-level module are included, and attributes with the same name have been merged. So, pandas.sum(), Series.sum() and DataFrame.sum() would appear in the list as simply sum.
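The merge by attribute name could look roughly like the sketch below. The `"Owner.attribute"` key format and the helper name are hypothetical, chosen for illustration; the real record_api output is structured differently.

```python
from collections import Counter

# Objects whose attributes are merged, as described above.
MERGED_OWNERS = {"DataFrame", "Series", "pandas"}


def merge_by_attribute(calls_by_qualified_name):
    """Sum call counts per attribute name, ignoring which object owns it.

    Keys are assumed to look like "DataFrame.sum" (hypothetical format).
    Attributes of any other object are skipped.
    """
    merged = Counter()
    for qualified_name, calls in calls_by_qualified_name.items():
        owner, _, attribute = qualified_name.rpartition(".")
        if owner in MERGED_OWNERS:
            merged[attribute] += calls
    return merged


# Illustrative counts only.
counts = merge_by_attribute(
    {"DataFrame.sum": 10, "Series.sum": 5, "pandas.concat": 3, "DatetimeProperties.month": 7}
)
print(counts)
```

A side effect of this simplification is that attributes of other objects (accessors, indexes, groupby objects) do not contribute to the totals.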
The sections are there to make the document easier to read; they are not an "official" categorization of the API. Feedback is welcome if something feels misplaced.
The source code to generate the table is available at this repo.
Top 25 called methods
Notes:
- Operators (e.g. `__add__`) are merged with their equivalent method (e.g. `add`)
- `__getitem__` is used both to access a column (`df[col]`) and to filter (`df[condition]`)
- Accessing a column is also possible via `__getattr__` (e.g. `df.col_name`), but this has not been captured
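To make the first note concrete, here is a small sketch of how operator dunder methods could be folded into their named equivalents. The mapping and function name are assumptions for illustration and only cover a few operators; `__getitem__` and `__setitem__` are deliberately left unmapped, since they are reported under their own names.

```python
from collections import Counter

# Hypothetical, partial mapping from operator dunders to the equivalent method.
OPERATOR_TO_METHOD = {
    "__add__": "add",
    "__sub__": "sub",
    "__mul__": "mul",
    "__truediv__": "truediv",
    "__eq__": "eq",
    "__gt__": "gt",
    "__and__": "and",
}


def merge_operator_counts(raw_counts):
    """Add the calls of each operator dunder to its equivalent method name."""
    merged = Counter()
    for name, calls in raw_counts.items():
        merged[OPERATOR_TO_METHOD.get(name, name)] += calls
    return merged


# Illustrative counts only.
print(merge_operator_counts({"__add__": 2, "add": 3, "__getitem__": 5}))
```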
| Object | Kaggle calls |
|---|---|
| `__getitem__` | 143992 |
| `__setitem__` | 40059 |
| `eq` | 3018 |
| `mul` | 2799 |
| `add` | 2768 |
| `groupby` | 2267 |
| `loc` | 1667 |
| `drop` | 1618 |
| `fillna` | 1609 |
| `columns` | 1583 |
| `head` | 1575 |
| `truediv` | 1442 |
| `shape` | 1267 |
| `sub` | 1144 |
| `isnull` | 1057 |
| `sort_values` | 1015 |
| `and` | 957 |
| `values` | 953 |
| `sum` | 898 |
| `astype` | 728 |
| `value_counts` | 706 |
| `index` | 664 |
| `gt` | 622 |
| `apply` | 538 |
| `to_frame` | 479 |
Main items by category
Data summary and info
| Object | Kaggle calls | Docs views |
|---|---|---|
| `info` | 275 | 22 |
| `empty` | 0 | 32 |
| `describe` | 303 | 146 |
| `value_counts` | 706 | 161 |
| `dtypes` | 175 | 64 |
| `memory_usage` | 83 | 2 |
| `ndim` | 0 | 1 |
| `shape` | 1267 | 17 |
| `size` | 3 | 45 |
| `values` | 953 | 113 |
| `attrs` | 0 | 0 |
| `array` | 0 | 0 |
| `unique` | 193 | 106 |
| `dtype` | 149 | 8 |
| `nbytes` | 0 | 0 |
Indexing
| Object | Kaggle calls | Docs views |
|---|---|---|
| `__getitem__` | 143992 | 0 |
| `__setitem__` | 40059 | 0 |
| `axes` | 0 | 4 |
| `columns` | 1583 | 31 |
| `set_index` | 72 | 278 |
| `swapaxes` | 0 | 0 |
| `select_dtypes` | 180 | 36 |
| `lookup` | 0 | 11 |
| `xs` | 5 | 16 |
| `loc` | 1667 | 232 |
| `iloc` | 427 | 122 |
| `index` | 664 | 164 |
| `reindex` | 11 | 136 |
| `reindex_like` | 0 | 2 |
| `reset_index` | 305 | 279 |
| `add_prefix` | 16 | 6 |
| `add_suffix` | 0 | 3 |
| `get` | 0 | 16 |
| `iat` | 1 | 17 |
| `keys` | 13 | 16 |
| `at` | 4 | 40 |
| `filter` | 3 | 170 |
| `rename` | 401 | 355 |
| `rename_axis` | 0 | 13 |
| `idxmax` | 7 | 49 |
| `idxmin` | 0 | 10 |
| `droplevel` | 0 | 0 |
| `truncate` | 0 | 7 |
| `swaplevel` | 0 | 7 |
| `take` | 0 | 5 |
| `reorder_levels` | 0 | 5 |
| `sort_index` | 32 | 90 |
| `set_axis` | 0 | 1 |
| `pop` | 14 | 9 |
| `searchsorted` | 0 | 3 |
| `name` | 113 | 13 |
| `item` | 0 | 3 |
| `argmax` | 0 | 2 |
| `argmin` | 0 | 1 |
| `argsort` | 0 | 3 |
Filter, select, sort
| Object | Kaggle calls | Docs views |
|---|---|---|
| `nlargest` | 25 | 17 |
| `nsmallest` | 1 | 8 |
| `head` | 1575 | 108 |
| `tail` | 60 | 12 |
| `drop_duplicates` | 20 | 194 |
| `sort_values` | 1015 | 457 |
| `sample` | 63 | 102 |
| `query` | 12 | 69 |
Operators
| Object | Kaggle calls | Docs views |
|---|---|---|
| `add` | 2768 | 104 |
| `div` | 2 | 10 |
| `dot` | 0 | 9 |
| `eq` | 3018 | 1 |
| `equals` | 0 | 35 |
| `floordiv` | 3 | 0 |
| `ge` | 68 | 1 |
| `gt` | 622 | 1 |
| `le` | 197 | 0 |
| `lt` | 8 | 0 |
| `mod` | 11 | 1 |
| `mul` | 2799 | 4 |
| `ne` | 163 | 1 |
| `pow` | 29 | 2 |
| `product` | 0 | 3 |
| `radd` | 0 | 6 |
| `rdiv` | 0 | 0 |
| `rfloordiv` | 0 | 0 |
| `rmod` | 0 | 0 |
| `rmul` | 0 | 2 |
| `rpow` | 0 | 0 |
| `rsub` | 0 | 2 |
| `rtruediv` | 0 | 2 |
| `sub` | 1144 | 7 |
| `truediv` | 1442 | 0 |
Missing values
| Object | Kaggle calls | Docs views |
|---|---|---|
| `isnull` | 1057 | 90 |
| `notnull` | 60 | 40 |
| `dropna` | 193 | 346 |
| `fillna` | 1609 | 248 |
| `interpolate` | 3 | 39 |
| `isna` | 108 | 27 |
| `notna` | 5 | 11 |
| `hasnans` | 0 | 0 |
Map
| Object | Kaggle calls | Docs views |
|---|---|---|
| `cut` | 59 | 84 |
| `eval` | 0 | 12 |
| `corrwith` | 1 | 11 |
| `applymap` | 2 | 49 |
| `astype` | 728 | 234 |
| `rank` | 2 | 34 |
| `clip` | 4 | 13 |
| `where` | 10 | 105 |
| `mask` | 14 | 25 |
| `combine` | 0 | 12 |
| `combine_first` | 0 | 11 |
| `isin` | 86 | 138 |
| `abs` | 25 | 12 |
| `replace` | 463 | 216 |
| `apply` | 538 | 379 |
| `round` | 14 | 68 |
| `transform` | 10 | 39 |
| `factorize` | 3 | 15 |
| `map` | 420 | 91 |
| `between` | 1 | 12 |
Reduce
| Object | Kaggle calls | Docs views |
|---|---|---|
| `cov` | 0 | 9 |
| `quantile` | 47 | 78 |
| `var` | 4 | 11 |
| `skew` | 88 | 5 |
| `std` | 140 | 39 |
| `sum` | 898 | 114 |
| `kurt` | 60 | 1 |
| `kurtosis` | 23 | 3 |
| `count` | 109 | 107 |
| `max` | 131 | 70 |
| `mean` | 390 | 107 |
| `median` | 228 | 21 |
| `min` | 107 | 26 |
| `mode` | 205 | 18 |
| `prod` | 1 | 1 |
| `nunique` | 15 | 27 |
| `all` | 9 | 16 |
| `any` | 87 | 22 |
| `mad` | 3 | 2 |
| `sem` | 0 | 2 |
| `corr` | 239 | 105 |
| `is_monotonic` | 0 | 0 |
| `is_monotonic_decreasing` | 0 | 0 |
| `is_monotonic_increasing` | 0 | 0 |
| `is_unique` | 0 | 1 |
| `cov` | 0 | 9 |
| `autocorr` | 0 | 7 |
| `quantile` | 47 | 78 |
Misc
| Object | Kaggle calls | Docs views |
|---|---|---|
| `iterrows` | 39 | 102 |
| `style` | 84 | 76 |
| `itertuples` | 0 | 36 |
| `bool` | 0 | 5 |
| `squeeze` | 0 | 2 |
| `update` | 8 | 56 |
| `pipe` | 3 | 7 |
| `__iter__` | 0 | 1 |
| `items` | 1 | 6 |
| `iteritems` | 3 | 37 |
| `view` | 0 | 0 |
Reshape / Join / Concat...
| Object | Kaggle calls | Docs views |
|---|---|---|
| `get_dummies` | 258 | 152 |
| `crosstab` | 58 | 40 |
| `concat` | 432 | 315 |
| `merge_asof` | 0 | 16 |
| `merge_ordered` | 0 | 4 |
| `wide_to_long` | 0 | 7 |
| `pivot` | 29 | 95 |
| `pivot_table` | 54 | 144 |
| `join` | 159 | 225 |
| `melt` | 18 | 75 |
| `stack` | 0 | 36 |
| `transpose` | 9 | 76 |
| `assign` | 19 | 74 |
| `insert` | 17 | 57 |
| `merge` | 425 | 413 |
| `drop` | 1618 | 625 |
| `explode` | 0 | 0 |
| `align` | 3 | 10 |
| `append` | 439 | 515 |
| `T` | 55 | 6 |
| `unstack` | 17 | 58 |
| `repeat` | 0 | 5 |
| `ravel` | 0 | 5 |
Group
| Object | Kaggle calls | Docs views |
|---|---|---|
| `agg` | 0 | 16 |
| `aggregate` | 3 | 58 |
| `groupby` | 2267 | 719 |
Window
| Object | Kaggle calls | Docs views |
|---|---|---|
| `cummax` | 0 | 2 |
| `cummin` | 0 | 0 |
| `cumprod` | 0 | 5 |
| `cumsum` | 8 | 29 |
| `pct_change` | 0 | 34 |
| `rolling` | 42 | 140 |
| `ewm` | 0 | 33 |
| `expanding` | 0 | 11 |
| `duplicated` | 14 | 90 |
| `diff` | 1 | 54 |
Thanks for doing this analysis!
There are a few places in the table with 0 as recorded usage, but there might actually be some calls to them. For example, the month property of the DatetimeProperties class has some hits: https://github.com/pydata-apis/dataframe-tools/blob/master/kaggle/results/record_api_results_infer_api.json#L12914-L12924.
You can also set the PYTHON_RECORD_API_LABEL env variable to something other than pandas, such as the notebook name or some unique ID, so that you can see how many calls came from each different notebook.
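One possible way to do the per-notebook labelling is to build a separate environment for each run, as in this sketch. Only the PYTHON_RECORD_API_LABEL variable comes from the comment above; the helper name, the use of the file stem as the label, and the example path are assumptions, and the command that actually runs the notebook is left out on purpose.

```python
import os
from pathlib import Path


def env_for_notebook(notebook_path, base_env=None):
    """Return a copy of the environment with a per-notebook record_api label.

    The label is the notebook file name without its extension (an assumption
    of this sketch; any unique ID would work as well).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHON_RECORD_API_LABEL"] = Path(notebook_path).stem
    return env


# Hypothetical notebook path, for illustration only.
env = env_for_notebook("notebooks/titanic-eda.ipynb")
print(env["PYTHON_RECORD_API_LABEL"])
```

The returned dict can then be passed as the `env` argument of `subprocess.run` when launching whatever command executes the notebook.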