Kaggle notebooks analysis

Open • datapythonista opened this issue 5 years ago • 1 comment

This is a first version of the analysis of pandas usage in Kaggle notebooks.

We've fetched Python notebooks from Kaggle and run them using record_api to count the calls to the main objects of the pandas API. A total of 895 notebooks could be analyzed.

A separate column adds information about page views in the pandas documentation. The page views are normalized to a maximum of 1,000 (so the most viewed page in the pandas documentation would have a value of 1,000 in the column).
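As an illustration of that normalization, here is a minimal sketch. The function name, the rounding to whole numbers, and the raw view counts are assumptions made for the example; they are not taken from the analysis code.

```python
def normalize_views(raw_views, scale=1000):
    """Rescale raw page-view counts so the most viewed page maps to `scale`."""
    top = max(raw_views.values(), default=0)
    if top == 0:
        # No views at all: avoid dividing by zero and report zeros.
        return {page: 0 for page in raw_views}
    return {page: round(views * scale / top) for page, views in raw_views.items()}


# Made-up raw counts, only to show the effect of the rescaling.
raw = {"page_a": 50_000, "page_b": 43_500, "page_c": 1_500}
normalized = normalize_views(raw)
print(normalized)
```

With this scheme the values are only comparable to each other, not to the Kaggle call counts in the neighbouring column.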

For simplicity, only the attributes of DataFrame, Series and the pandas top-level module are included, and attributes with the same name have been merged. So, pandas.sum(), Series.sum() and DataFrame.sum() would appear in the list as simply sum.
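A rough sketch of this kind of merge is shown below. The fully qualified name format, the helper name and the per-object counts are hypothetical and chosen only for illustration; the real record_api output may be structured differently.

```python
from collections import Counter

# Owners whose attributes are kept; everything else is ignored.
INCLUDED_OWNERS = {"pandas", "pandas.DataFrame", "pandas.Series"}


def merge_by_attribute(calls):
    """Sum call counts per attribute name across the included owners."""
    merged = Counter()
    for qualified_name, count in calls.items():
        owner, _, attribute = qualified_name.rpartition(".")
        if owner in INCLUDED_OWNERS:
            merged[attribute] += count
    return merged


# Hypothetical per-object counts (the split between objects is invented).
calls = {
    "pandas.DataFrame.sum": 600,
    "pandas.Series.sum": 298,
    "pandas.DataFrame.head": 1500,
    "pandas.Series.head": 75,
    "pandas.Index.sum": 5,  # not one of the included owners, so it is skipped
}
merged = merge_by_attribute(calls)
print(merged.most_common())
```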

The sections are only meant to make the document easier to read; they are not an "official" categorization of the API. Feedback is welcome if something feels misplaced.

The source code to generate the table is available at this repo.

Top 25 called methods

Notes:

Operators (e.g. __add__) are merged with their equivalent method (e.g. add)
__getitem__ is used both to access a column (df[col]) and to filter rows (df[condition])
Accessing a column is also possible via __getattr__ (e.g. df.col_name), but this has not been captured
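To make the notes above concrete, the following small example shows the access patterns they refer to. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20, 30], "qty": [1, 2, 3]})

# Operator form and method form compute the same thing, which is why
# __add__ and add are counted together.
total_operator = df["price"] + df["qty"]
total_method = df["price"].add(df["qty"])

# __getitem__ with a column label returns that column...
column = df["price"]
# ...while __getitem__ with a boolean condition filters rows.
filtered = df[df["price"] > 15]

# Attribute-style access returns the same column, but goes through
# __getattr__ rather than __getitem__.
same_column = df.price

print(total_operator.equals(total_method))
print(len(filtered))
print(column.equals(same_column))
```

Note that a single line such as `df[df["price"] > 15]` already involves two `__getitem__` calls and one comparison, which may partly explain why `__getitem__` dominates the counts.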

Object	Kaggle calls
`__getitem__`	143992
`__setitem__`	40059
`eq`	3018
`mul`	2799
`add`	2768
`groupby`	2267
`loc`	1667
`drop`	1618
`fillna`	1609
`columns`	1583
`head`	1575
`truediv`	1442
`shape`	1267
`sub`	1144
`isnull`	1057
`sort_values`	1015
`and`	957
`values`	953
`sum`	898
`astype`	728
`value_counts`	706
`index`	664
`gt`	622
`apply`	538
`to_frame`	479

Main items by category

Data summary and info

Object	Kaggle calls	Docs views
`info`	275	22
`empty`	0	32
`describe`	303	146
`value_counts`	706	161
`dtypes`	175	64
`memory_usage`	83	2
`ndim`	0	1
`shape`	1267	17
`size`	3	45
`values`	953	113
`attrs`	0	0
`array`	0	0
`unique`	193	106
`dtype`	149	8
`nbytes`	0	0

Indexing

Object	Kaggle calls	Docs views
`__getitem__`	143992	0
`__setitem__`	40059	0
`axes`	0	4
`columns`	1583	31
`set_index`	72	278
`swapaxes`	0	0
`select_dtypes`	180	36
`lookup`	0	11
`xs`	5	16
`loc`	1667	232
`iloc`	427	122
`index`	664	164
`reindex`	11	136
`reindex_like`	0	2
`reset_index`	305	279
`add_prefix`	16	6
`add_suffix`	0	3
`get`	0	16
`iat`	1	17
`keys`	13	16
`at`	4	40
`filter`	3	170
`rename`	401	355
`rename_axis`	0	13
`idxmax`	7	49
`idxmin`	0	10
`droplevel`	0	0
`truncate`	0	7
`swaplevel`	0	7
`take`	0	5
`reorder_levels`	0	5
`sort_index`	32	90
`set_axis`	0	1
`pop`	14	9
`searchsorted`	0	3
`name`	113	13
`item`	0	3
`argmax`	0	2
`argmin`	0	1
`argsort`	0	3

Filter, select, sort

Object	Kaggle calls	Docs views
`nlargest`	25	17
`nsmallest`	1	8
`head`	1575	108
`tail`	60	12
`drop_duplicates`	20	194
`sort_values`	1015	457
`sample`	63	102
`query`	12	69

Operators

Object	Kaggle calls	Docs views
`add`	2768	104
`div`	2	10
`dot`	0	9
`eq`	3018	1
`equals`	0	35
`floordiv`	3	0
`ge`	68	1
`gt`	622	1
`le`	197	0
`lt`	8	0
`mod`	11	1
`mul`	2799	4
`ne`	163	1
`pow`	29	2
`product`	0	3
`radd`	0	6
`rdiv`	0	0
`rfloordiv`	0	0
`rmod`	0	0
`rmul`	0	2
`rpow`	0	0
`rsub`	0	2
`rtruediv`	0	2
`sub`	1144	7
`truediv`	1442	0

Missing values

Object	Kaggle calls	Docs views
`isnull`	1057	90
`notnull`	60	40
`dropna`	193	346
`fillna`	1609	248
`interpolate`	3	39
`isna`	108	27
`notna`	5	11
`hasnans`	0	0

Map

Object	Kaggle calls	Docs views
`cut`	59	84
`eval`	0	12
`corrwith`	1	11
`applymap`	2	49
`astype`	728	234
`rank`	2	34
`clip`	4	13
`where`	10	105
`mask`	14	25
`combine`	0	12
`combine_first`	0	11
`isin`	86	138
`abs`	25	12
`replace`	463	216
`apply`	538	379
`round`	14	68
`transform`	10	39
`factorize`	3	15
`map`	420	91
`between`	1	12

Reduce

Object	Kaggle calls	Docs views
`cov`	0	9
`quantile`	47	78
`var`	4	11
`skew`	88	5
`std`	140	39
`sum`	898	114
`kurt`	60	1
`kurtosis`	23	3
`count`	109	107
`max`	131	70
`mean`	390	107
`median`	228	21
`min`	107	26
`mode`	205	18
`prod`	1	1
`nunique`	15	27
`all`	9	16
`any`	87	22
`mad`	3	2
`sem`	0	2
`corr`	239	105
`is_monotonic`	0	0
`is_monotonic_decreasing`	0	0
`is_monotonic_increasing`	0	0
`is_unique`	0	1
`autocorr`	0	7

Misc

Object	Kaggle calls	Docs views
`iterrows`	39	102
`style`	84	76
`itertuples`	0	36
`bool`	0	5
`squeeze`	0	2
`update`	8	56
`pipe`	3	7
`__iter__`	0	1
`items`	1	6
`iteritems`	3	37
`view`	0	0

Reshape / Join / Concat...

Object	Kaggle calls	Docs views
`get_dummies`	258	152
`crosstab`	58	40
`concat`	432	315
`merge_asof`	0	16
`merge_ordered`	0	4
`wide_to_long`	0	7
`pivot`	29	95
`pivot_table`	54	144
`join`	159	225
`melt`	18	75
`stack`	0	36
`transpose`	9	76
`assign`	19	74
`insert`	17	57
`merge`	425	413
`drop`	1618	625
`explode`	0	0
`align`	3	10
`append`	439	515
`T`	55	6
`unstack`	17	58
`repeat`	0	5
`ravel`	0	5

Group

Object	Kaggle calls	Docs views
`agg`	0	16
`aggregate`	3	58
`groupby`	2267	719

Window

Object	Kaggle calls	Docs views
`cummax`	0	2
`cummin`	0	0
`cumprod`	0	5
`cumsum`	8	29
`pct_change`	0	34
`rolling`	42	140
`ewm`	0	33
`expanding`	0	11
`duplicated`	14	90
`diff`	1	54

Jul 20 '20 13:07 datapythonista

Thanks for doing this analysis!

A few entries in the table show 0 recorded usage, but there may actually be some calls to them. For example, the month property of the DatetimeProperties class has some hits: https://github.com/pydata-apis/dataframe-tools/blob/master/kaggle/results/record_api_results_infer_api.json#L12914-L12924.
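A short example of why such calls can be missed: attributes like month are reached through the `.dt` accessor object, so they are not attributes of Series itself. The sample dates are made up.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-07-20", "2020-12-01"]))

# The .dt accessor returns a separate object; month is looked up on it,
# not on the Series.
accessor = dates.dt
months = accessor.month

print(type(accessor).__name__)
print(months.tolist())
print(hasattr(pd.Series, "month"))
```

A count restricted to DataFrame, Series and the top-level module would therefore not attribute this call to any of the three.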

You can also set the PYTHON_RECORD_API_LABEL env variable to something other than pandas, like the notebook name or some unique ID, so that you can see how many calls came from each notebook.
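One possible way to do this, sketched under assumptions: the helper below builds a per-notebook environment with the label set to the notebook's file name. The helper name and the notebook paths are hypothetical, and the command that actually runs a notebook under record_api is deliberately left out.

```python
import os
from pathlib import Path


def env_for_notebook(notebook_path, base_env=None):
    """Return a copy of the environment with a per-notebook label set."""
    env = dict(os.environ if base_env is None else base_env)
    # Use the file name without extension as the label (an assumption;
    # any unique ID would work as well).
    env["PYTHON_RECORD_API_LABEL"] = Path(notebook_path).stem
    return env


# Hypothetical notebook paths, for illustration only.
for notebook in ["notebooks/titanic-eda.ipynb", "notebooks/house-prices.ipynb"]:
    env = env_for_notebook(notebook)
    # `env` can then be passed as subprocess.run(..., env=env) to whatever
    # command executes the notebook with record_api enabled.
    print(notebook, "->", env["PYTHON_RECORD_API_LABEL"])
```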

Jul 20 '20 19:07 saulshanabrook