pandas nested data infeasible operations
pandas generally discourages having nested data in a DataFrame or Series. For nested data in pandas, I tend to group the types of nested data as:
- Array-like (N-D)
In [4]: nested_array_like = pd.Series([[1, 2], [2, 3]])
In [5]: nested_array_like
Out[5]:
0 [1, 2]
1 [2, 3]
dtype: object
The only behavior somewhat defined and tested for array-likes (Python lists specifically) is addition, which acts like list concatenation:
In [14]: nested_array_like + nested_array_like
Out[14]:
0 [1, 2, 1, 2]
1 [2, 3, 2, 3]
dtype: object
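Beyond list-on-list addition, arithmetic does not reach inside the lists. A minimal sketch (reconstructing the example series above) showing that scalar addition raises rather than broadcasting into each array:

```python
import pandas as pd

# Reconstructing the example series of Python lists (object dtype)
nested_array_like = pd.Series([[1, 2], [2, 3]])

# "+" between two such Series concatenates the lists pairwise
concatenated = nested_array_like + nested_array_like
print(concatenated[0])  # [1, 2, 1, 2]

# Scalar arithmetic does not broadcast into the lists; it raises instead
err = None
try:
    nested_array_like + 2
except TypeError as exc:
    err = exc
print(type(err).__name__)
```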
And there is explode, which encourages users to un-nest their data:
In [15]: nested_array_like.explode()
Out[15]:
0 1
0 2
1 2
1 3
dtype: object
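One workaround pattern built on explode (a sketch, not an official pandas recipe) is to un-nest, operate on the scalars, and re-nest by grouping on the repeated index labels:

```python
import pandas as pd

nested_array_like = pd.Series([[1, 2], [2, 3]])

flat = nested_array_like.explode()       # index labels repeat per source row
plus_two = flat + 2                      # element-wise ops now work on scalars
renested = plus_two.groupby(level=0).agg(list)
print(renested.tolist())  # [[3, 4], [4, 5]]
```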
Some operations I have seen users try with array-like data:
- groupby the array-like values
- element-wise operations (e.g. add 2 to each element in the array)
- reduction-wise operations per array-like value (e.g. sum each array)
- indexing/selecting/slicing the array-like values
- containment operations (e.g. 2 in each array -> True/False)
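Today each of those requests falls back to Python-level .apply/.map loops; a sketch of each on the example series (the tuple proxy for groupby is my own workaround, since lists are unhashable):

```python
import pandas as pd

nested_array_like = pd.Series([[1, 2], [2, 3]])

# reduction per array
sums = nested_array_like.apply(sum)
# indexing/selecting into each array
firsts = nested_array_like.apply(lambda arr: arr[0])
# containment check per array
has_two = nested_array_like.apply(lambda arr: 2 in arr)
# grouping by the array values needs a hashable proxy (e.g. tuple)
group_sizes = nested_array_like.groupby(nested_array_like.map(tuple)).size()

print(sums.tolist())     # [3, 5]
print(firsts.tolist())   # [1, 2]
print(has_two.tolist())  # [True, True]
```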
- Key-Value-like
In [6]: nested_kv_like = pd.Series([{1:2}, {2:3}])
In [7]: nested_kv_like
Out[7]:
0 {1: 2}
1 {2: 3}
dtype: object
The only behavior supported and tested for dict-like data is dict.get, exposed via the str accessor (which is somewhat strange IMO):
In [13]: nested_kv_like.str.get(1)
Out[13]:
0 2.0
1 NaN
dtype: float64
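Since .str.get on an object Series of dicts has dict.get semantics, it is equivalent to spelling the lookup out with map; a sketch (missing keys become NaN either way):

```python
import pandas as pd

nested_kv_like = pd.Series([{1: 2}, {2: 3}])

via_str = nested_kv_like.str.get(1)               # dict.get via str accessor
via_map = nested_kv_like.map(lambda d: d.get(1))  # same semantics, spelled out

print(via_str.tolist())
```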
I've seen users try many of the same operations described above with key-value-like data, except specifically treating the keys or values as "arrays".
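For example, extracting the keys or values as per-row "arrays" currently means reaching into each dict with apply (a sketch on the example series):

```python
import pandas as pd

nested_kv_like = pd.Series([{1: 2}, {2: 3}])

# keys / values per row, materialized as list "arrays"
keys = nested_kv_like.apply(lambda d: list(d.keys()))
values = nested_kv_like.apply(lambda d: list(d.values()))

print(keys.tolist())    # [[1], [2]]
print(values.tolist())  # [[2], [3]]
```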
These are certainly worthwhile processing models that we will chase. However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?
We are finding that nested/ragged data just doesn't show up much in Python, exactly because no one knows what to do with it - even though it is ubiquitous in the real world. We could probably do something interesting with the likes of https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs , for instance. We have the following specific cases in mind for examples:
- chicago taxis, which have details of the exact ride path as a sequence of lat/lon pairs
- million-songs, which has various analyses of the bars of each song
- NYC building outline polygons
- scientific telemetry from floating buoys
Any other suggestions?
https://pythonspeed.com/articles/json-memory-streaming/ is a smallish example we can compare against directly; it takes 23MB for ak in memory, but with a very complicated typestring.
> However, I was wondering if you knew of specific datasets or workflows that people were choosing not to process with python/pandas because it was too awkward or slow?
Ah, I see. Sorry, I am not too familiar with public-ish datasets/workflows for this case.