Higher performance dataframe serialization
Serialization of dataframes is currently very slow, particularly on the Python side. Dataframes are serialized as a list of dicts, with odd, slow handling of NaNs.
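To make the row-oriented layout concrete, here is a minimal standard-library sketch of building a list of dicts with per-value NaN replacement. This is not Buckaroo's actual code: `columns_to_records` and the sample data are hypothetical, and a plain dict of lists stands in for a dataframe.

```python
import json
import math


def columns_to_records(columns):
    """Turn {"col": [values, ...]} into a list of row dicts, mapping NaN to None."""
    names = list(columns)
    records = []
    for row in zip(*(columns[name] for name in names)):
        record = {}
        for name, value in zip(names, row):
            # Float NaN is not valid JSON, so each value is checked individually.
            if isinstance(value, float) and math.isnan(value):
                value = None
            record[name] = value
        records.append(record)
    return records


# Hypothetical sample data standing in for a small dataframe.
columns = {"a": [1.0, float("nan"), 3.0], "b": ["x", "y", "z"]}
records = columns_to_records(columns)
# allow_nan=False makes json.dumps raise if any NaN slipped through.
payload = json.dumps(records, allow_nan=False)
```

Every value passes through a Python-level check and every row repeats every column name, which is one plausible reason this layout has a high fixed cost per row.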
By default Buckaroo downsamples so that only 10k rows (with unlimited columns) are serialized and sent to the frontend. As a baseline, for a dataframe that was 1.17 GB in memory (5,000,000 rows total), computing stats on and displaying the first 10k rows (no downsampling, 0.2% of the total df) took 460ms. Computing stats on and displaying the first 500k rows took 891ms, and the whole 5M rows took 5 seconds. Note that in all of these cases only 10k rows are serialized. From this we can tell that summary stats are generally fast, and that serialization carries a high constant cost.
For comparison:
df[:10_000].to_numpy() -> 4ms
df.to_numpy() -> 4 seconds
df[:10_000].to_csv() -> 42ms
df.to_csv() -> lost patience
df.to_parquet('foo.parq') -> 1.6 seconds
Off the top of my head, JS sorting in ag-grid becomes slow (over 1 second) at around 300k rows.
How to speed it up?
- remove the `json.loads(df.to_json(...))` step. Build the same dict object layout in memory and let ipywidgets convert that back into JSON for comms with the frontend. This step avoids some type conversion errors. 1.5x improvement off the top of my head.
- make json a string property of the widget, call `JSON.parse` in the frontend
- move to polars for `df.to_json`, off the top of my head this is 2-4x faster than pandas for the same serialization
- figure out base64 serialization, based on ES6 typed arrays. Probably the fastest
- Investigate Arrow-js for binary serialization? Downside is package size
- polars-js? Downside is package size
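The base64 idea from the list above could look roughly like the following on the Python side. This is a sketch under assumptions, not a settled design: the function names and the message fields (`dtype`, `length`, `data`) are hypothetical, and it assumes the frontend decodes the base64 string into an `ArrayBuffer` and wraps it in a `Float64Array` on a little-endian platform.

```python
import base64
import json
import struct


def encode_float64_column(values):
    """Pack floats as little-endian float64 bytes, then base64-encode them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_float64_column(encoded):
    """Inverse of encode_float64_column, used here only to check the round trip."""
    raw = base64.b64decode(encoded)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical column; NaN needs no special handling in a binary float64 layout.
column = [1.5, float("nan"), -2.0]
encoded = encode_float64_column(column)

# Hypothetical message shape: one entry like this per numeric column.
message = json.dumps({"dtype": "float64", "length": len(column), "data": encoded})
decoded = decode_float64_column(json.loads(message)["data"])
```

With a NumPy-backed column, `arr.tobytes()` would replace the `struct.pack` call and avoid the per-value Python loop. String and object columns would still need another path.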
Look at:
- https://arrow.apache.org/docs/js/
- https://github.com/vega/falcon
- https://github.com/pola-rs/nodejs-polars
- https://github.com/uwdata/arquero
- https://github.com/kylebarron/parquet-wasm
parquet-wasm looks best suited (per the author)
https://github.com/kylebarron/parquet-wasm
Look at this bit for ag-grid integration. https://www.ag-grid.com/react-data-grid/infinite-scrolling/
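With infinite scrolling, the grid asks its datasource for one block of rows at a time, passing a `startRow` and `endRow`, and expects the rows back along with a `lastRow` value (the total row count once known, otherwise -1). A minimal sketch of the Python side of such a handler, assuming the rows are already in memory as a list; `get_rows_block` and the response shape are hypothetical, and wiring it to the widget comms is left out:

```python
def get_rows_block(rows, start_row, end_row):
    """Return the rows in [start_row, end_row) plus a lastRow hint for the grid."""
    block = rows[start_row:end_row]
    # Report the total row count only once the requested block reaches the end;
    # -1 tells the grid that more rows may follow.
    last_row = len(rows) if end_row >= len(rows) else -1
    return {"rows": block, "lastRow": last_row}


# Hypothetical data: 250 already-serialized rows, requested in blocks of 100.
rows = [{"index": i, "value": i * 2} for i in range(250)]
first_block = get_rows_block(rows, 0, 100)
final_block = get_rows_block(rows, 200, 300)
```

Only the requested block is serialized per request, so the cost of each response depends on the block size, not on the size of the dataframe.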