
Improve serialization of Pandas DataFrames to ipyvega

Open jdfekete opened this issue 4 years ago • 1 comments

Hi,

The serialization of DataFrames from Python to Vega as JSON is very inefficient, even for smallish datasets. The ipydatawidgets project (https://github.com/vidartf/ipydatawidgets) provides a mechanism to improve the serialization of NumPy arrays, which is already a step forward.

For our project ProgressiVis, we are considering serializing a dataframe as a dictionary of columns (a column-wise representation), where each column can be compressed in Python and decompressed in JS according to its type. At the Vega level, converting a column-wise format to Vega's internal format has already been done for the Arrow format in https://github.com/vega/vega-loader-arrow, so it would not be hard to do the same for a dictionary of columns. For compression, ipydatawidgets uses gzip, but there are other trade-offs to consider, such as lz4.

The implementation is not hard but could take a couple of weeks, and it would be great to be able to reuse it for other dataframe formats if possible (e.g. our progressive tables would use the same serialization format).
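To make the proposal concrete, here is a minimal sketch of the column-wise idea, using only the standard library. The function names (`serialize_columns`, `deserialize_columns`) are hypothetical, and gzip-over-JSON stands in for whatever per-column codec would actually be chosen (e.g. lz4, or a type-specific binary encoding):

```python
# Hedged sketch (hypothetical names): serialize a table as a dict of
# independently compressed columns, as proposed in this issue.
import gzip
import json


def serialize_columns(columns: dict) -> dict:
    """Compress each column's JSON encoding separately, so the JS side
    can decompress columns independently (lazily or in parallel)."""
    return {
        name: gzip.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def deserialize_columns(payload: dict) -> dict:
    """Inverse transform: decompress and decode each column."""
    return {
        name: json.loads(gzip.decompress(blob).decode("utf-8"))
        for name, blob in payload.items()
    }
```

A real implementation would pick the codec per column type (numeric columns compress far better as binary buffers than as JSON text), but the wire shape, a dict mapping column names to compressed blobs, is the same.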

How important would that kind of optimization be for ipyvega/Altair? Low priority? High priority? Is anyone else interested in improving the data serialization for other dataframe formats?

Best, Jean-Daniel

jdfekete avatar Jun 04 '21 07:06 jdfekete

Adding to https://github.com/altair-viz/altair/issues/2471#issuecomment-854929074, I would say better serialization would be a great improvement and I am very supportive. I would suggest using Arrow as there is good support in Python and JS and more backends are adopting it as their internal representation.

domoritz avatar Jun 04 '21 18:06 domoritz

Done in #346 🎉

domoritz avatar Feb 12 '23 17:02 domoritz

Version 4.0 with this feature is released. Thanks for all your work on this.

domoritz avatar Apr 12 '23 02:04 domoritz