Improve serialization of Pandas DataFrames to ipyvega
Hi, Thanks for Altair. I have created a feature request issue for ipyvega that could also impact Altair: https://github.com/vega/ipyvega/issues/345
It boils down to creating a serializer to efficiently send a Pandas DataFrame to Vega. Currently, the communication in notebooks between Python and JS is very inefficient, especially with the verbose row-wise JSON format. It limits the amount of data that can reasonably be sent to JS, and limits the perceived performance of Altair.
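To make the overhead concrete, here is a minimal sketch using only the standard library. The table and its column names are made up for illustration; the row-wise layout is the array-of-objects shape that Vega-Lite accepts as inline `values`, and the columnar layout is shown only for comparison.

```python
import json
import random

# Hypothetical table: 1,000 rows, two float columns and one category column.
random.seed(0)
n_rows = 1000
columns = {
    "x": [random.random() for _ in range(n_rows)],
    "y": [random.random() for _ in range(n_rows)],
    "category": [random.choice(["a", "b", "c"]) for _ in range(n_rows)],
}

# Row-wise records: every row repeats every column name.
row_wise = [
    {name: values[i] for name, values in columns.items()}
    for i in range(n_rows)
]

row_json = json.dumps(row_wise)
col_json = json.dumps(columns)

print(f"row-wise JSON: {len(row_json):,} characters")
print(f"columnar JSON: {len(col_json):,} characters")
```

The row-wise payload is larger because the keys are repeated once per row; the gap grows with the number of columns and the length of their names.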
I am interested to see whether this point is important, critical, or just secondary to Altair's adoption. I think the limitation on data size is an issue, but I may be biased. Please comment on my feature request so I can decide how to address it.
Thanks in advance, Jean-Daniel
Please follow these steps to make it more efficient to respond to your feature request.
- [X] Since Altair is a Python wrapper around the Vega-Lite visualization grammar, most feature requests should be reported directly to Vega-Lite. You can click the Action Button of your Altair chart and "Open in Vega Editor" to create a reproducible Vega-Lite example.
- [X] Search for duplicate issues.
- [X] Describe the feature's goal, motivating use cases, and its expected behavior.
More efficient data serialization would be useful, but such changes would first have to be supported in Vega-Lite.
Thanks @jdfekete for raising the issue, and also flagging @domoritz.
Scale is a recurring issue for Altair users, at least as evidenced in my visualization courses at UW. (Some students benefit from the altair data server package, but that is not a one-size-fits-all solution.) Right now the scalability experience in Observable notebooks (where the data is already in JS) is often much better than with Altair due to this serialization overhead.
While I agree with @jakevdp that more might be done in Vega itself, perhaps there is also space for handling data serialization in the generated HTML/JS prior to invoking Vega/Vega-Lite. For example, one could imagine serializing a data table to an Apache Arrow byte array in Python and then passing that instead (even if only as a base64-encoded string) to be deserialized using the Arrow JS or Arquero libraries. If so, it seems to me the costs involved would largely be (1) having to load additional JS libraries client side, and (2) format-contingent HTML/JS code generation for deserializing data before passing it to Vega.
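As a rough illustration of the "binary column as a base64 string" idea, the sketch below uses only the Python standard library. It is not Arrow: the `array` module stands in for a real columnar format, and the decoding step stands in for what client-side code (for example Arrow JS) would do after receiving the string.

```python
import base64
import json
import random
from array import array

random.seed(0)
values = [random.random() for _ in range(1000)]

# Text route: each float becomes a decimal string inside JSON.
as_json = json.dumps(values)

# Binary route: pack the column as contiguous 64-bit floats (native byte
# order), then base64-encode so the bytes can be embedded in HTML or JSON.
packed = array("d", values).tobytes()
as_base64 = base64.b64encode(packed).decode("ascii")

# Round trip, standing in for the client-side deserialization step.
decoded = array("d")
decoded.frombytes(base64.b64decode(as_base64))

print(f"JSON text:     {len(as_json):,} characters")
print(f"base64 binary: {len(as_base64):,} characters")
```

Even with the roughly one-third size overhead of base64, the packed column is smaller than the JSON text here, and decoding it does not require parsing numbers from strings. A real implementation would also need to carry the schema and byte order, which Arrow handles.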
How feasible might it be to have some kind of small plug-in system in Altair and/or ipyvega that allowed customized code for (a) serializing data on the Python side, and (b) adding library imports and deserialization code on the client side?
I absolutely agree that improving data serialization would be a huge improvement.
The way I see it, Altair is a Python API to generate Vega-Lite specs and these specs can be rendered in different platforms. Therefore, we may need to look at each of the platforms and improve serialization there.
When I was working with Streamlit, I added some code to separate the data from the chart specification so that the data can be sent as an Arrow table. You can see how I did it at https://github.com/streamlit/streamlit/blob/9714e3e6f852c26e3f8a155d39c2d5028dff1d71/lib/streamlit/elements/altair.py#L305. We could do something similar in ipyvega (https://github.com/vega/ipyvega/issues/345). I think sending the data as Arrow makes the most sense: it is columnar and binary, so values such as floating-point numbers are much more compact than when they are encoded as strings.
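The general idea of separating data from the spec can be sketched on a plain Vega-Lite spec dictionary. This is not the Streamlit code linked above; `split_inline_data` and the dataset name `"source"` are hypothetical, and the sketch only handles inline `values` at the top level of the spec.

```python
import copy


def split_inline_data(spec, name="source"):
    """Return (spec_without_data, datasets) for a Vega-Lite spec dict.

    Inline rows under spec["data"]["values"] are moved into a separate
    mapping and replaced by a named data reference, so the rows can be
    shipped through a different channel (for example as Arrow).
    """
    spec = copy.deepcopy(spec)  # leave the caller's spec untouched
    datasets = {}
    data = spec.get("data", {})
    if "values" in data:
        datasets[name] = data["values"]
        spec["data"] = {"name": name}
    return spec, datasets


# Hypothetical spec with two inline rows.
full_spec = {
    "mark": "point",
    "data": {"values": [{"x": 1, "y": 2}, {"x": 3, "y": 4}]},
    "encoding": {
        "x": {"field": "x", "type": "quantitative"},
        "y": {"field": "y", "type": "quantitative"},
    },
}

slim_spec, datasets = split_inline_data(full_spec)
print(slim_spec["data"])  # the spec now only names its data source
print(datasets["source"])  # the rows, ready for a separate transport
```

The slim spec stays small regardless of table size, and the renderer can attach the rows to the named source once they arrive.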
I don't think the overhead of Arrow JS in ipyvega is too large so I think we could always add it. We should measure the impact of serialization/deserialization compared to JSON to determine whether we want a flag to control whether the data is transferred as Arrow or JSON.
Closing this as there is nothing to do on the Altair side of things. See https://github.com/vega/ipyvega/pull/346 for the current progress on this feature.
I also want to point to https://vegafusion.io, which not only has efficient transport but also offloads computation to the backend making charts much more responsive.