Improve serialization of Pandas DataFrames to ipyvega
With @xtianpoli, we have implemented an improved serialization of Pandas DataFrames for ipyvega. It is not complete; we still need to follow Altair's rules for column type conversions. But we already get a noticeable speedup compared to the current version, which sends verbose JSON.
On the Python side, we have implemented a new traitlet type for the VegaWidget: a "Table". It is a dictionary of columns (a columnar representation) of a Pandas DataFrame (and potentially other tables) where we try to avoid copying anything; we just point to the low-level numpy arrays managed by Pandas, which can be serialized without a copy using the buffer protocol. Additionally, each column can be compressed with either zlib or lz4, which boosts the transfer speed of columns. On the other side, we translate the columnar table format into Vega internal tuples, again avoiding copies when possible. Note that this serialization is only used by the streaming API, since it requires using our traitlet in the VegaWidget; it cannot work inside a Vega dataset.
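A minimal sketch of the columnar idea (the helper name `dataframe_to_columns` is illustrative, not the actual ipyvega API; the real implementation also handles compression and more dtypes):

```python
import numpy as np
import pandas as pd

def dataframe_to_columns(df: pd.DataFrame) -> dict:
    """Sketch: build a columnar dict from a DataFrame, exposing numeric
    columns as memoryviews over the numpy buffers managed by pandas
    (no copy in the common case); other columns fall back to lists."""
    columns = {}
    for name in df.columns:
        values = df[name].to_numpy()  # a view for plain int/float blocks
        if values.dtype.kind in "iuf":
            columns[name] = {
                "shape": values.shape,
                "dtype": str(values.dtype),
                "buffer": memoryview(np.ascontiguousarray(values)),
            }
        else:
            columns[name] = {
                "shape": values.shape,
                "dtype": "str",
                "buffer": values.tolist(),
            }
    return columns

df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]})
cols = dataframe_to_columns(df)
```

The memoryviews can then be serialized through the Jupyter comm channel via the buffer protocol without materializing a JSON copy.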
Let us know if you disagree with our design choices. There are a few possible pitfalls, such as sending multiple DataFrames, which is not supported (yet). If you see a clean way to do it, let us know.
We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method of sending data to the browser with ours. It would boil down to, when an Altair-generated JSON spec is detected by the notebook, wrapping it in a VegaWidget and calling update_dataframe on the Pandas DataFrame immediately after. If you can do that, then Altair would be boosted transparently, able to support much larger datasets.
There are new notebooks to showcase the new capabilities and performance.
We did not use apache-arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.
Looking forward to your comments, questions, and thoughts. Best, Jean-Daniel & Christian
This pull request fixes 6 alerts when merging a8140233c3b93b5eb5e566db318a1a40feaef262 into 7cda844243f499a3017e29d0d7c8f9312e7fda83 - view on LGTM.com
fixed alerts:
- 6 for Unused import
Thank you for the pull request. I think this is a good start but I would like to discuss a few options with you.
> We did not use apache-arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.
Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?
> There are a few possible pitfalls, such as sending multiple DataFrames, not supported (yet). If you see a clean way to do it, let us know.
What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.
> We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method of sending data to the browser with ours. It would boil down to, when an Altair-generated JSON spec is detected by the notebook, wrapping it in a VegaWidget and calling update_dataframe on the Pandas DataFrame immediately after. If you can do that, then Altair would be boosted transparently, able to support much larger datasets.
Better support specifically for Altair is great. Have you adopted the idea of separating data from specs I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?
> Additionally, each column can be compressed with either zlib or lz4, which boosts the transfer speed of columns.
Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression and the additional copies we make when we compress data?
> > We did not use apache-arrow as an intermediary format since it would always make a copy, and since we want to handle large datasets, we want to avoid copying them in the first place.
>
> Can you explain where these copies are happening with Arrow? When we use JSON, we already have to make a copy from pandas, no?
No. For int and float columns, there is no copy:
```python
import pandas as pd
from vega.widget import VegaWidget
from vega.dataframes.serializers import table_to_json
from vega.dataframes.pandas_adapter import PandasAdapter

w = VegaWidget('whatever')
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5], 'c': ['a', 'b', 'c']})
adf = PandasAdapter(df)
table_to_json(adf, w)
```

```
{'columns': ['a', 'b', 'c'],
 'data': {'a': {'shape': (3,),
   'dtype': 'int32',
   'buffer': <memory at 0x7f5e654b4040>},
  'b': {'shape': (3,),
   'dtype': 'float64',
   'buffer': <memory at 0x7f5e654b4340>},
  'c': {'shape': (3,), 'dtype': 'str', 'buffer': ['a', 'b', 'c']}}}
```
> > There are a few possible pitfalls, such as sending multiple DataFrames, not supported (yet). If you see a clean way to do it, let us know.
>
> What do you mean you don't support multiple dataframes? I think it would be good if we could send data separately from the spec and support multiple datasets.
Currently, our proof of concept is based on the streaming API, and we only send one dataframe at a time with the update_dataframe method. This can be extended once we agree on the underlying mechanisms.
> > We also provide some helpers for Altair, but we're not sure how to fully replace the standard Altair method of sending data to the browser with ours. It would boil down to, when an Altair-generated JSON spec is detected by the notebook, wrapping it in a VegaWidget and calling update_dataframe on the Pandas DataFrame immediately after. If you can do that, then Altair would be boosted transparently, able to support much larger datasets.
>
> Better support specifically for Altair is great. Have you adopted the idea of separating data from specs I implemented in https://github.com/streamlit/streamlit/blob/96f17464a13969ecdfe31043ab8e875718eb0d10/lib/streamlit/elements/altair.py#L315?
No. Thanks for pointing to this; I will see how we can use it with ours. Our examples use a similar but less flexible mechanism.
> > Additionally, each column can be compressed with either zlib or lz4, which boosts the transfer speed of columns.
>
> Does this have much benefit over transparent gzip compression over HTTP? How big is the overhead for compression/decompression and the additional copies we make when we compress data?
Yes it does, see https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/. Basically, lz4 and zstd (both from the same author) are faster than gzip at compressing, even faster at decompressing, and more efficient overall. See also the zstd page: https://facebook.github.io/zstd/
The compression "codecs" (such as lz4 or zlib) should be part of the library and pre-selected for casual users. If you know the distribution profile of a data column, a specific codec can make a huge difference (see e.g. https://github.com/lemire/FastPFor). The standard gzip compression used in HTTP is not efficient and flexible enough to accommodate data characteristics.
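As a rough standard-library illustration of how much the data profile matters (zlib stands in here for faster third-party codecs such as lz4 or zstd; the numbers are indicative only):

```python
import zlib
import numpy as np

# A monotonically increasing int64 column: the high bytes are almost all
# zeros, so even a general-purpose codec shrinks it dramatically, and a
# delta-based codec (a la FastPFor) would do even better.
column = np.arange(100_000, dtype="int64")
raw = column.tobytes()                # 800,000 bytes

fast = zlib.compress(raw, level=1)    # fast setting, lower ratio
best = zlib.compress(raw, level=9)    # slow setting, higher ratio
print(len(raw), len(fast), len(best))
```

The same experiment on a column of random floats compresses far worse, which is why per-column codec selection pays off.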
This pull request fixes 6 alerts when merging 283fe7572e499f446f1fc24ba969477235a1d667 into 7cda844243f499a3017e29d0d7c8f9312e7fda83 - view on LGTM.com
fixed alerts:
- 6 for Unused import
Thanks for the notes. I think I would personally still prefer Arrow since it encodes pretty efficiently and is well supported. It will make it a lot easier to maintain ipyvega in the future if we don't roll our own serialization.
> Thanks for the notes. I think I would personally still prefer Arrow since it encodes pretty efficiently and is well supported. It will make it a lot easier to maintain ipyvega in the future if we don't roll our own serialization.
Ok, I'll do a different project then. But note that creating an Arrow table from a pandas dataframe is not free. For the 190k rows of the flights dataset, it takes 4 ms to create an Arrow table, compared to the 38 ms required to send the pandas table, serialized and compressed, to the browser through the notebook. That means an overhead of about 10% in time, assuming Arrow's serialization is as efficient as ours, and a 100% overhead in memory. As far as I know, Arrow serialization does not compress, so sending the flights table would take about 200 ms when the browser is on the same machine as the Python process (the same time as sending the uncompressed pandas table). For remote notebooks, the time will increase further. And you would still need to develop and maintain the conversion and communication in ipyvega. But I agree that the multiplication of incompatible data table formats is a pain, and relying on a good one is probably easier than supporting all the existing ones.
> As far as I know, Arrow serialization does not compress
Arrow does support compression: https://arrow.apache.org/docs/python/generated/pyarrow.compress.html (JS support: https://issues.apache.org/jira/browse/ARROW-8674?jql=project%20%3D%20ARROW%20AND%20text%20~%20lz4).
> assuming the serialization is as efficient as ours
I suspect the arrow implementation is highly optimized and may improve in the future.
The reason why I argue for arrow is that we would also immediately support other backends besides Pandas. Would it be an option for you to use arrow directly in your applications rather than pandas? Then you could avoid the copies as well.
Overall, I really like the direction you took here and would love to get better serialization added. I hope we can work together on a solution that works well for everyone.
I was also hoping we could join forces, but Arrow does not build on the numpy/scipy stack that the Python data science ecosystem supports efficiently; it is an alien in that world. Therefore, if I use Arrow tables, every package such as scikit-learn or TensorFlow will need to convert data back and forth between Arrow and numpy, adding overhead in memory and time. Additionally, pyarrow tables are mostly static: they cannot grow or shrink. There is a strong impedance mismatch between them and the Python ecosystem. Since most of my research deals with highly dynamic, very large tables, pyarrow would not suit my needs.
Otherwise, Arrow might be a good option for ipyvega, except that it would need pyarrow in Python and arrow-js in the browser. Regarding Altair, it is currently limited to 5k rows, so supporting 50-500k rows will be an improvement, and copying these smallish pandas tables to Arrow is probably acceptable.
On the other hand, the Python notebook community has already started to improve the serialization of numpy-based structures, e.g. https://github.com/vidartf/ipydatawidgets. I can create an extension independent of ipyvega to serialize tables made of numpy columns and use it to send any data structure based on numpy, including Pandas dataframes, series, and my own tables and columns. Other notebook extensions might find it useful too. The connection to ipyvega could then be made from outside ipyvega.
If you want to support serialization from pyarrow, you would still need to implement a specific traitlet in Python and JS, similar to but maybe simpler than my serializer. I don't know whether more work would be needed: there are some specific rules in Altair for adapting a pandas dataframe for serialization that you might need to follow, but maybe pyarrow will take care of these rules by itself, I don't know.
> There is a strong impedance mismatch between them and the python ecosystem.
I'm surprised to hear that since Wes started both Pandas and Arrow.
> Additionally, pyarrow tables are mostly static, they cannot grow or shrink.
I think making a new table can be cheap if you batch your table.
Regarding memory usage, we could adopt a streaming serialization model so that we don't have to keep copies in memory. See https://arrow.apache.org/docs/python/ipc.html for example.
> I was also hoping we could join forces
My main concern is that I have to maintain serialization and deserialization code for the coming years. We have to justify a thousand lines of extra code carefully.
> No. For int and float columns, there is no copy:
I'm not following. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html seems to imply that it makes copies.
Arrow also does not make copies for some cases: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions.
> > There is a strong impedance mismatch between them and the python ecosystem.
>
> I'm surprised to hear that since Wes started both Pandas and Arrow.
I don't mean to criticize Arrow or Pandas at all, but Arrow is "universal" and therefore does not exactly match Python data structures, implying copies from one to the other when working in the Python world.
> > Additionally, pyarrow tables are mostly static, they cannot grow or shrink.
>
> I think making a new table can be cheap if you batch your table.

But then the abstraction level is very low: you need to explicitly manage the batches.
> Regarding memory usage, we could adopt a streaming serialization model so that we don't have to keep copies in memory. See https://arrow.apache.org/docs/python/ipc.html for example.

Copying Pandas to Arrow could be seen as reasonable, since tables would not exceed 500k rows, but not copying is even cheaper.
> > I was also hoping we could join forces
>
> My main concern is that I have to maintain serialization and deserialization code for the coming years. We have to justify a thousand lines of extra code carefully.
True. The heaviest operation would be close to Altair's "sanitize_dataframe" function: https://github.com/altair-viz/altair/blob/1386b531529614c25904af53b031293e4f8535aa/altair/utils/core.py#L243 which is about 100 lines of code.
> > No. For int and float columns, there is no copy:
>
> I'm not following. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html seems to imply that it makes copies.
Extracting columns from a Pandas DataFrame returns one Series per column with no copy. A Series holds an "array" that encapsulates the inner storage of the Series. We currently call https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy which sometimes returns a copy and sometimes not. We could use https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.array.html instead and dispatch according to the concrete type of the NumpyExtensionArray, only creating a copy when needed. That code would need to be maintained as Pandas evolves. Relying on Arrow to make the conversion is easier, but I am not sure it is sufficient: when you look at what Altair does to sanitize a DataFrame, it makes some decisions to transform Pandas data in a way Vega can handle, so Vega might need some ad-hoc transforms here and there anyway.
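A sketch of that dispatch (`column_buffer` is a hypothetical helper, not the actual ipyvega code):

```python
import numpy as np
import pandas as pd

def column_buffer(series: pd.Series):
    """Return the column's ndarray without copying when the dtype allows it;
    object/extension dtypes may have to materialize a new array."""
    if series.dtype.kind in "iuf":          # plain int/uint/float block
        return series.to_numpy(copy=False)  # a view in the common case
    return series.to_numpy()                # strings etc. may copy

df = pd.DataFrame({"i": [1, 2, 3], "s": ["a", "b", "c"]})
buf = column_buffer(df["i"])
# In the common case the returned array shares memory with pandas' storage:
print(np.shares_memory(buf, df["i"].values))
```

`np.shares_memory` is a convenient way to check whether a given extraction path actually avoided the copy.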
> Arrow also does not make copies for some cases: https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions.

Yes, once an Arrow table is created (implying a full copy of the DataFrame), serialization is done without copying, which is great.
Again, when you look at the code of Altair's "sanitize_dataframe", it creates a copy of the initial DataFrame, so creating an Arrow table instead would not cost more. Then, extra code is needed to "sanitize" the table.
But when using the streaming API, adding extra overhead to each call of "update" is expensive.
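To make that sanitizing concrete, here is a deliberately minimal, hypothetical version of such a dtype translation (the real "sanitize_dataframe" handles many more cases, e.g. timezones, categoricals, and NaN/Inf cleanup):

```python
import pandas as pd

def vega_type(dtype) -> str:
    """Map a pandas dtype to the closest Vega-friendly type name.
    Every pandas release can add dtypes that need a new branch here."""
    if pd.api.types.is_bool_dtype(dtype):
        return "boolean"
    if pd.api.types.is_integer_dtype(dtype):
        return "integer"
    if pd.api.types.is_float_dtype(dtype):
        return "number"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "date"
    return "string"

df = pd.DataFrame({
    "a": [1, 2],
    "b": [1.5, 2.5],
    "c": ["x", "y"],
    "d": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})
print({name: vega_type(df[name].dtype) for name in df.columns})
```

Whoever owns this mapping (us, Arrow, or Altair) carries the maintenance burden as the dtype systems on both sides evolve.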
In any case, if I move the notebook extension for serializing tables into a separate project, you may decide later which route you want to take to improve serialization performance. This decision also affects Altair, so @jakevdp's opinion would be useful.
I see. It sounds like there are benefits to the custom serialization that everyone could benefit from.
I'd be happy to merge this pull request then if you could help to finish it up with
- support for multiple datasets
- support for extracting datasets from Altair akin to the streamlit code I linked to above
- abstract interfaces for serialization/deserialization implemented with your custom code but where we could swap in others for comparison
- unit tests and integration tests to ensure that the serialization is correct and that we don't break it in future versions
How does that sound?
> I see. It sounds like there are benefits to the custom serialization that everyone could benefit from.
>
> I'd be happy to merge this pull request then if you could help to finish it up with
>
> - support for multiple datasets
> - support for extracting datasets from Altair akin to the streamlit code I linked to above
> - abstract interfaces for serialization/deserialization implemented with your custom code but where we could swap in others for comparison
> - unit tests and integration tests to ensure that the serialization is correct and that we don't break it in future versions
>
> How does that sound?
After a discussion with Christian, we thought it would be best to split the code in two parts:
- create an ipytablewidgets project that defines a Python trait for tables (such as dataframes) and widgets to serialize them. This project can be reused independently of Vega/Vega-Lite and derived for other types of Python tables, such as https://www.pytables.org/ etc. This is what https://github.com/vidartf/ipydatawidgets has done for numpy arrays
- ipytablewidgets would provide abstract interfaces for serialization / deserialization so that others can be added for comparison and future enhancements
- adapt ipyvega to use ipytablewidgets internally, supporting multiple datasets etc.
- support extracting datasets from Altair akin to the streamlit code
- unit test ipytablewidgets, taking inspiration from the ipydatawidgets code
- as for unit tests and integration tests in ipyvega, we can give it a try but it might take a bit more time. Your help would be appreciated.

Would it be ok for ipyvega to depend on ipytablewidgets?
That all sounds great. In fact, there is a big advantage of separating out reusable components. There are Vega runtimes for Python notebooks not only here in ipyvega but also in nteract, Jupyter, Streamlit, etc. If we have a reusable serialization component, we can probably update these other implementations more easily.
Thank you for pushing updates to the pull request. Let me know when it's ready for a review. I'll mark this as a draft for now.
We're still working on the unit tests for the dependent project ipytablewidgets; I'll let you know when they're done. Additionally, we planned to first support Vega-Lite with only one dataset before extending the VegaWidget to multiple datasets, so something usable would be available earlier. Let me know if you would rather test it with all the functionality in place, i.e., with support for multiple datasets.
I think I prefer to test it all together.
FWIW, Streamlit switched to using Arrow and it greatly reduced their code complexity: https://twitter.com/streamlit/status/1418637045468000256
Btw, Streamlit switched from their custom serialization to Arrow and got pretty good results: https://github.com/streamlit/streamlit/issues/239#issuecomment-935300084. I'm still very curious about the performance differences. If our custom serialization here is faster, they may be interested in it as well.
I am sure Arrow's serialization with compression is better than JSON serialization without compression. There are two main issues with Arrow:
- you need to fully copy a Pandas DataFrame to an Arrow Table before serializing it. If the goal is scalability, you will increase the memory footprint of the Python code by copying potentially large tables just to serialize them. With our ipytablewidgets, we try hard to avoid duplicating columns in Python by using memory views when possible.
- even if you use Arrow, you need special code to translate the known Arrow data types into the known Vega data types. Altair implements it with its "sanitize_dataframe" function at https://github.com/vega/ipyvega/blob/master/vega/utils.py. This code needs to know the column types of Pandas and the supported types of Vega, and must be updated when either of them changes. If you switch to Arrow, it is Arrow's responsibility to translate from Pandas to Arrow in the best possible way, but Altair and Vega still need to convert the Arrow tables to data types they can handle. Therefore, the effort of following the evolution of Pandas to keep Altair/Vega up to date with the translation becomes an effort to follow the evolution of Arrow and its handling of Pandas data types. This might be simpler, but there is still an effort involved.
- also, we need formal benchmarks for serialization/compression regarding time to compress, time to serialize, time to decompress, and time to render, to judge the efficiency of the whole process. We have started to do these benchmarks in our ipytablewidgets, using a local notebook and a remote notebook, and have some initial results that show interesting improvements. They should be compared with Arrow, as well as the memory footprints, to have a clear view of the situation. My opinion (which needs more facts) is that the compression rates will be similar between ipytablewidgets and Arrow, so the only difference will be related to the conversion and copy from Pandas to "Pandas understandable by Vega" or from Pandas to Arrow. That conversion is probably faster than the transfer, depending on the size of the table, but will certainly use much more memory with Arrow than with ipytablewidgets.
In the end, there's a trade-off between maintainability and performance and this trade-off is a matter of opinion and not of hard numbers. But an informed opinion probably needs numbers.
Thanks for the detailed response. I agree with all your comments and look forward to the improvements. Either way, they will be significantly better than what we have right now.
This pull request introduces 3 alerts and fixes 7 when merging 4eaba72f7654b5d9cf5bf4b36f1815ad6e38d878 into 80c4e8bcbf90c027191a78c2ea7f7ad6cee1c364 - view on LGTM.com
new alerts:
- 3 for Unused import
fixed alerts:
- 6 for Unused import
- 1 for Redundant assignment
This pull request fixes 7 alerts when merging e29559709f279d3b27511a3c57da1ccf56d09472 into 80c4e8bcbf90c027191a78c2ea7f7ad6cee1c364 - view on LGTM.com
fixed alerts:
- 6 for Unused import
- 1 for Redundant assignment
Overall, really excited about this addition. I love the examples and how snappy they are.
- [ ] When running `VegaProgressivis`, I get `ModuleNotFoundError: No module named 'progressivis'`. I think you need to add the package to the poetry file.
- [ ] In `AllSupportedTypes`, `requests` is missing.
- [ ] `vega_datasets` is missing in `AltairStreaming`.
- [ ] `# The streaming API overrides the default transformer` / `alt.data_transformers.enable('default')`: I am curious why we override the default transformer. Can't we temporarily enable a transformer when we convert Altair, as I did in Streamlit and as you did for the `streaming` transform? See https://github.com/streamlit/streamlit/blob/6c7907c8d25cbeef42659f5a7dfb90597c13cff0/lib/streamlit/elements/arrow_altair.py#L368.
- [ ] Remove commented-out code from the example notebooks.
- [ ] Add documentation for the greatly expanded API.
Thank you for the updates. I added another set of comments. This is looking good and I am confident we can merge it when you address the comments.
Let me know when this is ready for another review pass. Looking forward to making this available to our users.
This pull request fixes 1 alert when merging f655f5d635dc6f9421b89ca2e1a04585d6bc7622 into 53b0516a25697a33206ce64f65b093a3b10f270a - view on LGTM.com
fixed alerts:
- 1 for Redundant assignment
We have updated the PR according to your questions and comments.
- Altair is now a dependency, used for tests and in two notebooks.
- The `autoresize` property is not enough; in several cases, you need to explicitly call `resize()` on the VegaEmbed object. The VegaStreaming notebook contains a list of examples from vega_datasets that will not display correctly without calling `resize()`, so this is now the default.
- The `touch` argument has been removed.
- vega/altair.py has been simplified; the code to visualize all the examples has been moved to the StreamingAltair notebook, where it is more visible and people can play with it to better understand the use of the streaming API.
If Altair is only a dependency for tests and a notebook, it should be a dev dependency only, right?
> If Altair is only a dependency for tests and a notebook, it should be a dev dependency only, right?
Probably, yes.
This pull request fixes 1 alert when merging ec18ab75e14d0c36697680c544a871f49f952477 into 7241fa3407eb71a27e35dffc2d9170ca6cde797a - view on LGTM.com
fixed alerts:
- 1 for Redundant assignment