
Reduce JSON serialization size / memory footprint

Open willeppy opened this issue 4 years ago • 4 comments

Problem: When using lux with VL / Altair, it seems that each visualization has its own copy of the data, which is then serialized through JSON to the frontend Jupyter widget in lux-widget. However, all of these visualizations actually reference the same underlying data, so multiple copies (or slices) don't need to be re-sent. Sending one copy would improve the widget's load time, since less data needs to travel from the backend to the frontend, particularly for large datasets. It might also reduce the extra memory footprint for large datasets.

Solution: If one copy of the data is sent, all the visualizations could reference it (or slices of it). This might require some re-engineering so that all the Vis objects can reference the same data.

Additional Notes: I think the Pandas executor already performs some optimizations so that each vis only carries the data relevant to that specific vis, but this could be reduced further. https://github.com/lux-org/lux/blob/a0cb921e6158910ecb5b061169fbe91fa0dbf0d5/lux/executor/PandasExecutor.py#L95

To see this data redundancy, you can look at the JSON created in to_JSON: the visualizations reference overlapping slices of the data.

https://github.com/lux-org/lux/blob/a0cb921e6158910ecb5b061169fbe91fa0dbf0d5/lux/core/frame.py#L704
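A quick way to see how much the redundancy costs is to compare the serialized size of many specs that each inline the same records against one shared copy referenced by name. This is a standalone sketch, not Lux's actual to_JSON output; the record shape and spec layout are illustrative assumptions:

```python
import json

# Illustrative records standing in for one data slice (hypothetical data).
records = [{"x": i, "y": i * i} for i in range(1000)]

# 40 visualizations, each inlining its own copy of the data: this mirrors
# sending one self-contained spec per vis through JSON serialization.
inlined = [{"mark": "point", "data": {"values": records}} for _ in range(40)]
inlined_size = len(json.dumps(inlined))

# One shared copy plus 40 lightweight references to it by name.
shared = {
    "datasets": {"source": records},
    "specs": [{"mark": "point", "data": {"name": "source"}} for _ in range(40)],
}
shared_size = len(json.dumps(shared))

# The shared form is many times smaller, approaching the number of
# duplicated copies as the dataset grows relative to the spec overhead.
print(inlined_size // shared_size)
```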

willeppy avatar Apr 30 '21 04:04 willeppy

Hi @willeppy, Thanks for raising this issue! I have definitely noticed that the JSON serialization time is a huge bottleneck when I was profiling the code in the past. This is why we decided to process the data first inside execute, so that we are only sending the processed data (e.g., pre-filtered, pre-binned, or pre-aggregated). However, for things like scatterplots, where we have to send all the data in the relevant columns, it can still be slow (so we replace these with heatmaps for large data). When you describe reducing the memory footprint by referencing the same data, did you mean implementing some type of caching mechanism so that we minimize the amount of data that we have to send through JSON serialization?

dorisjlee avatar Apr 30 '21 14:04 dorisjlee

Right, if there was caching, or if all the visualizations in the widget could reference the same copy of the data. Right now, if there are 4 action tabs with 10 visualizations each, I think there are 40 different slices of the data being sent in the JSON (many of which overlap). Reducing these would make the JSON smaller and potentially reduce the widget's memory footprint as well (though I'm less sure how big the impact on memory would be).
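One way to realize this on the backend is a content-addressed cache: before serialization, replace each vis's inline data slice with a key into a shared table, so identical slices are sent once. A minimal sketch, assuming vis payloads are plain dicts with a "data" entry; the function name, hashing scheme, and "data_ref" field are all hypothetical, not Lux code:

```python
import hashlib
import json

def dedup_payloads(vis_list):
    """Replace each vis's inline 'data' with a reference into a shared table.

    vis_list: list of dicts, each with a 'data' entry (a list of records).
    Returns (specs, datasets): specs reference datasets by content hash.
    """
    datasets = {}
    specs = []
    for vis in vis_list:
        # Hash the canonical JSON of the slice so identical slices collide.
        blob = json.dumps(vis["data"], sort_keys=True)
        key = hashlib.sha1(blob.encode()).hexdigest()[:12]
        datasets.setdefault(key, vis["data"])  # store each unique slice once
        spec = {k: v for k, v in vis.items() if k != "data"}
        spec["data_ref"] = key
        specs.append(spec)
    return specs, datasets

# Two visualizations sharing the same slice, plus one with a different slice.
slice_a = [{"x": 1}, {"x": 2}]
slice_b = [{"x": 3}]
vis_list = [
    {"mark": "bar", "data": slice_a},
    {"mark": "point", "data": slice_a},
    {"mark": "line", "data": slice_b},
]
specs, datasets = dedup_payloads(vis_list)
print(len(datasets))  # 2: the shared slice is stored only once
```

The frontend would then resolve each "data_ref" against the shared table before rendering, which is also where the memory saving would come from.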

willeppy avatar Apr 30 '21 17:04 willeppy

Yeah, I think we would have to measure the exact memory footprint and expected performance improvement from implementing the caching. Labelling this as an Epic so that we can revisit it later on.

dorisjlee avatar Apr 30 '21 22:04 dorisjlee

Vega-Lite supports named data sources as an alternative to inlining the dataset. https://vega.github.io/vega-lite/docs/data.html#named

While this makes the widgets less portable by default (although you could replace the named reference with an exported reference at import time), I think this would let you send one copy of the data to the browser runtime, added via the JavaScript API, rather than hardcoding the data into the Vega-Lite spec itself.
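Concretely, a spec with a named data source carries no dataset at all; the widget binds the data once at runtime. A sketch of what such a spec could look like, built as a plain Python dict (the name "shared_df" and the field names are illustrative assumptions):

```python
import json

# A Vega-Lite spec using a named data source instead of inline "values".
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "mark": "point",
    "data": {"name": "shared_df"},  # resolved at runtime, not embedded
    "encoding": {
        "x": {"field": "x", "type": "quantitative"},
        "y": {"field": "y", "type": "quantitative"},
    },
}

# On the JS side, the widget would then supply the data once, e.g. via the
# Vega view API after embedding:
#   vegaEmbed("#vis", spec).then(({view}) =>
#       view.insert("shared_df", records).run());
# Every spec that names "shared_df" reads the same in-memory copy.

print("values" in json.dumps(spec))  # False: no dataset embedded in the spec
```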

hydrosquall avatar Dec 01 '22 23:12 hydrosquall