altair icon indicating copy to clipboard operation
altair copied to clipboard

FEAT: Include only referenced data columns in chart specs

Open joelostblom opened this issue 3 years ago • 10 comments

This PR takes an inelegant and conservative approach to fix #2428. It converts the chart spec (without the data) to a string and removes each of the data field names that is not found in that string (i.e. the data fields not referenced anywhere in the spec). It will not work for short columns names such as 'x', or those that have the same name as a key in the VegaLite spec, e.g. 'field', but it is still likely useful for the vast majority of cases and can reduce the chart size significantly specification with many unused columns. For example, this chart

import altair as alt
from vega_datasets import data

alt.data_transformers.disable_max_rows()
source = data.movies()

import sys
sys.getsizeof(alt.Chart(source).mark_rect().encode(
    alt.X('IMDB_Rating:Q', bin=alt.Bin(maxbins=60)),
    alt.Y('Rotten_Tomatoes_Rating:Q', bin=alt.Bin(maxbins=40)),
    alt.Color('count():Q', scale=alt.Scale(scheme='greenblue'))
).to_json())

has a size of 1873345 bytes because there is a lot of unused columns included in the data that are not in the spec:

{'data-d01e86a42596d3c55cbbce3d21a64616': [
   {'Title': 'The Land Girls',
   'US_Gross': 146083.0,
   'Worldwide_Gross': 146083.0,
   'US_DVD_Sales': None,
   'Production_Budget': 8000000.0,
   'Release_Date': 'Jun 12 1998',
   'MPAA_Rating': 'R',
   'Running_Time_min': None,
   'Distributor': 'Gramercy',
   'Source': None,
   'Major_Genre': None,
   'Creative_Type': None,
   'Director': None,
   'Rotten_Tomatoes_Rating': None,
   'IMDB_Rating': 6.1,
   'IMDB_Votes': 1071.0},
   ...

Creating the same chart with the changes in this PR reduces the chart size ~7x to 269838 bytes and successfully remove all the irrelevant columns since they have unique names not found anywhere in the spec:

{'data-d01e86a42596d3c55cbbce3d21a64616': [
  {'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
  ...

This PR seems to pass all the tests, but I still want to run some benchmarks to see if the loops introduce any significant performance regressions for larger data sets.

I also need to move the change to the schema generation file instead of the automatically generated vegalite api file and this PR is not important as some of the other open PRs, but I started playing around with it and thought I would put this up in case anyone else wants to try or have better ideas of how to achieve the same results.

joelostblom avatar Apr 02 '22 05:04 joelostblom

I've been worried about adding this sort of thing because the failure mode in which we remove a column that's needed by the chart is something that can't easily be recovered from. If we land something like this, I think it needs an "off switch" so that Altair does not become completely unusable in the event of a corner case we haven't thought of.

jakevdp avatar Apr 02 '22 16:04 jakevdp

Do you mean adding something like alt.data_transformers.disable_data_compression()? But still having the "compression" in this PR (or similar) as the default behavior?

I was trying to think of a scenario where a column could be used in Vega-Lite without the column name being explicity referenced at least somewhere in the spec, but could not imagine what that would look like. Maybe @domoritz knows if this could happen (let me know if you prefer not to be pinged like this @domoritz)?


Another think I think would be helpful to reduce the chart size would be if Vega-Lite supported data in columnar instead of records format, since this would lead to much fewer instances of the field names being repeated in the spec. So instead of the current list of dicts:

[
{'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
{'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
{'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
]

it would just be a single dict:

{
'Rotten_Tomatoes_Rating': [None, None, None],
'IMDB_Rating': [6.1, 6.1, 6.1]
},

I haven't looked closely at whether this is possible, maybe it is already and we just need to change the default in Altair.

joelostblom avatar Apr 02 '22 16:04 joelostblom

One situation I can think of is column names with special characters that may or may not be escaped when referenced in the chart spec. You could also construct column names dynamically using a Vega expression, which is probably not common but would break with this approach.

jakevdp avatar Apr 02 '22 19:04 jakevdp

Here's an example that would break with this PR, I believe:

import altair as alt
import pandas as pd

df = pd.DataFrame({'somecolumn': [1, 2, 3, 4]})

alt.Chart(df).transform_calculate(
    data='datum["some" + "column"]'
).mark_point().encode(
    x='data:Q'
)

jakevdp avatar Apr 02 '22 19:04 jakevdp

Another case could be expressions that reference the keys of a data tuple. Then we should not project either.

In general, I think projecting data to only what's really needed is great but we need to be careful to provide an escape hatch and that we consider common hard cases like the ones we already identified.

domoritz avatar Apr 03 '22 01:04 domoritz

Thanks both! I will add a toggle switch and try to account for the scenarios raised here.

@domoritz do you have an example of what a spec would look like for the issue you brought up? I can't quite visualize it myself.

joelostblom avatar Apr 03 '22 02:04 joelostblom

Even with an escape hatch, I’m not sure we’d want to land this change. Imagine the headache of trying to write some non-trivial Vega expression and finding after an hour that Altair is silently removing the column you’re trying to reference. Unless we can make this account for all cases, we probably shouldn’t land it. And accounting for all cases would essentially require re-implementing Vega in Python.

jakevdp avatar Apr 03 '22 03:04 jakevdp

Hmmm yeah that would indeed be a frustrating situation... But I also have a hard time letting go of that the vast majority of use cases would benefit from an addition like this. What do you think of making it opt-in instead of opt-out? We could add the option alt.data_transformers.enable_data_compression() to enable this behavior and when doing so a warning/info message about edge cases could be displayed?

joelostblom avatar Apr 03 '22 06:04 joelostblom

Sure, opt-in would be better, but then 99% of users would never know the feature exists and never benefit from it. I think better would be to avoid the easy broken approach and try for the full correct approach... it would basically require rolling the altair_transform package into the main altair library, expanding its completeness, and utilizing that.

A simpler option might be to only enable this only in the case that the chart contains no transformations, which I think are the only cases that create problems.

jakevdp avatar Apr 03 '22 13:04 jakevdp

I agree about the problems with discoverability. I like your suggestion of a conservative default option to only enable this compression/projection for specs that do not include transforms. This will benefit many users and we could add an optional flag to include charts with transforms and raise a warning message if this is enabled:

alt.data_transformers.compress_datasets(transformed_charts=True)
UserWarning: Data compression can break charts with transforms where data fields are referenced
without the field name occurring literally in the chart spec. Please revert this option if you are
observing unexpected behavior with transformed charts.

I updated the draft PR with a suggestion of what this could look like. I also agree that the altair_transform options sounds like the best long term solution, and this PR could be valuable in the meantime without breaking changes when/if there is a switch to the altair_transform option in the backend.

joelostblom avatar Apr 04 '22 02:04 joelostblom

For reference, removing unused columns is soon possible in a new version of VegaFusion, see discussions in #2738.

binste avatar Dec 30 '22 19:12 binste

Ah that is great, thanks for mentioning that! We can probably close this PR once we have played around with VegaFusion a bit more then. There would probably still be value in having a simple approach in altair by default, but unless we can solve the issues mentioned above, it might be unnecessarily complicated to have something in altair that works only if there are no transforms instead of just referring to the VegaFusion option for all chart size reduction needs.

joelostblom avatar Jan 03 '23 10:01 joelostblom

Let's leave this to VegaFusion if this a concern for anyone.

mattijn avatar Feb 26 '23 14:02 mattijn