altair icon indicating copy to clipboard operation
altair copied to clipboard

Don't error when hashing data that can't be serialised to JSON

Open btabram opened this issue 2 years ago • 6 comments

This is a suggestion that would help a particular issue I've got but it might be a bit niche. Thanks for looking!

Converting non-serialisable data to a string instead of erroring seems fine when it's just for generating a hash. This allows to_dict to be called when there's data that json.dumps can't serialise.

In my case I've got datetime.time instances in the dataframe I'm plotting a chart from and am using Pydantic (which can convert datetime.time to JSON) to serialise the chart dictionary.

btabram avatar Aug 22 '23 14:08 btabram

Thank you for the suggestion @btabram! You already provide a textual description how this can be of benefit for you and others, that is nice, but it would be great if you also could provide an minimal working example in code. That make this a bit easier to review. Would that be possible to provide? Thanks again!

mattijn avatar Aug 23 '23 05:08 mattijn

Thanks for the quick response @mattijn! Here's some code that works with my change but fails without it:

import pandas as pd
from altair import Chart, TopLevelMixin
from datetime import time
from pydantic import BaseModel


class Result(BaseModel):
    chart: TopLevelMixin

    class Config:
        arbitrary_types_allowed = True
        extra = "forbid"
        json_encoders = {
            TopLevelMixin: TopLevelMixin.to_dict,
        }


# Create a chart
df = pd.DataFrame({"t": [time(second=i) for i in range(10)], "value": range(10)})
chart = Chart(df).mark_point().encode(x="t", y="value")

# Now try to serialise the result so the vega chart can be rendered elsewhere
result = Result(chart=chart)
print(result.json())

In my real workflow the data is coming from an Excel spreadsheet (and pandas.read_excel) and it would be nice to be able to create charts directly from the resulting dataframe, without having to convert any data that doesn't trivially serialise 🙂

btabram avatar Aug 24 '23 07:08 btabram

Thanks for adding context.

In your case I think this should work temporarily:

with alt.data_transformers.enable(consolidate_datasets=False):
    print(result.json(encoder=str))

Related is this issue: https://github.com/altair-viz/altair/issues/2199.

And the interesting thing is, if we use other, private, functions within Altair this hash can be computed correctly:

import pandas as pd
import datetime
from altair.utils.data import _data_to_json_string, _compute_data_hash

data_date = pd.DataFrame({
    "time": [datetime.date(2020, 5, 1), datetime.date(2020, 5, 3)],
    "value": [3, 5],
})

data_time = pd.DataFrame({
    "time": [datetime.time(second=1), datetime.time(second=2)],
    "value": [3, 5],
})

json_data_date = _data_to_json_string(data_date)
json_data_time = _data_to_json_string(data_time)

_compute_data_hash(json_data_date), _compute_data_hash(json_data_time)
('156c826cd7fbaa9f799b300221f8bd31', 'a45ae23ead672db1c8c7aa63f72ac58c')

Regarding your proposed solution. I think it would be better to rework this _dataset_name function used within _consolidate_data see https://github.com/altair-viz/altair/blob/main/altair/vegalite/v5/api.py#L55-L56 by using what is done here using the approach within .to_json() function of the data model, defined here https://github.com/altair-viz/altair/blob/main/altair/utils/data.py#L192-L193. It would be good to get a verification on this of another maintainer.

A bit more debugging. Using the conventional sanitization of dataframes this works:

from altair.utils.core import sanitize_dataframe
sanitize_dataframe(data_date)

But through our new dataframe interchange protocol this errors:

from altair.utils.core import sanitize_arrow_table
import pyarrow.interchange as pi

pa_table = sanitize_arrow_table(pi.from_dataframe(data_date))
NotImplementedError: Non-string object dtypes are not supported yet

Basically the dataframe cannot be changed into an arrow table.

Another caveat is that this also means that this only can work if the type is specified in the encoding, since the automatic inference of dtypes is now done using the dataframe interchange protocol if it is available (if I remember correctly @jonmmease, see https://github.com/altair-viz/altair/blob/main/altair/utils/core.py#L589)

mattijn avatar Aug 24 '23 14:08 mattijn

Another caveat is that this also means that this only can work if the type is specified in the encoding, since the automatic inference of dtypes is now done using the dataframe interchange protocol if it is available (if I remember correctly @jonmmease, see https://github.com/altair-viz/altair/blob/main/altair/utils/core.py#L589)

That's correct, we use the DataFrame interchange protocol approach unless pyarrow is not available or pandas is too old.

I'm not clear on how we would support datetime.time values, since Vega doesn't have a time type. @btabram are you converting these times into strings for display? Or to timestamps somehow?

jonmmease avatar Aug 24 '23 15:08 jonmmease

Apologies for taking so long to reply. I can't get that workaround working for me @mattijn but it's okay because I can just convert or drop the data the doesn't serialise.

@jonmmease I'm using a javascript library to render the chart and it looked fine but actually when I started investigating I realised it was just plotting as strings and I was lucky that my points were evenly spaced. So it's not a very useful chart and this request isn't so useful either. Sorry.

However, I do still have a usecase where I'd appreciate this change, but I accept it's not a very important one so feel free to just close this. I'm often reading a DataFrame from Excel and then passing that DataFrame straight to Altair and sometimes there's a datetime.time column which I'm not even plotting but its presence causes to_dict to fail at the hashing stage. I could and probably should only pass in the DataFrame columns that I'm actually plotting to Altair but I'd appreciate the ability to pass in an arbitrary DataFrame and have to_dict succeed.

Thanks for taking the time to consder this 🙂

btabram avatar Sep 13 '23 16:09 btabram

The intended behavior of OP might be fixed by this code change: https://github.com/altair-viz/altair/pull/3377#discussion_r1535526343.

@btabram, could you have a look if Altair version 5.3 provides the behavior you are looking for?

mattijn avatar Mar 30 '24 21:03 mattijn

Hi @mattijn yeah 5.3 fixes the issue that I was having, thanks. Looks like https://github.com/altair-viz/altair/pull/3377 includes the exact change I was proposing here so I'll close this now 👍

btabram avatar Apr 02 '24 11:04 btabram