Altair produces syntactically incorrect JSON output that contains a NaN token

Open sacundim opened this issue 5 years ago • 6 comments

See line #111 in this Gist, where Altair produced output with a NaN token, which is not syntactically valid JSON. The library will happily produce a JSON file like that, which is then prone to causing really opaque errors, for example in this altair_saver issue.

The way I got into this mess: I set a domain on an axis using a value that I computed from the data it visualizes but it turns out there's an edge case where my calculation can produce NaN. Ooops.

It occurs to me that the library should have a validation step to catch this sooner and produce a more useful error message.

sacundim avatar Sep 27 '20 07:09 sacundim

Can you provide a short reproducible example of code that produces invalid output?

jakevdp avatar Sep 27 '20 14:09 jakevdp

import altair as alt
import pandas as pd

source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

bad = alt.Chart(source).mark_bar().encode(
    x='a',
    y=alt.Y('b', scale=alt.Scale(domain=[0, float('Nan')]))
)

bad.save('bad.json')

When I look at bad.json (pasted below) I see JSON output with "scale": {"domain": [0, NaN]}. I pasted it into the online Vega Editor and it gives [Error] Unexpected token N in JSON at position 267 and line 1, and when I click on the "Share" option (to get a friendly link to the example inside the editor), the editor crashes.

I tried to pretty-print this JSON output to make it friendlier, but jq actually parses the NaN and produces null in the output:

{"config": {"view": {"continuousWidth": 400, "continuousHeight": 300}}, "data": {"name": "data-c2a3e89ba9d5d1687d5e8c28d630a033"}, "mark": "bar", "encoding": {"x": {"type": "nominal", "field": "a"}, "y": {"type": "quantitative", "field": "b", "scale": {"domain": [0, NaN]}}}, "$schema": "https://vega.github.io/schema/vega-lite/v4.8.1.json", "datasets": {"data-c2a3e89ba9d5d1687d5e8c28d630a033": [{"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43}, {"a": "D", "b": 91}, {"a": "E", "b": 81}, {"a": "F", "b": 53}, {"a": "G", "b": 19}, {"a": "H", "b": 87}, {"a": "I", "b": 52}]}}

sacundim avatar Sep 27 '20 18:09 sacundim

Altair relies on Python's built-in json module to serialize user input to JSON. It seems the bug is in the Python standard library:

>>> import json
>>> json.dumps({'x': float('nan')})
'{"x": NaN}'

jakevdp avatar Sep 27 '20 22:09 jakevdp

Ah, maybe we should be using something like allow_nan=False:

>>> json.dumps({'x': float('nan')}, allow_nan=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-62e31c5f8250> in <module>()
      1 import json
----> 2 json.dumps({'x': float('nan')}, allow_nan=False)

2 frames
/usr/lib/python3.6/json/encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

ValueError: Out of range float values are not JSON compliant

jakevdp avatar Sep 27 '20 22:09 jakevdp

Using an allow_nan=False option sounds like an essential failsafe. The only (generic) question I'd pose myself is whether higher-level validation that gives friendlier error messages for the library user (e.g., indicating where the NaN was found) is worth the effort for what's likely an infrequent failure mode.
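
For what it's worth, here is a minimal sketch of what that kind of higher-level validation might look like. It walks a nested dict/list and reports the path of every NaN or infinite float. find_non_finite is a hypothetical helper, not something Altair provides, and the spec below is a hand-written stand-in for what a chart's dict form might contain.

```python
import math


def find_non_finite(obj, path="$"):
    """Yield (path, value) for every NaN or infinite float in nested dicts/lists."""
    if isinstance(obj, float) and not math.isfinite(obj):
        yield path, obj
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_non_finite(value, f"{path}.{key}")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from find_non_finite(value, f"{path}[{index}]")


# Hand-written stand-in for a chart specification (an assumption for illustration).
spec = {"encoding": {"y": {"field": "b", "scale": {"domain": [0, float("nan")]}}}}

problems = list(find_non_finite(spec))
for where, value in problems:
    print(f"non-finite value {value!r} at {where}")
```

A check like this could run before serialization and raise an error that names the offending path, instead of the generic ValueError from the json module.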

sacundim avatar Sep 29 '20 08:09 sacundim

FYI this is still an issue, I just got bitten by this today. One workaround is to preprocess the input DataFrame with .replace({np.nan: None}) to turn NaNs into None which then get translated into valid JSON null values.
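
For values that are not in a DataFrame, the same idea can be sketched with the standard library alone. nan_to_none is a hypothetical helper name (not part of Altair or pandas), and the records are made up for illustration.

```python
import json
import math


def nan_to_none(obj):
    """Return a copy of obj with every NaN float replaced by None."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {key: nan_to_none(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [nan_to_none(value) for value in obj]
    return obj


# Made-up records with one missing value.
records = [{"a": "A", "b": 28.0}, {"a": "B", "b": float("nan")}]

cleaned = nan_to_none(records)
# allow_nan=False makes json.dumps raise if any NaN slipped through.
text = json.dumps(cleaned, allow_nan=False)
print(text)
```

Note that this only handles NaN; infinities would still make json.dumps raise with allow_nan=False.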

dechamps avatar Oct 29 '23 17:10 dechamps

https://docs.python.org/3.13/library/json.html#standard-compliance-and-interoperability

This module does not comply with the RFC in a strict fashion, implementing some extensions that are valid JavaScript but not valid JSON. In particular:

Infinite and NaN number values are accepted and output;

Repeated names within an object are accepted, and only the value of the last name-value pair is used.

Since the RFC permits RFC-compliant parsers to accept input texts that are not RFC-compliant, this module’s deserializer is technically RFC-compliant under default settings.
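
A small sketch of that deserializer behavior, using only the standard library: by default json.loads accepts the NaN token, and the documented parse_constant hook can be used to reject it. reject_constant is a hypothetical helper name and the input text is a trimmed-down stand-in for the saved chart.

```python
import json


def reject_constant(name):
    """Called by json.loads for 'NaN', 'Infinity' and '-Infinity'."""
    raise ValueError(f"non-standard JSON constant: {name}")


text = '{"scale": {"domain": [0, NaN]}}'

# Default settings: the NaN token is accepted and becomes float('nan').
lenient = json.loads(text)

# Strict variant: the same input is rejected.
try:
    json.loads(text, parse_constant=reject_constant)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(strict_error)
```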

I'm closing this as not planned. Changing the default will inevitably break others' code, and there are at least 3 ways to work around this:

  1. Clean and validate your data before visualizing it
  2. Use something like alt.FieldValidPredicate or alt.expr.isNaN while visualizing
  3. Change the call bad.save('bad.json') -> bad.save("bad.json", json_kwds={"allow_nan": False})

dangotbanned avatar Jan 11 '25 17:01 dangotbanned