typing
Define a JSON type
JSON is such a common interchange format it might make sense to define it as a specific type.
JSON = t.Union[str, int, float, bool, None, t.Mapping[str, 'JSON'], t.List['JSON']]
Not sure whether this should go into typing or be introduced as json.JSONType instead (or whether it's even worth it, considering the variability of the type).
I tried to do that, but a recursive type alias doesn't work in mypy right now, and I'm not sure how to make it work. In the meantime I use JsonDict = Dict[str, Any] (which is not very useful but at least clarifies that the keys are strings), and Any for places where a more general JSON type is expected.
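[Editor's note: a minimal sketch of that JsonDict workaround in practice; the summarize helper and its payload are made up for illustration.]

```python
from typing import Any, Dict

# Workaround alias: keys are known to be str, values stay untyped.
JsonDict = Dict[str, Any]

def summarize(payload: JsonDict) -> str:
    # Because the values are Any, no narrowing is enforced on them;
    # only the str-keyed-mapping shape is checked.
    return ", ".join(sorted(payload))

print(summarize({"name": "typing", "stars": 1000}))  # name, stars
```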
(I'm sure you meant t.Mapping[str, 'JSON'].)
You are right about what I meant and I fixed my comment to not confuse anyone in the future.
And I understand about the lack of recursive object support.
Would this be a better definition?
JSONValue = t.Union[str, int, float, bool, None, t.Dict[str, t.Any], t.List[t.Any]]
JSONType = t.Union[t.Dict[str, JSONValue], t.List[JSONValue]]
If you read RFC 4627, it says a JSON text must be an object or array at the top level (RFC 7159 loosens that to any JSON value, but it's not an accepted standard). If you want to play it safe with what json.loads() takes, then you can just flatten it to:
JSONType = t.Union[str, int, float, bool, None, t.Dict[str, t.Any], t.List[t.Any]]
I guess the real question is how far you want to take this, because if you assume that most JSON objects only go, e.g., 4 levels deep, you could handcraft accuracy to that level:
_JSONType_0 = t.Union[str, int, float, bool, None, t.Dict[str, t.Any], t.List[t.Any]]
_JSONType_1 = t.Union[str, int, float, bool, None, t.Dict[str, _JSONType_0], t.List[_JSONType_0]]
_JSONType_2 = t.Union[str, int, float, bool, None, t.Dict[str, _JSONType_1], t.List[_JSONType_1]]
_JSONType_3 = t.Union[str, int, float, bool, None, t.Dict[str, _JSONType_2], t.List[_JSONType_2]]
JSONType = t.Union[str, int, float, bool, None, t.Dict[str, _JSONType_3], t.List[_JSONType_3]]
But then again the union of those objects is pretty broad so this might not be the most useful type hint. :)
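[Editor's note: to see how the flattened alias behaves in practice, here is a small sketch; top_level_keys is a hypothetical helper, not anything proposed in the thread.]

```python
from typing import Any, Dict, List, Union

# The flattened alias discussed above; nesting degenerates to Any.
JSONType = Union[str, int, float, bool, None, Dict[str, Any], List[Any]]

def top_level_keys(doc: JSONType) -> List[str]:
    # Because JSONType is a union, a checker requires an isinstance
    # narrowing before any dict operation is allowed.
    if isinstance(doc, dict):
        return sorted(doc)
    return []

print(top_level_keys({"b": 2, "a": 1}))  # ['a', 'b']
```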
I guess that'll work, but I'm not convinced that it's very useful to do the multiple levels.
The next question is where this would live? Would you add it to typing.py? Or to the json module? It would have to be added both to json.pyi as well to the actual implementation module. :-(
OK, so JSONType = t.Union[str, int, float, bool, None, t.Dict[str, t.Any], t.List[t.Any]] seems to be the best solution.
As for where it should go, I don't have a good answer unfortunately. It's like the collections.abc issue; do we have a single place to keep all types -- i.e., typing -- or do we keep types in the specific modules that they relate to (i.e., json in this case)? I guess this would be the first type that wasn't a generic container if we add it to the stdlib somewhere, so there is no precedent to go by.
If we put it in typing then at least all types related to the stdlib are in a single location, which is handy for only having to do import typing as t to get at all types. Unfortunately that doesn't work for third-party libraries, so it doesn't seem like the best way to go. So I guess my suggestion is the json module should house it and keep the name JSONType for the module attribute. If you agree I will open an issue on bugs.python.org to add the type to Python 3.6 and then also an accompanying issue for https://github.com/python/typeshed to add a json.pyi and you can close this issue. Otherwise I'll submit a PR to add the type to typing.
I think it's best to add it to the json module; we can't keep adding everything to typing.py (even the io and re types there are already questionable). Code that wants to use these in Python 3.5 or earlier can write
if False:
    from json import JSONType
(Or they can copy the definition into their own code.)
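[Editor's note: a sketch of that guarded-import pattern. Note that json.JSONType was never actually added to the stdlib, so the import here is hypothetical; the else branch supplies the copied definition.]

```python
from typing import Any, Dict, List, Union

if False:  # seen by the type checker, skipped at runtime
    from json import JSONType  # hypothetical: never shipped in the stdlib
else:
    # Fallback: copy the flattened definition into your own code.
    JSONType = Union[str, int, float, bool, None, Dict[str, Any], List[Any]]

def describe(value: JSONType) -> str:
    # Trivial consumer showing the alias is usable at runtime.
    return type(value).__name__

print(describe([1, 2]))  # list
```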
Question: should we name it JsonType or JSONType? There doesn't seem to be a strong convention here in the Python stdlib -- we have XmlListener and HTMLParser... But somehow I am beginning to prefer AcronymsAreWords.
Actually, since the json module consistently uses JSONWhatevs, it should be JSONType.
PEP 8 says to capitalize abbreviations.
Opened an issue for the stdlib at http://bugs.python.org/issue26396 and one for typeshed at python/typeshed#84.
I'm marking this for 3.5.2 so we at least have the discussion. I'm still not sure of the solution -- add it to typeshed/.../json.pyi or to typing.py? It can't appear in the stdlib json module until 3.6, but it could appear in typing.py in 3.5.2 (since typing.py is provisional); however, I'm not excited about pushing everything to typing.py. So maybe adding it to typeshed/**/json.pyi now and the stdlib json module in 3.6 would be best? If you want to use it then you'd have to write if False: from json import JsonObject.
(I've got a feeling I'm just summarizing where we ended up before but I'm currently in the mood to make the definitive list of things to discuss and/or implement before 3.5.2.)
I'm not a fan of having a "partial" JSON type that soon degenerates into Any. Type checkers would enforce things inconsistently. As soon as you descend into a JSON object you'd have to manually annotate the result to get type checking for the component object, and this would be hard to do consistently.
Having a recursive JSON type seems like a better idea to me, but even then I'd like to see how the type works for real-world code before including it in the PEP. I suspect that the majority of code doing JSON parsing actually doesn't perform enough isinstance checks when manipulating JSON objects for the code to type check cleanly. I wouldn't like PEP 484 to require programmers to jump through hoops to get their code to type check. For example, just today I reviewed some JSON parsing code that does not perform enough checks to pass type checking if it had used a strict type for JSON, but I think that the code was fine (@gvanrossum do you recognize what code I'm talking about?) :-)
Anyway, if programmers want to use such a partial type, they can define the alias and use it in their code even without making it official, though they may have to introduce some explicit type annotations when interacting with library code that doesn't use the type.
Two problems with adding a recursive JSON type to the PEP:
- IIRC Brett and I tried and failed to come up with a recursive definition that worked in mypy
- I don't believe JSON is special enough to deserve a place in the PEP or typing (re and io are borderline but they are way more fundamental than JSON)
The summary @gvanrossum gave of where things left off was accurate. Didn't come up with a recursive type that could work.
In response to @JukkaL about usefulness, I view it as useful for specifying what json.load() returns, not what json.dump() accepts. This is what I came across in my own code when I was trying to do proper type hinting but had nothing better than Any to express a method parameter accepting a JSON object received from GitHub's API.
Somehow this was closed but we don't even have consensus!
I forgot to confirm that mypy doesn't support the kinds of recursive types discussed above, and there are no concrete plans to implement them.
@brettcannon I agree that JSON values are common in programs, but I'm not convinced that having a precise type would make it easy to type check common code that processes JSON data, because before accessing any value read from a JSON object, the code needs to use isinstance to narrow down the type from the union (assuming precise type checking of union types, similar to mypy). Most code I've seen is sloppy about this. Some code could be argued to be broken, but it's also possible that there is a top-level try/except statement that handles all errors, so the code might actually mostly do the right thing. (I can find an example if you are unsure about what I mean.) Also, it's possible that the code first verifies the entire JSON data structure and then accesses it, and the latter assumes that it has the correct structure. In the latter case a structural "dictionary-as-struct" type and an explicit cast might be best.
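[Editor's note: a sketch of the narrowing burden described above. The first_name helper and the payload shape are invented for illustration; the point is the isinstance checks a union type would require at every level.]

```python
import json
from typing import Any

def first_name(raw: str) -> str:
    data: Any = json.loads(raw)
    # Strict style: narrow every level of the union before indexing.
    # This is the boilerplate that most real-world JSON code skips.
    if isinstance(data, dict):
        users = data.get("users")
        if isinstance(users, list) and users and isinstance(users[0], dict):
            name = users[0].get("name")
            if isinstance(name, str):
                return name
    raise ValueError("unexpected JSON shape")

print(first_name('{"users": [{"name": "Ada"}]}'))  # Ada
```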
As there are many valid ways of processing JSON data, I think that Any is a reasonable default for the library stubs. User code could then use whatever static type for JSON data they want by adding an explicit type annotation. Thus I argue that it's not a good idea to make json.load() return a statically typed value. json.dump() is a little different and there a static argument type might make sense, but we don't have the means to describe the type of the argument in a useful enough way right now.
In order to describe types of JSON values precisely, these features would be useful:
- General recursive types -- for arbitrary JSON values
- "Dictionary-as-struct" types (#28) -- for JSON values conforming to a particular schema
(I started writing a proposal for (2) a while ago but got distracted.)
Neither is currently defined in PEP 484. The first one would be easy to specify but potentially tricky to implement. The latter would be tricky to specify and implement, and not all the use cases are clear to me yet. I suggest that we wait until a tool implements one or the other and then we can continue this discussion at a more concrete level.
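[Editor's note: for later context, the "dictionary-as-struct" idea discussed here eventually landed as TypedDict (PEP 589, Python 3.8+). A minimal sketch with an invented schema:]

```python
from typing import List, TypedDict

# A JSON object conforming to a known schema, typed field by field.
class Repo(TypedDict):
    name: str
    topics: List[str]

repo: Repo = {"name": "typing", "topics": ["types", "mypy"]}
print(repo["name"])  # typing
```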
I am very tempted to drop this idea. In my own code I use this:
JsonDict = Dict[str, Any]
which happens to cover perfectly what I'm doing (even though it sounds like Jukka has found some holes :-).
That really doesn't reach the threshold for adding it to a stub file to me (the definition exists in exactly two files).
I'm fine with closing this if recursive types aren't in the pipeline. Obviously the generic JSON type Guido and I came up with is not a very tight definition and so is of limited value. If the preference is only to worry about tight-fitting type boundaries then this doesn't really make sense.
OK, I'm closing this, because of a combination of things:
- We can't define JsonObject in a very tight way
- It's a simple one-liner to define a suitable JsonObject in your own code
- It's difficult to roll out the change in a useful way because we can't add it to the stdlib json module until 3.6 and it really doesn't belong in typing.py
Maybe we can just add the non-tight version to 3.6 and worry about tightening it up if/when we ever implement recursive types.
I suggest that even if we add the type to the module, we wouldn't use it as the return type of the load functions, at least for now (I discussed this above).
That's a subtlety I missed. Why is def json.load(stream) -> Any better than def json.load(stream) -> JSONType? Where would users add an explicit annotation that's allowed if it returns Any but not if it returns JSONType? Or are you talking about the situation where object_hook is used and the return value may in fact contain other values than the ones mentioned in Brett's union? That's indeed a good point (though one only appreciated by the few users who actually use that hook).
If it returns JSONType then the first thing any code needs to do is to run an isinstance check for the returned value, as due to being a union, most operations won't be valid on it. However, in some cases it's arguably okay to assume that the returned value is a dict, for example. If this an internal data file, we can be reasonably sure that the format is what we expect. I didn't think about object_hook but that might be another thing to consider.
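[Editor's note: the object_hook point can be made concrete. object_hook maps every decoded JSON object to an arbitrary Python type, so the result is no longer covered by the proposed union; SimpleNamespace here is just an arbitrary target type.]

```python
import json
from types import SimpleNamespace

# Every JSON object becomes a SimpleNamespace instead of a dict,
# so load()'s result escapes any dict-based JSONType union.
data = json.loads('{"x": 1}', object_hook=lambda d: SimpleNamespace(**d))
print(data.x)  # 1
```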
If a user wants to do the type check they can add an annotation if the return type is Any:
data = json.load(...) # type: JSONType
if isinstance(data, dict):
    ...
This, for example, would be rejected if the return type is a union, but would be fine if the return type is Any:
data = json.load(...) # type: Dict[str, Any] # error if load() return type is union
...
OK, you've convinced me that def load(fp) -> JSONType is a bad idea. I guess def dump(obj: JSONType, fp) is still acceptable except for the hook -- but because of the hook we can't use it there either. Maybe we should just leave well enough alone. @brettcannon?
I'm fine with tossing this whole idea out. I wasn't sure whether typing was trying to provide a type whenever someone was willing to specify one, or only when the type could be matched tightly. It seems like the latter is how you want to treat types, which is fine and makes something as variable and loose as JSON not worth worrying about.
I think a key requirement is that stubs should not reject code that is in fact correct. Better to accept code that's wrong. Unless of course the correct code is very convoluted, but I don't think that using the hook qualifies as convoluted, and there's just too much code around that reads JSON data and dives in as if it knows what's there, accepting random TypeErrors if the JSON data is wrong.
@gvanrossum You said that you use JsonDict = Dict[str, Any], but what if the JSON has a structure like this:
[
{...},
{...},
]
Is this correct?
from typing import Dict, List, Union, Any
JSONType = Union[
    Dict[str, Any],
    List[Any],
]
@lk-geimfari Just use the definition in the original post and either replace the nested JSON with Any or just add a # type: ignore at the definition line (mypy doesn't fully support recursive types yet, but it will automatically truncate recursive types expansion with Anys).
@ilevkivskyi I understand. Thanks!
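[Editor's note: a sketch of ilevkivskyi's suggestion, i.e. the recursive alias from the original post plus a type: ignore. The alias is legal at runtime (the string becomes a forward reference); mypy of that era rejected it, hence the ignore.]

```python
from typing import Any, Dict, List, Union

# Recursive alias from the original post; older mypy needed the
# ignore because recursive type aliases were unsupported.
JSON = Union[str, int, float, bool, None, Dict[str, "JSON"], List["JSON"]]  # type: ignore

# The alias documents intent; annotations are not enforced at runtime.
doc: JSON = {"items": [1, 2.5, "three", None, {"nested": True}]}
print(isinstance(doc, dict))  # True
```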
"there's just too much code around that reads JSON code and dives in as if it knows what's there, accepting random TypeErrors if the JSON data is wrong"
That is an awful situation, but then, JSON is intended to be semi-structured, so code that accepts it should be lenient in handling it. As a data engineer who wastes many hours dealing with loose and changing JSON (and XML), I see huge value in data producers using strict schemas both to validate the data and to publish alongside it. It never happens, though. XML Schema is useful like that, but most people don't bother generating schemas when producing XML.
There's so much in that statement. Code that parses JSON usually has to be gung-ho and make assumptions about what is there, because there is usually no provided schema. Alternatively you have to generate a schema by parsing the data, just so you can then re-read it correctly. You can infer a schema trivially by parsing a full document, but when you have millions of documents to process that vary, you end up sampling the documents and trying to generate a generic schema that fits the full set. Apache Spark has a JSON schema-inference method that does just that, but it's an expensive process, and wasteful in my opinion when it could be done up front. I'd probably also be out of a job, but hey, that would be a great situation.
If the JSON data is semi-structured, and no schema is defined or used for validation, then other than syntax errors the JSON data can't be strictly wrong, it is morphable and has to be dealt with loosely.
There are some other libraries that attempt to address this, and the problem is probably much more widespread than can be addressed within Python alone. It's a multi-language interchange format, so more formal schema validation outside of the typing library is probably where it belongs.
Here's one attempt at it (not used it, don't know if it's useful): https://github.com/Julian/jsonschema/
JSONType is a little bit confusing to me. Maybe a 'Serializable' type?
Just checking: are recursive type definitions a feature planned for a future release, or are they still in the backlog?
This issue should be kept open until that's resolved, I think.
OK.