[C++] Support for textual, JSON schema representation
Currently, Arrow has no textual representation for its schema that could serve the same purposes as JSON-Schema for JSON, the .proto files for Protobuf, etc. This issue is about adding such a text representation for an Arrow schema, to fill the same use cases that these textual representations fill for other data serialization formats.
The requirements for a text schema representation:
- Data, not code (can be used without being run directly, unlike e.g. calls to the Python API to create a Schema object)
- Readable by people who are experts in their field (e.g. data scientists) but not Arrow experts, without needing the docs side by side
- Small modifications possible with little or no use of the docs (e.g. changing a field from int32 to int64)
- Writing new schemas from scratch possible with the docs for non-Arrow experts
- Not tied to a particular version of Arrow and compatible between Arrow versions
And from a software engineering point of view, it would be very desirable for the implementation to not add another library dependency for Arrow (which already has many).
After discussion on the mailing list, the JSON representation for Flatbuffers data seemed the best candidate. It is a format supported by the Flatbuffers project for serializing Flatbuffers assets in a human-readable form, suitable for inclusion under source control. There is also already functionality in Arrow to convert Schema objects to a Flatbuffers representation. This would meet all the requirements above, while requiring only a small amount of new Arrow code to implement.
This issue will add functions to Arrow to load and save a textual, JSON representation of an Arrow schema, by first converting it to a Flatbuffers object, and then using the Flatbuffers functionality to save/load such objects as JSON.
Reporter: Christian Hudon / @chrish42
Note: This issue was originally created as ARROW-8952. Please see the migration documentation for further details.
Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.
It appears that while there was desire to use the Flatbuffers JSON representation, most implementations do not support this:
https://flatbuffers.dev/languages/go/#text-parsing
https://flatbuffers.dev/languages/java/#text-parsing
https://flatbuffers.dev/languages/javascript/#text-parsing-flatbuffers-in-javascript
That would effectively make it a C++-only option which defeats the point.
A JSONified C Data interface is an option, since the rules for the C Data interface are very clear and are guaranteed. If we're willing to make some new rules about defaults it's pretty compact:
{"format": "+s", "children": [{"format": "i", "name": "col1"}]}
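To make the "rules about defaults" idea concrete, here is a minimal Python sketch that expands the compact form into a fully specified node. The default values (empty name, flags of 0, no metadata, no children, no dictionary) are assumptions for illustration only; nothing in this thread settles them, and `expand_schema` is a hypothetical helper name.

```python
import json

# Assumed defaults for omitted ArrowSchema members (not decided in this thread).
DEFAULTS = {"name": "", "flags": 0, "metadata": None, "children": [], "dictionary": None}


def expand_schema(node):
    """Return a copy of `node` with every omitted member filled in, recursively."""
    if "format" not in node:
        raise ValueError("every schema node needs a 'format' string")
    full = {"format": node["format"]}
    for key, default in DEFAULTS.items():
        full[key] = node.get(key, default)
    # Build fresh lists/dicts so the shared defaults are never mutated.
    full["children"] = [expand_schema(child) for child in full["children"]]
    if full["dictionary"] is not None:
        full["dictionary"] = expand_schema(full["dictionary"])
    return full


compact = json.loads('{"format": "+s", "children": [{"format": "i", "name": "col1"}]}')
expanded = expand_schema(compact)
print(json.dumps(expanded, indent=2))
```

Only "format" is treated as mandatory here; whether "name" should also be required is one of the rules that would need to be agreed on.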
I'd love this for testing!
Digging through the previous discussion:
- Jacques proposed that as well: https://lists.apache.org/thread/kqrn9o14169s91n8p72gvbryzqql29d1
- But I think users and Micah worried it's not very readable: https://lists.apache.org/thread/w0tnvqncz0x569c42j5t4q2mwkw3tq61
I'd guess the main stumbling block is the format string but maybe we can support a single set of aliases (both "i" and "int32" or something)
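A rough Python sketch of what a single set of aliases could look like for the non-parameterized primitive types. The one-character format strings are the ones defined by the C Data interface; the alias spellings (e.g. "large_utf8") and the helper name `to_format_string` are assumptions, not an agreed mapping.

```python
# Alias -> C Data interface format string. The format strings come from the
# C Data interface spec; the alias spellings are assumed for illustration.
ALIASES = {
    "null": "n", "bool": "b",
    "int8": "c", "uint8": "C", "int16": "s", "uint16": "S",
    "int32": "i", "uint32": "I", "int64": "l", "uint64": "L",
    "float16": "e", "float32": "f", "float64": "g",
    "binary": "z", "large_binary": "Z", "utf8": "u", "large_utf8": "U",
}


def to_format_string(value):
    """Accept either an alias ("int32") or a raw format string ("i")."""
    # Anything that is not a known alias is passed through unchanged, so
    # parameterized format strings such as "tDm" keep working.
    return ALIASES.get(value, value)


print(to_format_string("int32"))  # alias
print(to_format_string("i"))      # raw format string
```

Because no alias is a single character, aliases and raw format strings cannot collide in this table.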
It's a good point...I think there wouldn't be much debate over what the canonical mapping of format strings to aliases would be. Some thoughts:
- Other open questions would be whether to support children as a mapping or allow shortening of types where everything is defaulted except the format (e.g., {"col_name": "int32"} is very nice to read!).
- Not sure if anybody would complain about the "default" being non-nullable (I think it's the default in all language implementations for a field).
- We probably want the flags to be verbose (e.g., ["non_nullable", "dictionary_ordered"]).
If nanoarrow counts for a second implementation I'm happy to draft an implementation 🙂
I'd guess "nullable": true would be more natural to people, so maybe also map flags to toplevel boolean fields?
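For reference, a small Python sketch of mapping the C Data interface flags bitfield to top-level booleans and back. The bit values are the ones defined for ARROW_FLAG_DICTIONARY_ORDERED, ARROW_FLAG_NULLABLE and ARROW_FLAG_MAP_KEYS_SORTED; the JSON key names and the choice to treat a missing key as false are assumptions.

```python
# Bit values from the C Data interface; the JSON key names are assumed.
FLAG_BITS = {
    "dictionary_ordered": 1,  # ARROW_FLAG_DICTIONARY_ORDERED
    "nullable": 2,            # ARROW_FLAG_NULLABLE
    "map_keys_sorted": 4,     # ARROW_FLAG_MAP_KEYS_SORTED
}


def flags_to_booleans(flags):
    """Split the integer bitfield into one boolean per known flag."""
    return {name: bool(flags & bit) for name, bit in FLAG_BITS.items()}


def booleans_to_flags(field):
    """Combine top-level boolean keys back into the integer bitfield."""
    flags = 0
    for name, bit in FLAG_BITS.items():
        if field.get(name, False):  # assumption: a missing key means False
            flags |= bit
    return flags


print(flags_to_booleans(2))
print(booleans_to_flags({"name": "col1", "nullable": True}))
```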
children as a mapping may be convenient but I think I'd want to leave that out, at least initially (I think given that it's an object vs an array it would be straightforward to add it later)
I would say nanoarrow counts :) it might even be easiest to go with nanoarrow/go/java for the initial set and punt (for now) on C++ dependency land.
> I think there wouldn't be much debate over what the canonical mapping of format strings to aliases would be.
I forgot about parameterized types here...using the C Data interface definitions would avoid having to maintain a third (fourth?) way to serialize/deserialize the unit/timezone/bitwidth. Or alternatively, defer to the IPC spec and flatten everything into one object (nanoarrow's C/SchemaView and Python/Schema do this). I personally think that wrapping the Field and DataType concepts into one object is helpful (but IPC/Integration testing JSON specifically avoid this).
Ah, parameterized types do make it annoying.
What do you mean by "flatten everything into one object"?
> What do you mean by "flatten everything into one object"?
In integration test JSON (which vaguely follows flatbuffers), you have:
{
  "name" : "name_of_the_field",
  "nullable" : true,
  "type" : {
    "name" : "duration",
    "unit" : "MILLISECOND"
  }
}
The C Data interface puts that all in one object conceptually:
{
  "format": "tDm",
  "name": "name_of_the_field",
  "flags": 2
}
And you could do something in the middle, although I'm not personally keen to come up with all the rules/names needed to define all these (e.g., "milliseconds"? "milis"? "ms"? "m"?).
{
  "format": "duration",
  "timeUnit": "milliseconds",
  "name": "name_of_the_field",
  "nullable": true
}
The least-effort option, in terms of both spec development and implementation, is (probably) a literal translation of the C Data interface. It might be worth re-floating that idea with the possibility of allowing future substitutions for readability in the event anybody ends up caring about that (e.g., "format" could accept a JSON object like {"duration": {"timeUnit": "ms"}}).
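A minimal Python sketch of how such a substitution could stay compatible with the literal translation: a string "format" is passed through as a C Data format string, while an object form is lowered to one. Only duration is handled, the unit spellings ("s", "ms", "us", "ns") are assumptions since the thread leaves them open, and `normalize_format` is a hypothetical name.

```python
# Assumed unit spellings -> the unit character used in "tD?" format strings.
DURATION_UNIT_CODES = {"s": "s", "ms": "m", "us": "u", "ns": "n"}


def normalize_format(fmt):
    """Lower either accepted form of "format" to a C Data format string."""
    if isinstance(fmt, str):
        # Already a C Data interface format string: the literal translation.
        return fmt
    if isinstance(fmt, dict) and len(fmt) == 1:
        type_name, params = next(iter(fmt.items()))
        if type_name == "duration" and params.get("timeUnit") in DURATION_UNIT_CODES:
            return "tD" + DURATION_UNIT_CODES[params["timeUnit"]]
    raise ValueError(f"unsupported format: {fmt!r}")


print(normalize_format("tDm"))
print(normalize_format({"duration": {"timeUnit": "ms"}}))
```

Since the readable form is lowered before anything else looks at it, readers that only understand plain format strings would not need to change.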
IMO C Data Interface is just representing a struct as a string for expedience, so it also follows the Flatbuffers format. I think putting time unit and everything at the top level might get messy. I'd say it'd be clearest to accept basically these:
"type": "tDm", // C Data Interface shorthand for expedience (always
"type": "struct", // Simple shorthand for non-parameterized types
"type": { // Be verbose for parameterized types
"name": "duration",
"time_unit": "MILLISECOND"
},
Or if we want to minimize things I think we can stick with only taking the C Data Interface shorthand to start with while letting people bikeshed the user-friendly shorthands in the background.
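As a rough illustration of accepting all three spellings of "type", the Python sketch below expands each of them to the verbose object form. The tables are deliberately tiny and hypothetical (a few simple names, a few C Data shorthands, duration only for parameterized types); `expand_type` is not an existing API.

```python
# Hypothetical, deliberately incomplete tables for illustration.
SIMPLE_NAMES = {"struct", "int32", "utf8"}
C_DATA_SIMPLE = {"+s": "struct", "i": "int32", "u": "utf8"}
C_DATA_DURATION_UNITS = {
    "s": "SECOND", "m": "MILLISECOND", "u": "MICROSECOND", "n": "NANOSECOND",
}


def expand_type(value):
    """Expand any of the three accepted "type" forms to the verbose object."""
    if isinstance(value, dict):
        return dict(value)  # already verbose; return a copy
    if value in SIMPLE_NAMES:
        return {"name": value}  # simple shorthand such as "struct"
    if value in C_DATA_SIMPLE:
        return {"name": C_DATA_SIMPLE[value]}  # C Data shorthand, no parameters
    if len(value) == 3 and value.startswith("tD") and value[2] in C_DATA_DURATION_UNITS:
        return {"name": "duration", "time_unit": C_DATA_DURATION_UNITS[value[2]]}
    raise ValueError(f"unsupported type: {value!r}")


for form in ("tDm", "struct", {"name": "duration", "time_unit": "MILLISECOND"}):
    print(expand_type(form))
```

Simple names are checked before C Data shorthands so that a name is never misread as a format string; a real spec would need to rule out collisions between the two namespaces explicitly.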
I think I would probably rather write and read "type": {"duration": {"time_unit": "ms"}} for parameterized types and "type": "struct" for non-parameterized types (GitHub Actions YAML-configuration style), but I'm personally a bit worn out on spec negotiation which is perhaps why I'm not feeling all that motivated to go beyond something that already exists (i.e., C Data interface literal translation). If we're going to accept other things, it seems like it might be nice to accept just one of them and have it be nice to read 🙂
Fair enough. Maybe let's just go with that then. I'll see if I can prototype something when I get a chance...
> something that already exists (i.e., C Data interface literal translation)
Go with this, to be clear
The "other things" can always be bikeshedded by people who are invested in that :)
This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.
Labelled Status: Stale-Warning for tracking.
I think we should keep this as a long-term objective and have a canonical place to handle this request, which keeps recurring over the years (and I suspect people simply see this and are turned away instead of commenting; I know I have received comments in private about various feature requests and have to work to convince people to publicly make the request).
Sounds good, I'll remove the label. Will continue discussion on the mailing list re approach here!