arrow [C++] Support for textual, JSON schema representation

Currently, Arrow has no textual representation for its schema that could serve the same purposes as JSON-Schema for JSON, the .proto files for Protobuf, etc. This issue is about adding such a text representation for an Arrow schema, to fill the same use cases that these textual representations fill for other data serialization formats.

The requirements for a text schema representation:

Data, not code (can be used without being run directly, unlike e.g. calls to the Python API to create a Schema object)
Readable by people who are experts in their own field (e.g. data scientists) but not Arrow experts, without needing the documentation side by side
Small modifications possible with no or light usage of the doc (e.g. changing a field from int32 to int64)
Writing new schemas from scratch possible with the doc for non-Arrow experts
Not tied to a particular version of Arrow & compatible between Arrow versions

And from a software engineering point of view, it would be very desirable for the implementation to not add another library dependency for Arrow (which already has many).

After discussion on the mailing list, the JSON representation for Flatbuffers data seemed the best candidate. It is a format supported by the Flatbuffers project for serializing Flatbuffers data in a human-readable form, suitable for inclusion under source control. There is also already functionality in Arrow to convert Schema objects to a Flatbuffers representation. This would meet all the requirements above, while requiring only a small amount of new Arrow code to implement.

This issue will add functions to Arrow to load and save a textual, JSON representation of an Arrow schema, by first converting it to a Flatbuffers object, and then using the Flatbuffers functionality to save/load such objects as JSON.

Reporter: Christian Hudon / @chrish42

Related issues:

[Java] Support for textual JSON schema representation (was: JSON representation of pojo.Schema is incompatible with flatbuffers JSON generated via C++ API) (supersedes)

PRs and other links:

Note: This issue was originally created as ARROW-8952. Please see the migration documentation for further details.

May 26 '20 18:05 asfimport

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per project policy. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

Aug 26 '22 16:08 asfimport

It appears that while there was desire to use the Flatbuffers JSON representation, most implementations do not support this:

https://flatbuffers.dev/languages/go/#text-parsing https://flatbuffers.dev/languages/java/#text-parsing https://flatbuffers.dev/languages/javascript/#text-parsing-flatbuffers-in-javascript

That would effectively make it a C++-only option which defeats the point.

Jan 20 '25 11:01 lidavidm

A JSONified C Data interface is an option, since the rules for the C Data interface are very clear and are guaranteed. If we're willing to make some new rules about defaults, it's pretty compact:

{"format": "+s", "children": [{"format": "i", "name": "col1"}]}

I'd love this for testing!
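
A minimal sketch of what reading and writing that compact form could look like, using only the Python standard library. The `SchemaNode` class and the `node_from_obj`/`node_to_obj` helpers are hypothetical names, and the defaults chosen here (empty name, nullable flag set, no children) are assumptions for illustration, not rules anyone has agreed on:

```python
import json
from dataclasses import dataclass, field
from typing import List

# Bit value defined by the Arrow C Data Interface (ARROW_FLAG_NULLABLE).
ARROW_FLAG_NULLABLE = 2


@dataclass
class SchemaNode:
    """One ArrowSchema-like node. The defaults are assumptions, not settled rules."""
    format: str
    name: str = ""
    flags: int = ARROW_FLAG_NULLABLE
    children: List["SchemaNode"] = field(default_factory=list)


def node_from_obj(obj: dict) -> SchemaNode:
    """Build a node from parsed JSON, filling in the assumed defaults."""
    if "format" not in obj:
        raise ValueError("every node needs a 'format' string")
    return SchemaNode(
        format=obj["format"],
        name=obj.get("name", ""),
        flags=obj.get("flags", ARROW_FLAG_NULLABLE),
        children=[node_from_obj(child) for child in obj.get("children", [])],
    )


def node_to_obj(node: SchemaNode) -> dict:
    """Convert back to a JSON-ready dict, omitting values equal to the defaults."""
    obj = {"format": node.format}
    if node.name:
        obj["name"] = node.name
    if node.flags != ARROW_FLAG_NULLABLE:
        obj["flags"] = node.flags
    if node.children:
        obj["children"] = [node_to_obj(child) for child in node.children]
    return obj


text = '{"format": "+s", "children": [{"format": "i", "name": "col1"}]}'
schema = node_from_obj(json.loads(text))
round_tripped = json.dumps(node_to_obj(schema))
```

Because defaults are dropped on output, the example above round-trips to the same compact string it was parsed from.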

Jan 20 '25 20:01 paleolimbot

Digging through the previous discussion:

Jacques proposed that as well: https://lists.apache.org/thread/kqrn9o14169s91n8p72gvbryzqql29d1
But I think users and Micah worried it's not very readable: https://lists.apache.org/thread/w0tnvqncz0x569c42j5t4q2mwkw3tq61

I'd guess the main stumbling block is the format string but maybe we can support a single set of aliases (both "i" and "int32" or something)
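
As a rough sketch of the alias idea: a single lookup table could accept either spelling on input and pick the readable one on output. The alias names below are hypothetical (nothing in this thread fixes them); the right-hand sides are the format strings from the C Data interface for non-parameterized types:

```python
# Hypothetical alias names mapped to C Data Interface format strings.
ALIASES = {
    "null": "n", "bool": "b",
    "int8": "c", "uint8": "C", "int16": "s", "uint16": "S",
    "int32": "i", "uint32": "I", "int64": "l", "uint64": "L",
    "float16": "e", "float32": "f", "float64": "g",
    "binary": "z", "large_binary": "Z",
    "utf8": "u", "large_utf8": "U",
}

# Reverse table, used when writing a schema in the readable spelling.
FORMAT_TO_ALIAS = {fmt: alias for alias, fmt in ALIASES.items()}


def to_format(type_string: str) -> str:
    """Accept either an alias or a raw format string; return the format string."""
    return ALIASES.get(type_string, type_string)


def to_alias(format_string: str) -> str:
    """Return the alias if one exists, otherwise the format string unchanged."""
    return FORMAT_TO_ALIAS.get(format_string, format_string)
```

Parameterized formats such as "tDm" simply pass through unchanged in this sketch, since it defines no aliases for them.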

Jan 20 '25 23:01 lidavidm

It's a good point...I think there wouldn't be much debate over what the canonical mapping of format strings to aliases would be. Some thoughts:

Other open questions would be whether to support children as a mapping or allow shortening of types where everything is defaulted except the format (e.g., {"col_name": "int32"} is very nice to read!).
Not sure if anybody would complain about the "default" being non-nullable (I think it's the default in all language implementations for a field).
We probably want the flags to be verbose (e.g., ["non_nullable", "dictionary_ordered"]).
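
For the verbose flags idea, a small sketch of converting between the integer bitfield and a list of names. The bit values are the ones the C Data interface defines (dictionary ordered = 1, nullable = 2, map keys sorted = 4); the lowercase names and both helper functions are assumptions for illustration. Note that this sketch uses positive names mirroring the C constants rather than a "non_nullable" name:

```python
# Bit values from the Arrow C Data Interface; the lowercase names are hypothetical.
FLAG_BITS = {
    "dictionary_ordered": 1,
    "nullable": 2,
    "map_keys_sorted": 4,
}


def flags_to_names(flags: int) -> list:
    """Expand an integer bitfield into a list of verbose flag names."""
    known = sum(FLAG_BITS.values())
    if flags & ~known:
        raise ValueError(f"unknown flag bits in {flags}")
    return [name for name, bit in FLAG_BITS.items() if flags & bit]


def names_to_flags(names) -> int:
    """Collapse a list of verbose flag names back into the integer bitfield."""
    flags = 0
    for name in names:
        if name not in FLAG_BITS:
            raise ValueError(f"unknown flag name: {name!r}")
        flags |= FLAG_BITS[name]
    return flags
```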

If nanoarrow counts for a second implementation I'm happy to draft an implementation 🙂

Jan 21 '25 01:01 paleolimbot

I'd guess "nullable": true would be more natural to people, so maybe also map flags to toplevel boolean fields?

children as a mapping may be convenient but I think I'd want to leave that out, at least initially (I think given that it's an object vs an array it would be straightforward to add it later)

I would say nanoarrow counts :) it might even be easiest to go with nanoarrow/go/java for the initial set and punt (for now) on C++ dependency land.

Jan 21 '25 01:01 lidavidm

I think there wouldn't be much debate over what the canonical mapping of format strings to aliases would be.

I forgot about parameterized types here...using the C Data interface definitions would avoid having to maintain a third (fourth?) way to serialize/deserialize the unit/timezone/bitwidth. Or alternatively, defer to the IPC spec and flatten everything into one object (nanoarrow's C/SchemaView and Python/Schema do this). I personally think that wrapping the Field and DataType concepts into one object is helpful (but the IPC and integration testing JSON formats specifically avoid this).

Jan 21 '25 05:01 paleolimbot

Ah, parameterized types do make it annoying.

What do you mean by "flatten everything into one object"?

Jan 21 '25 06:01 lidavidm

What do you mean by "flatten everything into one object"?

In integration test JSON (which vaguely follows flatbuffers), you have:

{
  "name" : "name_of_the_field",
  "nullable" : true,
  "type" : {
    "name" : "duration",
    "unit" : "MILLISECOND"
  }
}

The C Data interface puts that all in one object conceptually:

{
  "format": "tDm",
  "name": "name_of_the_field",
  "flags": 2
}

And you could do something in the middle, although I'm not personally keen to come up with all the rules/names needed to define all these (e.g., "milliseconds"? "milis"? "ms"? "m"?).

{
  "format": "duration",
  "timeUnit": "milliseconds",
  "name": "name_of_the_field",
  "nullable": true
}

The least-effort option, in terms of both spec development and implementation, is (probably) a literal translation of the C Data interface. It might be worth re-floating that idea with the possibility of allowing future substitutions for readability in the event anybody ends up caring about that (e.g., "format" could accept a JSON object like {"duration": {"timeUnit": "ms"}}).
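
To make the difference between the nested integration-test form and the flat C Data form concrete, here is a small sketch that converts one into the other for the duration example above. The `flatten_field` helper is hypothetical and only handles duration; the unit letters follow the C Data interface format strings (tDs, tDm, tDu, tDn), and treating a missing "nullable" key as true is an assumption:

```python
# Value defined by the Arrow C Data Interface (ARROW_FLAG_NULLABLE).
ARROW_FLAG_NULLABLE = 2

# Integration-test unit names mapped to the unit letter used in "tD?" format strings.
DURATION_UNIT_CODES = {
    "SECOND": "s",
    "MILLISECOND": "m",
    "MICROSECOND": "u",
    "NANOSECOND": "n",
}


def flatten_field(field: dict) -> dict:
    """Turn a nested field (name + nullable + type object) into one flat object."""
    type_obj = field["type"]
    if type_obj["name"] != "duration":
        raise NotImplementedError("this sketch only handles duration")
    return {
        "format": "tD" + DURATION_UNIT_CODES[type_obj["unit"]],
        "name": field["name"],
        # Assumption: a missing "nullable" key means nullable.
        "flags": ARROW_FLAG_NULLABLE if field.get("nullable", True) else 0,
    }


nested = {
    "name": "name_of_the_field",
    "nullable": True,
    "type": {"name": "duration", "unit": "MILLISECOND"},
}
flat = flatten_field(nested)
```

Here `flat` carries the same information as the nested field, with the unit folded into the format string and nullability folded into the flags.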

Jan 21 '25 21:01 paleolimbot

IMO C Data Interface is just representing a struct as a string for expedience, so it also follows the Flatbuffers format. I think putting time unit and everything at the top level might get messy. I'd say it'd be clearest to accept basically these:

"type": "tDm",     // C Data Interface shorthand for expedience (always 
"type": "struct",  // Simple shorthand for non-parameterized types
"type": {          // Be verbose for parameterized types
  "name": "duration",
  "time_unit": "MILLISECOND"
},

Or if we want to minimize things I think we can stick with only taking the C Data Interface shorthand to start with while letting people bikeshed the user-friendly shorthands in the background.
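
A sketch of how a reader might accept all three spellings of "type" and normalize them to a C Data interface format string. Everything here is an assumption for illustration: the `type_to_format` name, the tiny shorthand table, and the rule that a string not found in the shorthand table is treated as a raw format string. Only duration is handled for the verbose form:

```python
# Hypothetical shorthand names for a few non-parameterized types.
SIMPLE_NAMES = {"struct": "+s", "int32": "i", "utf8": "u"}

# Verbose unit names mapped to the unit letter used in "tD?" format strings.
DURATION_UNITS = {
    "SECOND": "s",
    "MILLISECOND": "m",
    "MICROSECOND": "u",
    "NANOSECOND": "n",
}


def type_to_format(type_value) -> str:
    """Normalize any of the three accepted "type" spellings to a format string."""
    if isinstance(type_value, str):
        # Shorthand name if known, otherwise assume it is already a format string.
        return SIMPLE_NAMES.get(type_value, type_value)
    if isinstance(type_value, dict) and type_value.get("name") == "duration":
        return "tD" + DURATION_UNITS[type_value["time_unit"]]
    raise ValueError(f"unsupported type spelling: {type_value!r}")
```

Dispatching on whether the JSON value is a string or an object keeps the three forms unambiguous, as long as no shorthand name collides with a format string.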

Jan 21 '25 23:01 lidavidm

I think I would probably rather write and read "type": {"duration": {"time_unit": "ms"}} for parameterized types and "type": "struct" for non-parameterized types (GitHub Actions yaml-configuration style), but I'm personally a bit worn out on spec negotiation, which is perhaps why I'm not feeling all that motivated to go beyond something that already exists (i.e., C Data interface literal translation). If we're going to accept other things, it seems like it might be nice to accept just one of them and have it be nice to read 🙂

Jan 22 '25 20:01 paleolimbot

Fair enough. Maybe let's just go with that then. I'll see if I can prototype something when I get a chance...

Jan 23 '25 03:01 lidavidm

something that already exists (i.e., C Data interface literal translation)

Go with this, to be clear

The "other things" can always be bikeshedded by people who are invested in that :)

Jan 23 '25 03:01 lidavidm

This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.

Labelled Status: Stale-Warning for tracking.

Jun 20 '25 17:06 thisisnic

I think we should keep this open as a long-term objective and as a canonical place to handle this request, which keeps recurring over the years (and I suspect people simply see this and are turned away instead of commenting; I know I have received comments in private about various feature requests and have to work to convince people to publicly make the request).

Jun 21 '25 11:06 lidavidm

Sounds good, I'll remove the label. Will continue discussion on the mailing list re: approach here!

Jun 21 '25 11:06 thisisnic