[Docs] List the logical types in Columnar.rst for searchability
Describe the enhancement requested
See https://github.com/apache/arrow-nanoarrow/pull/74#pullrequestreview-1195958840
A few comments...it may be worth following this up with a section on the Map in https://arrow.apache.org/docs/format/Columnar.html ...other than a footnote in the C Data interface spec, I can't find any reference to the memory layout for a Map (or for the datetime interval types, which I remember having to look up in the C++ implementation)
Columnar.rst doesn't describe the logical types, since Schema.fbs is considered authoritative. But it is probably worth at least listing the types in this document so that they can be easily searched for (and possibly even summarize those types).
Component(s)
Documentation
Thanks for opening this! In particular David's comment:
Each list item represents a dictionary, where the Struct values of each list entry are key-value pairs.
was particularly helpful.
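To make the quoted description concrete, here is a minimal sketch in plain Python (no Arrow library involved) of a Map column viewed as a list of key-value structs. The names `offsets`, `keys`, `values`, and `decode_map_column` are hypothetical, Python lists stand in for Arrow buffers, and validity bitmaps are left out for brevity.

```python
# Sketch only: a Map<string, int> column viewed as List<Struct<key, value>>.
# Python lists stand in for Arrow buffers; all names here are illustrative.

offsets = [0, 2, 2, 3]   # row i covers child entries offsets[i]:offsets[i + 1]
keys = ["a", "b", "c"]   # struct child 0: the keys of every entry, flattened
values = [1, 2, 3]       # struct child 1: the values of every entry, flattened


def decode_map_column(offsets, keys, values):
    """Rebuild one Python dict per row from the list-of-struct view."""
    rows = []
    for start, end in zip(offsets, offsets[1:]):
        # Each list entry is a (key, value) struct; a row is a run of entries.
        rows.append(dict(zip(keys[start:end], values[start:end])))
    return rows


print(decode_map_column(offsets, keys, values))
# [{'a': 1, 'b': 2}, {}, {'c': 3}]
```

The second row is an empty map because its start and end offsets are equal, which is the same convention a variable-size list uses for an empty list.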
The other types that I could not find in the columnar spec when I was looking for them were the interval types, which might be worth mentioning.
Interval types are also logical types, so yeah, a listing of all the logical types might be useful.
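For anyone searching for the interval layouts in the meantime, the sketch below packs one element of each interval unit with Python's `struct` module. The field widths follow my reading of the comments in Schema.fbs, and little-endian byte order is assumed; treat both as assumptions to check against Schema.fbs rather than as a statement of the spec. `INTERVAL_FORMATS`, `pack_interval`, and `interval_byte_width` are hypothetical names.

```python
import struct

# Assumed per-element encodings (little-endian, no padding):
#   YEAR_MONTH:     one int32  (months)
#   DAY_TIME:       two int32  (days, milliseconds)
#   MONTH_DAY_NANO: int32 months, int32 days, int64 nanoseconds
INTERVAL_FORMATS = {
    "YEAR_MONTH": "<i",
    "DAY_TIME": "<ii",
    "MONTH_DAY_NANO": "<iiq",
}


def pack_interval(unit, *fields):
    """Encode a single interval element as raw bytes."""
    return struct.pack(INTERVAL_FORMATS[unit], *fields)


def interval_byte_width(unit):
    """Size in bytes of one element of the given interval unit."""
    return struct.calcsize(INTERVAL_FORMATS[unit])


for unit in INTERVAL_FORMATS:
    print(unit, interval_byte_width(unit))
```

Under these assumptions every interval unit is a fixed-width element, so a column of intervals is a single contiguous data buffer plus an optional validity bitmap.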
Arrow has no notion of logical types. But, yes, making the Columnar format spec more readable would be useful.
Ok, unfortunately, Columnar.rst does use the wording "logical type", which is contradicted by the fact that there's no separate set of "physical types" (only layouts). The whole thing has always been confusing to me.
I think it would be nice if all the types were in Columnar.rst (with the corresponding layouts and any parameters). These don't change frequently and so I don't think that maintaining a sync between the .fbs file and the documentation will be prohibitively complicated?
I agree it would be nice, at least as a synthetic table.
Semi-related issue: https://github.com/apache/arrow/issues/33958
I'd like to see this part of the format docs improved and would be happy to submit a PR for review. I read the comments above and in other issues and it seems like there's:
1. Still some discussion to be had about avoiding "logical" vs. "physical" in favor of "types" and "layouts" and possibly updating the format docs comprehensively
2. Consensus around listing all types (i.e. everything in https://github.com/apache/arrow/blob/505a2e4519ba8d4983da6caa1c56ecae8d063e84/format/Schema.fbs#L407C5-L430) in a table with columns for layouts and parameters, where appropriate.
I could start with a PR for (2). Looking at the high-level sections in https://arrow.apache.org/docs/format/Columnar.html, I think a comprehensive table of types would fit best right after Terminology and before Physical Memory Layout, since people would generally want to know the types before their physical layouts. Does this sound reasonable?
I would be excited to see that PR! I think that a type listing at the start ("this is what Arrow can do") followed by layouts ("this is what it looks like in memory") makes a lot of sense and would have helped me a lot when I was trying to implement them. I like types + layouts rather than logical + physical but I don't have strong feelings about it as long as it's consistent.
The "logical" vs. "physical" distinction is actually extra confusing, because nowadays we do have logical types (aka semantic variations of existing types), but they are called... extension types.
Gotcha. That brings up another point... should the newly added Tensor types be in the aforementioned table of types? I'd think yes.
I don't think so. They're extension types, not part of the columnar spec itself. You may instead add a seealso after the table to point to https://arrow.apache.org/docs/format/CanonicalExtensions.html
Okay, that makes sense.
- Still some discussion to be had about avoiding "logical" vs. "physical" in favor of "types" and "layouts" and possibly updating the format docs comprehensively
I opened a dedicated issue for this, as it requires a more comprehensive update of the docs than just adding a table of all (logical) types: https://github.com/apache/arrow/issues/41691
This was fixed in #41958, after inspiration from @AlenkaF's own PR. Thanks!