arrow ARROW-17289: [C++] Add type category membership checks

See https://issues.apache.org/jira/browse/ARROW-17289

Aug 02 '22 20:08 rtpsw

https://issues.apache.org/jira/browse/ARROW-17289

Aug 02 '22 20:08 github-actions[bot]

cc @icexelloss, @westonpace

Aug 02 '22 21:08 rtpsw

It sounds like it would be better to document the existing predicates, rather than add new ones with different names.

Aug 03 '22 12:08 pitrou

It sounds like it would be better to document the existing predicates, rather than add new ones with different names.

The existing set of predicates is both confusing and incomplete, so I'd argue for both documenting and adding predicates. I'm not sure what to do with the names, considering backwards compatibility, though. Suppose we agree that a predicate corresponding to PrimitiveTypes should be added, since is_primitive is not the one, what should be the name of this predicate? There should be some naming convention here; perhaps *Types() should have a corresponding check_*(type) predicate.

Aug 03 '22 18:08 rtpsw

I added docs that clarify the meaning of and correspondence between *Types() and Is*Type() functions.

Aug 04 '22 09:08 rtpsw

The existing set of predicates is both confusing and incomplete, so I'd argue for both documenting and adding predicates

Hmm... can you point out the causes for confusion? Perhaps there's a way to improve that.

Same for incompleteness: is the set of Type::type predicates incomplete? if so, we should add the missing and desired predicates.

Suppose we agree that a predicate corresponding to PrimitiveTypes should be added, since is_primitive is not the one

Why is is_primitive not the one? It seems like it is returning the desired information, is it not? (is it missing some types perhaps?)

Aug 04 '22 09:08 pitrou

Hmm... can you point out the causes for confusion? Perhaps there's a way to improve that.

Same for incompleteness: is the set of Type::type predicates incomplete? if so, we should add the missing and desired predicates.

Sure. Check out this commit:

is_numeric is missing and I added it.
is_primitive exists but does not correspond to PrimitiveTypes(), which is confusing. I added is_primitive_like which does, though I'm not sold on the name.
I added the missing is_binary, is_string, is_temporal, and is_interval.
There are other is_*() primitives that do not correspond to any *Types() function - is_base_binary_like is one example.

So, part of the confusion is the unexpected non-correspondence of is_primitive and part is due to missing and extra predicates with respect to the set of *Types() functions. In this PR, I extended the set of is_*() predicates to cover the correspondences and added Is*Type() predicates that correspond to *Types() functions as expected, while maintaining backwards-compatibility, and I added docs to clarify the meaning and correspondences.

Why is is_primitive not the one? It seems like it is returning the desired information, is it not? (is it missing some types perhaps?)

As compared to PrimitiveTypes() (and is_primitive_like() that I added), is_primitive is missing NA, BINARY, STRING, LARGE_BINARY and LARGE_STRING while it adds DURATION, INTERVAL_MONTHS, INTERVAL_MONTH_DAY_NANO, and INTERVAL_DAY_TIME.

Aug 04 '22 10:08 rtpsw

Ok, so there are two different areas here:

complete and/or fix the set of predicates operating on a Type::type parameter
add a separate set of predicates operating on a const DataType& parameter

I think the former (case number 1 above) is the most interesting to solve, because it directly affects readability.

Would you like to defer number 2 to a separate JIRA (and potentially PR, if people agree it's a worthwhile thing to do)?

Aug 04 '22 12:08 pitrou

Also, for the record, functions such as PrimitiveTypes() were added for testing originally, and they were exposed publicly because there didn't seem to be a strong reason against it (the main use case is to be able to write testing loops or parameterized tests e.g. for (const auto& type : PrimitiveTypes()) ...).

Aug 04 '22 12:08 pitrou

Now to answer the problems with the current predicates:

+1 to adding is_numeric, is_binary, is_string, is_temporal and is_interval
for is_primitive: is "primitive" any type that's not nested nested/compound; whether or Type::NA should be included is a bit contentious, but otherwise all temporals, numerics (including decimals) and binary-likes should probably be included
I think is_primitive_like should probably be avoided; granted, we have other is_*_like functions, but they turn out to be confusing, so we should probably not reuse that pattern
therefore I would be in favor of changing is_primitive as outlined above (meaning a slight compatibility break unfortunately)

Also note that the is_* predicates can handle parameterized types while the *Types() functions by essence cannot really handle them fully if there is a large or infinite number of parametrizations possible.

Does that make sense?

Also @lidavidm what do you think?

Aug 04 '22 12:08 pitrou

Note #13753 removes the *Types() anyways.

I'm -1 on the PR as is, but I think we should add the missing is_*(Type::type id) functions, and I think it's fine to also add overloads that take const DataType&. (This is probably overkill, but those who rely on IDE autocompletion may also appreciate DataType::is_integer etc.)

"primitive" is hard to define and it depends on what the user is after. I might favor a set of more specific predicates like the ones we have for template metaprogramming, e.g. has_c_type (which I think is roughly what the proposed is_primitive is: NA and nested types don't have a C type, but binary-likes do).

Aug 04 '22 12:08 lidavidm

OK, so I'll avoid changes around the "primitive" functions and fix this PR to only:

add the missing is_numeric, is_binary, is_string, is_temporal and is_interval predicates
add overloads (same names) for all is_* predicates that accept const DataType
test these predicates

Aug 04 '22 12:08 rtpsw

Anyone knows what the errors in the Dev jobs mean?

Aug 04 '22 18:08 rtpsw

I think you can ignore those, some files need to be bumped post-release

Aug 04 '22 18:08 lidavidm

@pitrou, is this good to go?

Aug 15 '22 18:08 rtpsw

Sorry for the delay @rtpsw . I rebased from master and did a couple of very minor changes.

Aug 17 '22 13:08 pitrou

Will wait for CI now...

Aug 17 '22 13:08 pitrou

Benchmark runs are scheduled for baseline = cef68940c68ac2f3167cc6cafe5eefdd9f7fab79 and contender = f0688d01c465417e6f3515f9344154ad6f47ba22. f0688d01c465417e6f3515f9344154ad6f47ba22 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:25.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:1.46% :arrow_up:0.44%] test-mac-arm [Failed] ursa-i9-9960x [Failed] ursa-thinkcentre-m75q Buildkite builds: [Finished] f0688d01 ec2-t3-xlarge-us-east-2 [Failed] f0688d01 test-mac-arm [Failed] f0688d01 ursa-i9-9960x [Finished] f0688d01 ursa-thinkcentre-m75q [Finished] cef68940 ec2-t3-xlarge-us-east-2 [Finished] cef68940 test-mac-arm [Failed] cef68940 ursa-i9-9960x [Failed] cef68940 ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Aug 17 '22 23:08 ursabot