cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Check column type equality, handling nested types correctly.

Open bdice opened this issue 1 year ago • 6 comments

Description

Addresses most of #14527. See also #14494.

This PR expands the use of cudf::column_types_equal(lhs, rhs) and adds new methods cudf::column_scalar_types_equal, cudf::scalar_types_equal, and cudf::all_column_types_equal.

These type check functions are now employed throughout the code base instead of raw checks like a.type() == b.type() because those do not correctly handle nested types.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

bdice avatar Nov 30 '23 00:11 bdice

@wence- Is this the kind of change you had in mind? I looked over the list of files you put in #14527 and made changes for most of them.

I'd like your thoughts on what level of testing we need to merge this PR. Should we test every algorithm with mismatching inputs that have the same top-level type, like list<int> and list<float>? I haven't put in any time on test writing yet, but that seems like what we need to give it proper coverage.

bdice avatar Nov 30 '23 00:11 bdice

@wence- Is this the kind of change you had in mind? I looked over the list of files you put in #14527 and made changes for most of them.

I think if we don't want to change the structure of the data_type object, then this looks basically the same as what I was thinking of.

I'd like your thoughts on what level of testing we need to merge this PR. Should we test every algorithm with mismatching inputs that have the same top-level type, like list<int> and list<float>? I haven't put in any time on test writing yet, but that seems like what we need to give it proper coverage.

I think we want some unit tests for the equality comparisons first. Since the data-type description is an inductive type, the tests can be "exhaustive" by checking the base cases and the inductive cases, and then convincing ourselves we've implemented the inductive definition correctly. We should also then test that the table-equality comparators work correctly.

In terms of then testing the advertised API surface, can we collect which functions advertise that they throw on mismatched types and then consider the testing strategy.

One thing I wonder is whether it should be the case that any such type-equality checking happens in the detail API, or if those functions have as (implicit) preconditions that they are passed valid data. We can't encode any such requirement in the type system because columns are type-erased: the best I can think of is that detail APIs that require matching types gain an argument that carries a token indicating whether equality constraints have been satisfied by the caller. But that might not be very ergonomic. The thing I would like to avoid is redoing equality checks when the caller of some function has already performed them.

wence- avatar Nov 30 '23 12:11 wence-

One extra caveat I'd like to consider here. We are planning for STRING type columns to contain either INT32 or INT64 offsets children. In this case, I think we'd want two STRING type columns to be equal (in type only) regardless of their children types. So I would propose a special case for STRING types in the type_checks functions to ignore STRING children.

davidwendt avatar Dec 07 '23 19:12 davidwendt

@davidwendt I think the current implementation should ignore the offset children of strings columns, which will provide the behavior you want.

@wence- I haven't gotten to explore the inductive tests that you're seeking yet. Are there any places in the cudf code base that do this kind of thing generating nested types already? It seems like it won't be easy to implement.

bdice avatar Mar 27 '24 23:03 bdice

These type check functions are now employed throughout the code base instead of raw checks like a.type() == b.type()

Maybe out of the scope of the current PR but could we add a check in the data type == operator to prevent such misuse? My first immature idea was to add something like CUDF_EXPECTS(not is_nested(col.type)) to prevent using == for nested types. Then I realized we still need == to determine if the top-level types are equal (like LIST == LIST) or not. Any thoughts?

PointKernel avatar Apr 03 '24 23:04 PointKernel

I haven't gotten to explore the inductive tests that you're seeking yet. Are there any places in the cudf code base that do this kind of thing generating nested types already? It seems like it won't be easy to implement.

I think it's OK not to do this here.

wence- avatar Apr 04 '24 15:04 wence-

@wence- I applied most of your suggested changes. Feel free to have a final look, if you'd like. I'm aiming to merge in the next 2 days to avoid conflicts and get this in before 24.06 burndown begins.

bdice avatar May 01 '24 20:05 bdice

/merge

bdice avatar May 02 '24 17:05 bdice

Thanks @davidwendt @wence- @PointKernel for the reviews!

bdice avatar May 02 '24 17:05 bdice