[FEA] Consider default nullability setting when converting cudf data to arrow C Data interface
Is your feature request related to a problem? Please describe.
The implementation of to_arrow_schema.cpp sets the nullability of a column in the schema based on whether the cudf column contains nulls. This differs from Arrow's default behavior, where schemas are created as nullable by default. However, the right choice doesn't seem straightforward in all cases. In pylibcudf testing, we currently ignore field nullability by default because checking it proved problematic in https://github.com/rapidsai/cudf/pull/16548, but simply flipping the behavior of the above line to always mark columns nullable also did not resolve the discrepancy (i.e. setting check_field_nullability=True still causes tests to fail).
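For reference, here is a minimal pyarrow sketch of the two behaviors. The derive-from-data branch mirrors what to_arrow_schema.cpp is described as doing above; it is an illustration, not the actual implementation:

```python
import pyarrow as pa

# Arrow fields are nullable by default, regardless of whether the data
# actually contains nulls.
default_field = pa.field("x", pa.int64())
assert default_field.nullable  # True

# Non-nullability must be requested explicitly.
strict_field = pa.field("x", pa.int64(), nullable=False)
assert not strict_field.nullable

# The current libcudf behavior instead derives the flag from the data,
# roughly equivalent to checking the null count:
data = pa.array([1, 2, 3])  # no nulls
inferred_nullable = data.null_count > 0
assert not inferred_nullable  # diverges from Arrow's default of True
```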
Describe the solution you'd like
We should determine whether there is a different choice of default that would be more compatible with all of the cases that we need to support.
Describe alternatives you've considered
The status quo may also be acceptable. There don't appear to be many cases where the nullability of columns containing no nulls is relevant outside of our more stringent tests. However, https://github.com/rapidsai/cudf/pull/16590#pullrequestreview-2248058144 indicates that there may be edge cases where this is important and we might need to adjust for those, hence opening this issue.
FYI @zeroshade in case you have opinions on the right choice here.
My opinion is generally that, if at all possible, we want to be able to properly roundtrip an Arrow record batch/schema through libcudf and get the same nullability values back. That's the ideal. Given that nullability isn't tracked in libcudf tables or columns, I'm not sure what the best route is.
If we're picking a default, the safest default is for columns to always be nullable.
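To make the roundtrip ideal concrete, here's a hedged pyarrow sketch; `check_nullability_roundtrip` and the simulated roundtrip below are hypothetical stand-ins, not pylibcudf APIs:

```python
import pyarrow as pa

def check_nullability_roundtrip(original: pa.Schema, roundtripped: pa.Schema) -> bool:
    """Hypothetical check for the ideal: every field's nullable flag
    survives a trip through libcudf unchanged."""
    return all(a.nullable == b.nullable for a, b in zip(original, roundtripped))

original = pa.schema([
    pa.field("a", pa.int64(), nullable=False),
    pa.field("b", pa.float64()),  # nullable by default
])

# Because libcudf does not track nullability, a roundtrip under the
# "always nullable" default would come back as:
roundtripped = pa.schema([f.with_nullable(True) for f in original])
assert not check_nullability_roundtrip(original, roundtripped)  # "a" flipped
```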
cudf columns technically have a `nullable` property, but it's really viewed as an implementation detail of "does this column have a null mask allocated" right now. We could consider making that a more publicly meaningful and strict definition to better match Arrow if we thought that was important. I don't think we've hit any use case requiring it before, but round-tripping Arrow data is one case where it would be necessary. @davidwendt WDYT? My first instinct is that the change wouldn't be worth the work and may carry a nontrivial cost: allocating a mask for columns with no nulls more often than we would want.
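For illustration, Arrow itself permits the analogous state (a validity mask allocated but zero nulls), which is what a stricter definition would need to preserve. A pyarrow-only sketch, no cudf APIs involved:

```python
import pyarrow as pa

# Arrow-side analogue of "null mask allocated but no nulls present":
# build an int64 array with an explicit, all-valid validity buffer.
validity = pa.py_buffer(bytes([0b00000111]))  # 3 validity bits set
values = pa.py_buffer(b"".join(i.to_bytes(8, "little") for i in (1, 2, 3)))
arr = pa.Array.from_buffers(pa.int64(), 3, [validity, values], null_count=0)

assert arr.null_count == 0           # no nulls...
assert arr.buffers()[0] is not None  # ...but the mask exists

# Versus the usual construction, where no validity buffer is allocated:
plain = pa.array([1, 2, 3], type=pa.int64())
assert plain.buffers()[0] is None
```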
As a general rule, libcudf normally tries to return a column with the same nullability as the input column in any given API. This means that if the input column has a validity mask (whether or not the column contains any nulls), the output column will also contain a validity mask (again, whether or not there are actual nulls). I don't think we have been strict about it, and I'm not exactly sure where the rule originated, if not from Arrow. I would be in favor of creating a more consistent set of rules either way and fixing the code where necessary.