
ARROW-15961: [C++] Check nullability when validating fields on batches or struct arrays

Open kaoutherab opened this issue 2 years ago • 14 comments

kaoutherab avatar Mar 24 '22 14:03 kaoutherab

https://issues.apache.org/jira/browse/ARROW-15961

github-actions[bot] avatar Mar 24 '22 14:03 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Mar 24 '22 14:03 github-actions[bot]

Indeed, we need to handle any kind of nested array.

pitrou avatar Mar 28 '22 13:03 pitrou
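
For illustration, here is a minimal pyarrow sketch of one such nested case: a list whose item field is declared non-nullable while the underlying values contain a null. The comments describe what the proposed check is expected to flag, not current behavior.

import pyarrow as pa

# List type whose child ("item") field is declared non-nullable.
list_type = pa.list_(pa.field("item", pa.int64(), nullable=False))

# The flattened values contain a null, contradicting that declaration.
nested = pa.array([[1, 2], [None], [3]], type=list_type)

# With the check proposed in this issue, full validation should report
# the null hiding behind the non-nullable child field.
nested.validate(full=True)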

We do have to decide what to do about the Parquet tests, though…

lidavidm avatar Apr 01 '22 16:04 lidavidm

So the Parquet tests are now failing, example: https://github.com/apache/arrow/blob/623a15e7f7a45578733956714c8dddcc9f66f015/cpp/src/parquet/arrow/arrow_reader_writer_test.cc#L2816-L2830

It seems we (understandably) ignore the child array slots corresponding to top-level nulls, either when writing or when reading Parquet. Of course, if the child field is marked non-nullable, this becomes a problem now that we're validating that fact.

Example in Python:

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> ints = pa.array([1,2,3,4,5,6,7,8])
>>> struct = pa.StructArray.from_arrays([ints], fields=[pa.field("ints", pa.int64(), nullable=False)], mask=pa.array([True, True, False, False, True, True, True, True]))
>>> struct
<pyarrow.lib.StructArray object at 0x7fe5bc16a440>
-- is_valid:
  [
    false,
    false,
    true,
    true,
    false,
    false,
    false,
    false
  ]
-- child 0 type: int64
  [
    1,
    2,
    3,
    4,
    5,
    6,
    7,
    8
  ]
>>> table = pa.table([struct], names=["struct"])
>>> pq.write_table(table, "test.parquet")
>>> pq.read_table("test.parquet")
pyarrow.Table
struct: struct<ints: int64 not null>
  child 0, ints: int64 not null
----
struct: [
  -- is_valid:
    [
      false,
      false,
      true,
      true,
      false,
      false,
      false,
      false
    ]
  -- child 0 type: int64
    [
      null,
      null,
      3,
      4,
      null,
      null,
      null,
      null
    ]
]
>>> table
pyarrow.Table
struct: struct<ints: int64 not null>
  child 0, ints: int64 not null
----
struct: [
  -- is_valid:
    [
      false,
      false,
      true,
      true,
      false,
      false,
      false,
      false
    ]
  -- child 0 type: int64
    [
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8
    ]
]

It seems this is done on write, looking at the underlying column:

>>> pq.read_metadata("test.parquet").row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7fe5bc1d74c0>
  has_min_max: True
  min: 3
  max: 4
  null_count: 6
  distinct_count: 0
  num_values: 2
  physical_type: INT64
  logical_type: None
  converted_type (legacy): NONE

@pitrou @emkornfield am I understanding this right? It seems we either need to write out the values of child arrays behind null slots, or fill in some value on read?

Just for fun, Feather appears to handle this correctly:

>>> import pyarrow as pa
>>> import pyarrow.feather as pf
>>> ints = pa.array([1,2,3,4,5,6,7,8])
>>> struct = pa.StructArray.from_arrays([ints], fields=[pa.field("ints", pa.int64(), nullable=False)], mask=pa.array([True, True, False, False, True, True, True, True]))
>>> table = pa.table([struct], names=["struct"])
>>> pf.write_feather(table, "test.feather")
>>> pf.read_table("test.feather")
pyarrow.Table
struct: struct<ints: int64 not null>
  child 0, ints: int64 not null
----
struct: [
  -- is_valid:
    [
      false,
      false,
      true,
      true,
      false,
      false,
      false,
      false
    ]
  -- child 0 type: int64
    [
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8
    ]
]

lidavidm avatar Apr 01 '22 21:04 lidavidm

> @pitrou @emkornfield am I understanding this right? It seems we either need to write out the values of child arrays behind null slots, or fill in some value on read?

Yes, if it's required we will need to make sure the children are not null. I'm not surprised Feather works, because it just serializes the record batch in place; Parquet only stores non-null values. Probably the element of least surprise here would be to fill in values only for non-nullable columns (and leave the existing behavior for nullable ones).

emkornfield avatar Apr 01 '22 21:04 emkornfield
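
A rough pyarrow sketch of that fill-on-read idea applied to a struct array after it has been read back; the helper name and the zero fill value are hypothetical, not part of the actual fix:

import pyarrow as pa
import pyarrow.compute as pc

def fill_non_nullable_children(struct_arr):
    # Hypothetical helper: replace the nulls that come back behind
    # parent-null slots, but only in children whose field is declared
    # non-nullable; nullable children keep their existing behavior.
    struct_type = struct_arr.type
    new_children = []
    for i in range(struct_type.num_fields):
        field = struct_type.field(i)
        child = struct_arr.field(i)
        if not field.nullable and child.null_count > 0:
            # The fill value is arbitrary: it sits behind null parents
            # and is never observed through the struct.
            child = pc.fill_null(child, pa.scalar(0, type=field.type))
        new_children.append(child)
    mask = pc.invert(struct_arr.is_valid())  # True marks a null struct slot
    return pa.StructArray.from_arrays(
        new_children,
        fields=[struct_type.field(i) for i in range(struct_type.num_fields)],
        mask=mask,
    )

Applied to each chunk of the struct column read back from test.parquet above, this would put dummy values behind the null slots so that the child field's non-nullable declaration holds again.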

See https://github.com/kaoutherab/arrow/pull/1 for an attempt at this, though I think we're going to want to test each type specifically for this case.

lidavidm avatar Apr 04 '22 12:04 lidavidm
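
A rough sketch of the kind of per-type coverage mentioned above, in pyarrow; the cases and the helper below are illustrative, not taken from that PR:

import pyarrow as pa
import pyarrow.parquet as pq

def roundtrip_and_validate(table, path):
    # Write through Parquet, read back, and fully validate the result;
    # with the fix in place this should pass even though the schema
    # declares non-nullable children.
    pq.write_table(table, path)
    pq.read_table(path).validate(full=True)

ints = pa.array([1, 2, 3, 4])
mask = pa.array([False, True, False, True])  # True marks a null struct slot
non_null_child = pa.field("ints", pa.int64(), nullable=False)

cases = {
    "struct": pa.StructArray.from_arrays([ints], fields=[non_null_child],
                                         mask=mask),
    "list": pa.array([[1, 2], None, [3], None],
                     type=pa.list_(pa.field("item", pa.int64(), nullable=False))),
}
for name, arr in cases.items():
    roundtrip_and_validate(pa.table([arr], names=[name]), f"{name}.parquet")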

Thinking about this, I don't think we should merge it right before the release unless we can fix the underlying issue for Parquet.

emkornfield avatar Apr 04 '22 17:04 emkornfield

Hmm, fair. I filed ARROW-16116

lidavidm avatar Apr 04 '22 21:04 lidavidm

See #12829 for an attempt at fixing the Parquet issue properly

lidavidm avatar Apr 07 '22 17:04 lidavidm

@kaoutherab try rebasing this PR now (but omitting the last commit to avoid conflicts)

lidavidm avatar May 04 '22 13:05 lidavidm

Rebased.

pitrou avatar Jun 07 '22 15:06 pitrou

@kaoutherab Do you want to follow up on this?

pitrou avatar Jul 12 '22 14:07 pitrou

Sure, I'll follow up on this. Thanks!

kaoutherab avatar Jul 13 '22 00:07 kaoutherab

Closing because it has been untouched for a while; if it's still relevant, feel free to reopen and move it forward 👍

amol- avatar Mar 30 '23 17:03 amol-