arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-18004: [C++] ExecBatch conversion to RecordBatch may go out of bounds

Open rtpsw opened this issue 3 years ago • 16 comments
trafficstars

See https://issues.apache.org/jira/browse/ARROW-18004

rtpsw avatar Oct 12 '22 13:10 rtpsw

cc @icexelloss @westonpace

rtpsw avatar Oct 12 '22 13:10 rtpsw

Same problem as #14347: this is basically adding a partial, incomplete guard for a condition the caller is supposed to check for themselves.

pitrou avatar Oct 12 '22 13:10 pitrou

If we wanted to check that the ExecBatch values correspond to the Schema, we should check for schema equality on each of the Datum values. That can unfortunately be expensive.

pitrou avatar Oct 12 '22 13:10 pitrou

Same problem as #14347: this is basically adding a partial, incomplete guard for a condition the caller is supposed to check for themselves.

If we wanted to check that the ExecBatch values corresponding to the Schema, we should check for schema equality on each of the Datum values. That can unfortunately be expensive.

To be sure, I understand that by partial incomplete guard you mean not checking for full schema equality. My goal here (and in the PR you linked) is not to fully guard, and I understand the cost of trying to do so. My goal is to avoid runtime crashes that are hard to debug when the cost of doing so is small. IMHO, this PR is a good tradeoff because it uses a cheap check to transform a crash failure to a raised error, which is easier to debug. I run into such crash failures while developing test cases. I agree that when a test case is correct then the failure cases do not occur, yet during development test cases are frequently not correct. Another reason is that developers of large apps would generally prefer an error the app can handle, and perhaps raise to its user, over crashing the app.

rtpsw avatar Oct 12 '22 13:10 rtpsw

Another reason is that developers of large apps would generally prefer an error the app can handle, and perhaps raise to its user, over crashing the app.

I agree with this, but it is actually misleading here, because we're only checking a single condition and otherwise let errors crash silently (or corrupt memory etc.).

pitrou avatar Oct 12 '22 13:10 pitrou

Edit: if this avoids an immediate crash and allows you to see Validate failing afterwards, then I would be ok with this (but let's not silently truncate output columns either).

pitrou avatar Oct 12 '22 13:10 pitrou

(but let's not silently truncate output columns either).

Before coding the condition if (static_cast<size_t>(schema->num_fields()) > values.size()) with > I tried coding it with != but got test failures elsewhere:

[ RUN      ] ExecPlanExecution.StressSourceOrderBy
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/plan_test.cc:739: Failure
Failed
'_error_or_value56.status()' failed with Invalid: mismatching schema size
Google Test trace:
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/plan_test.cc:720: single threaded
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/plan_test.cc:717: unslowed
[  FAILED  ] ExecPlanExecution.StressSourceOrderBy (1 ms)

and

[ RUN      ] Substrait.BasicPlanRoundTrippingEndToEnd
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:58: Plan was destroyed before finishing
/mnt/user1/tscontract/github/rtpsw/arrow/cpp/src/arrow/engine/substrait/serde_test.cc:2083: Failure
Failed
'_error_or_value131.status()' failed with Invalid: mismatching schema size
[  FAILED  ] Substrait.BasicPlanRoundTrippingEndToEnd (1 ms)

If we'd like to fix these test failures, I'd suggest doing so separately.

rtpsw avatar Oct 12 '22 14:10 rtpsw

@westonpace Are the failures mentioned above (mismatching schema size) legitimate?

pitrou avatar Oct 12 '22 14:10 pitrou

Edit: if this avoids an immediate crash and allows you to see Validate failing afterwards, then I would be ok with this

With the current PR code, exec_batch.ToRecordBatch(reject_schema) just raises an error, i.e., it does not let an invalid record batch get created. Should I add some other check, and which?

rtpsw avatar Oct 12 '22 14:10 rtpsw

@westonpace Are the failures mentioned above (mismatching schema size) legitimate?

Don't know yet; I'll need to examine. If I manage to do so quickly, I'll report here.

rtpsw avatar Oct 12 '22 14:10 rtpsw

I agree with this, but it is actually misleading here, because we're only checking a single condition and otherwise let errors crash silently (or corrupt memory etc.).

I agree the leniency of the checks, as compared to validation, could be unexpected to some developers and therefore misleading. I'd suggest adding doc strings to clarify this. OTOH, I'm not sure which crash condition you mean could still occur. Since the current PR code check the number of values and their types before applying type-specific operations, I believe it can only crash on the DCHECK(false) line. I could fix this to return an error.

rtpsw avatar Oct 12 '22 14:10 rtpsw

Ok, it looks like Acero relies on being able to silently truncate the number of fields in that method. Which is quite unfortunate.

pitrou avatar Oct 12 '22 15:10 pitrou

https://issues.apache.org/jira/browse/ARROW-18004

github-actions[bot] avatar Oct 12 '22 16:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Oct 12 '22 16:10 github-actions[bot]

This comment applies to the commit I just pushed.

rtpsw avatar Oct 12 '22 16:10 rtpsw

@pitrou, please let me know which checks to remove from the test.

rtpsw avatar Oct 12 '22 18:10 rtpsw

@pitrou, please let me know which checks to remove from the test.

In the interest of moving this forward before 10.0.0, I pushed some changes myself.

pitrou avatar Oct 13 '22 15:10 pitrou

Thanks @pitrou !

rtpsw avatar Oct 13 '22 15:10 rtpsw

CI passed on @rtpsw 's fork.

pitrou avatar Oct 13 '22 17:10 pitrou

Benchmark runs are scheduled for baseline = f3327d2c37c375abdcd6299d4ea2cdbdcbc4cb62 and contender = b5b41ccfb05bfcacc674167c50e1f2894bbd3683. b5b41ccfb05bfcacc674167c50e1f2894bbd3683 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.56% :arrow_up:0.0%] test-mac-arm [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.82% :arrow_up:0.04%] ursa-thinkcentre-m75q Buildkite builds: [Finished] b5b41ccf ec2-t3-xlarge-us-east-2 [Failed] b5b41ccf test-mac-arm [Failed] b5b41ccf ursa-i9-9960x [Finished] b5b41ccf ursa-thinkcentre-m75q [Finished] f3327d2c ec2-t3-xlarge-us-east-2 [Failed] f3327d2c test-mac-arm [Failed] f3327d2c ursa-i9-9960x [Finished] f3327d2c ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Oct 14 '22 19:10 ursabot