arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-42240: [R] Fix crash in ParquetFileWriter$WriteTable and add WriteBatch

Open amoeba opened this issue 1 year ago • 3 comments

Rationale for this change

See https://github.com/apache/arrow/issues/42240.

What changes are included in this PR?

  • Fixes a crash in ParquetFileWriter$WriteTable by asserting the class of what's passed in and stopping if it's not a Table
  • Since I was already there, added WriteBatch to match pyarrow.parquet.ParquetWriter.write_batch which is just a convenience

Are these changes tested?

Yes.

Are there any user-facing changes?

New method on ParquetFileWriter (WriteBatch).

  • GitHub Issue: #42240

amoeba avatar Jun 21 '24 07:06 amoeba

:warning: GitHub issue #42240 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jun 21 '24 07:06 github-actions[bot]

I pushed up a change addressing that comment and noticed I missed documenting the new class method so I added another commit for that. Letting CI run and then I'll merge.

amoeba avatar Jul 02 '24 20:07 amoeba

One of the tests this PR adds, in conjunction with the minimum supported C++ version check, caught a segfault in Arrow C++ that wasn't fixed until Arrow 15. I think we need to put a closed-check in the R package somewhere to catch this. I'll look into that and include a note that the check can be removed once we bump the minimum version to or above 15.

amoeba avatar Jul 03 '24 17:07 amoeba

It wasn't easy to expose the writer's closed_ state to the R package for Arrow C++ 13/14 so I opted just to skip the test on Arrow C++ <15, see 22a706b2e2d5eaf7cbacbbcf6de6a9cfa2132a09. @jonkeane can you take a quick look?

amoeba avatar Jul 03 '24 20:07 amoeba

It wasn't easy to expose the writer's closed_ state to the R package for Arrow C++ 13/14 so I opted just to skip the test on Arrow C++ <15, see https://github.com/apache/arrow/commit/22a706b2e2d5eaf7cbacbbcf6de6a9cfa2132a09. @jonkeane can you take a quick look?

Does this mean we should bump our minimum libarrow to 15 then? If we can't support it, that's the way to go IMO. The backwards compatibility with libarrow is nice, but as far as I know it's not used anywhere specifically. It was added with the hopes that we might use system-library libarrow on CRAN among other places, but that hasn't happened.

jonkeane avatar Jul 04 '24 15:07 jonkeane

Hrm, maybe. I think the only thing to do other than that would be to duplicate the open/closed state tracking on the R side until we bump the minimum version for some other reason. I think it may be relevant that this segfault is just in the manual invocation of the ParquetFileWriter and I'm not sure if that impacts the decision-making process here.

amoeba avatar Jul 04 '24 18:07 amoeba

Do we know of any actual uses of the backwards compatibility in the wild?

jonkeane avatar Jul 04 '24 18:07 jonkeane

I'm not aware of any.

amoeba avatar Jul 04 '24 18:07 amoeba

I'm not aware of any.

In that case I think we should just require >15 and that's ok (that might also let us delete some of the ifdefs too).

jonkeane avatar Jul 04 '24 22:07 jonkeane

Okay @jonkeane, I've bumped the minimum version from 13 to 15 and updated the PR body. There was only one ARROW_VERSION_MAJOR guard to be removed which I removed.

amoeba avatar Jul 09 '24 22:07 amoeba

Thanks! Sorry I should have linked this before, but we also need to change https://github.com/apache/arrow/blob/89fd5664b942f0cec1c51a4a17610aac3015d080/r/tools/check-versions.R#L27 too yeah?

jonkeane avatar Jul 09 '24 22:07 jonkeane

My fault for missing that. I changed that and added a NEWS.md entry while I was at it.

amoeba avatar Jul 09 '24 22:07 amoeba

I think CI is happy now that I pushed https://github.com/apache/arrow/pull/42241/commits/c2d39d0dff6fa3f14389352b6436e94d7affa73c.

amoeba avatar Jul 10 '24 00:07 amoeba

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 84df3438e39b470ec9bfe77e682a14bda07b0921.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.