GH-42240: [R] Fix crash in ParquetFileWriter$WriteTable and add WriteBatch
Rationale for this change
See https://github.com/apache/arrow/issues/42240.
What changes are included in this PR?
- Fixes a crash in
ParquetFileWriter$WriteTableby asserting the class of what's passed in and stopping if it's not aTable - Since I was already there, added
WriteBatchto matchpyarrow.parquet.ParquetWriter.write_batchwhich is just a convenience
Are these changes tested?
Yes.
Are there any user-facing changes?
New method on ParquetFileWriter (WriteBatch).
- GitHub Issue: #42240
:warning: GitHub issue #42240 has been automatically assigned in GitHub to PR creator.
I pushed up a change addressing that comment and noticed I missed documenting the new class method so I added another commit for that. Letting CI run and then I'll merge.
One of the tests this PR adds, in conjunction with the minimum supported C++ version check, caught a segfault in Arrow C++ that wasn't fixed until Arrow 15. I think we need to put a closed-check in the R package somewhere to catch this. I'll look into that and include a note that the check can be removed once we bump the minimum version to or above 15.
It wasn't easy to expose the writer's closed_ state to the R package for Arrow C++ 13/14 so I opted just to skip the test on Arrow C++ <15, see 22a706b2e2d5eaf7cbacbbcf6de6a9cfa2132a09. @jonkeane can you take a quick look?
It wasn't easy to expose the writer's closed_ state to the R package for Arrow C++ 13/14 so I opted just to skip the test on Arrow C++ <15, see https://github.com/apache/arrow/commit/22a706b2e2d5eaf7cbacbbcf6de6a9cfa2132a09. @jonkeane can you take a quick look?
Does this mean we should bump our minimum libarrow to 15 then? If we can't support it, that's the way to go IMO. The backwards compatibility with libarrow is nice, but as far as I know it's not used anywhere specifically. It was added with the hopes that we might use system-library libarrow on CRAN among other places, but that hasn't happened.
Hrm, maybe. I think the only thing to do other than that would be to duplicate the open/closed state tracking on the R side until we bump the minimum version for some other reason. I think it may be relevant that this segfault is just in the manual invocation of the ParquetFileWriter and I'm not sure if that impacts the decision-making process here.
Do we know of any actual uses of the backwards compatibility in the wild?
I'm not aware of any.
I'm not aware of any.
In that case I think we should just require >15 and that's ok (that might also let us delete some of the ifdefs too).
Okay @jonkeane, I've bumped the minimum version from 13 to 15 and updated the PR body. There was only one ARROW_VERSION_MAJOR guard to be removed which I removed.
Thanks! Sorry I should have linked this before, but we also need to change https://github.com/apache/arrow/blob/89fd5664b942f0cec1c51a4a17610aac3015d080/r/tools/check-versions.R#L27 too yeah?
My fault for missing that. I changed that and added a NEWS.md entry while I was at it.
I think CI is happy now that I pushed https://github.com/apache/arrow/pull/42241/commits/c2d39d0dff6fa3f14389352b6436e94d7affa73c.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 84df3438e39b470ec9bfe77e682a14bda07b0921.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.