
[C++] Improve C++ Orc Adapter performance and memory footprint

Open asfimport opened this issue 6 years ago • 8 comments

Currently, Arrow C++ provides a naive adapter implementation that allows users to read ORC files into Arrow RecordBatches. However, this implementation has several drawbacks:

  • Inefficient conversion that incurs huge memcpy overhead
    • The ORC adapter currently performs a byte-by-byte memcpy to move data from an ORC VectorBatch to an Arrow RecordBatch, even though ORC VectorBatch shares the same memory layout as Arrow for most data types.
  • Huge memory footprint due to the lack of a TableReader implementation
    • The ORC adapter currently only allows users to read data one stripe at a time. However, because ORC is a columnar format with a high compression ratio, the data read from a single ORC stripe can potentially take up gigabytes of memory, which makes the ORC adapter hard to use in a production environment.

Here we propose a new ORC adapter implementation to fix the issues mentioned above:

  • To reduce conversion overhead, instead of performing a naive data copy, the new adapter would take full advantage of the memory layout similarity between ORC VectorBatch and Arrow RecordBatch. Namely, the new adapter would use pointer manipulation to transfer memory ownership from the VectorBatch to the Arrow RecordBatch whenever possible.
  • The new ORC adapter would give users row-level granularity when reading data from an ORC file. The user should be able to specify how many rows to expect in each output RecordBatch, and the ORC adapter should make sure that no more than the requested number of rows is returned.
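
The two ideas above can be sketched with the standard library alone. This is a minimal illustration, not the actual Arrow or ORC API: `SourceBatch`, `BufferView`, `RowSource`, `Wrap`, and `NextBatch` are all hypothetical names standing in for an ORC VectorBatch, an Arrow buffer, the ORC decoder, and the adapter. It assumes a single non-null 64-bit integer column whose layout is identical on both sides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a decoded ORC column batch; it owns its values.
struct SourceBatch {
  std::vector<std::int64_t> values;
};

// Hypothetical stand-in for an Arrow-style buffer: a pointer into memory it
// does not copy, plus shared ownership of whatever allocated that memory.
struct BufferView {
  std::shared_ptr<const std::int64_t> data;
  std::size_t length;
};

// Zero-copy hand-off: the aliasing shared_ptr constructor points at the
// values but shares ownership of the whole SourceBatch, so the memory stays
// alive for as long as any view exists. No element is copied.
BufferView Wrap(const std::shared_ptr<SourceBatch>& owner) {
  assert(owner != nullptr);
  const std::int64_t* start = owner->values.data();
  return BufferView{std::shared_ptr<const std::int64_t>(owner, start),
                    owner->values.size()};
}

// Hypothetical decoder: materializes at most max_rows rows per call, so peak
// memory is bounded by the requested batch size, not by the stripe size.
class RowSource {
 public:
  explicit RowSource(std::size_t total_rows) : total_(total_rows), next_(0) {}

  std::shared_ptr<SourceBatch> Read(std::size_t max_rows) {
    assert(max_rows > 0);
    const std::size_t n = std::min(max_rows, total_ - next_);
    auto batch = std::make_shared<SourceBatch>();
    batch->values.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
      // Fake data: each row's value is simply its row index.
      batch->values.push_back(static_cast<std::int64_t>(next_++));
    }
    return batch;
  }

 private:
  std::size_t total_;
  std::size_t next_;
};

// Adapter step: decode at most max_rows rows, then wrap them without copying.
// A view of length 0 signals the end of the data.
BufferView NextBatch(RowSource& source, std::size_t max_rows) {
  return Wrap(source.Read(max_rows));
}
```

With a source of 5 rows and a limit of 2, successive calls to `NextBatch` yield views of 2, 2, 1, and then 0 rows. In a real adapter, which column types can be shared this way would depend on the two layouts actually matching for that type; where they differ, a conversion copy would still be needed.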

Reporter: Yurui Zhou / @yuruiz

Subtasks:

PRs and other links:

Note: This issue was originally created as ARROW-4713. Please see the migration documentation for further details.

asfimport avatar Feb 28 '19 09:02 asfimport

Wes McKinney / @wesm: Seems like there may be many tasks here, please create sub-tasks if you want to break the work into multiple patches

asfimport avatar Feb 28 '19 15:02 asfimport

Antoine Pitrou / @pitrou: [~Yurui Zhou] Are you still planning to work on these issues at some point?

asfimport avatar Sep 24 '20 13:09 asfimport

Krisztian Szucs / @kszucs: Moving to 4.0.

asfimport avatar Jan 12 '21 14:01 asfimport

Ian Alexander Joiner / @iajoiner: Hmm, this looks interesting. If @Yurui Zhou won’t take it, I potentially can. However, I don’t think I’ll have time for it before July, so if I take it, the work will happen about half a year from now and won’t be available in 4.0.

asfimport avatar Jan 18 '21 04:01 asfimport

Todd Farmer / @toddfarmer: This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

asfimport avatar Jul 12 '22 14:07 asfimport

@iajoiner Do you still plan to work on this?

XinyuZeng avatar Mar 02 '23 08:03 XinyuZeng

@XinyuZeng Sorry I don't really have time for ORC right now. Please go ahead and take it if you want to.

iajoiner avatar Mar 02 '23 15:03 iajoiner

This issue has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this issue will be closed in 14 days. If this improvement is still desired but has no current owner, please add the 'Status: needs champion' label.

github-actions[bot] avatar Dec 14 '25 11:12 github-actions[bot]