arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17259: [C++] Do not use shared_ptr<DataType> with kernel function signatures, do less copying of shared_ptrs

Open wesm opened this issue 2 years ago • 20 comments

This is something that I've hacked on a little bit while on airplanes and in idle moments for my own curiosity and just cleaned up to turn into a PR. This PR changes arrow::compute::InputType to only store const DataType* instead of shared_ptr<DataType>, which means that we don't have to copy the smart pointers for every kernel and function that's registered. Amazingly, this trims the size of libarrow.so by about 1% (~300KB) because of not having a bunch of inlined shared_ptr code.

This change does have consequences -- developers need to be careful about creating function signatures with data types that may have temporary storage duration (all of the APIs like arrow::int64() return static instances, and I changed arrow::duration and some other type factories to return static instances, too). But I think the benefits outweigh the costs.

wesm avatar Jul 29 '22 22:07 wesm

https://issues.apache.org/jira/browse/ARROW-17259

github-actions[bot] avatar Jul 29 '22 22:07 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Jul 29 '22 22:07 github-actions[bot]

@wesm Would you like to take a look at the compile failures?

pitrou avatar Aug 03 '22 14:08 pitrou

@pitrou yes, I'll fix them today

wesm avatar Aug 03 '22 14:08 wesm

@pitrou there's a doctest failure here unrelated to this PR, seems to be caused by the 9.0.0 to 10.0.0 version bump (which made the metadata 1 byte bigger):

2022-08-03T17:03:42.8027721Z =================================== FAILURES ===================================
2022-08-03T17:03:42.8028142Z ___________________ [doctest] pyarrow.parquet.read_metadata ____________________
2022-08-03T17:03:42.8028491Z 3406 
2022-08-03T17:03:42.8028721Z 3407     Examples
2022-08-03T17:03:42.8029041Z 3408     --------
2022-08-03T17:03:42.8029336Z 3409     >>> import pyarrow as pa
2022-08-03T17:03:42.8029679Z 3410     >>> import pyarrow.parquet as pq
2022-08-03T17:03:42.8030399Z 3411     >>> table = pa.table({'n_legs': [4, 5, 100],
2022-08-03T17:03:42.8030920Z 3412     ...                   'animal': ["Dog", "Brittle stars", "Centipede"]})
2022-08-03T17:03:42.8031437Z 3413     >>> pq.write_table(table, 'example.parquet')
2022-08-03T17:03:42.8031751Z 3414 
2022-08-03T17:03:42.8032144Z 3415     >>> pq.read_metadata('example.parquet')
2022-08-03T17:03:42.8032625Z Differences (unified diff with -expected +actual):
2022-08-03T17:03:42.8033004Z     @@ -1,7 +1,7 @@
2022-08-03T17:03:42.8033412Z     -<pyarrow._parquet.FileMetaData object at ...>
2022-08-03T17:03:42.8033907Z     -  created_by: parquet-cpp-arrow version ...
2022-08-03T17:03:42.8034339Z     +<pyarrow._parquet.FileMetaData object at 0x7fd2089b3830>
2022-08-03T17:03:42.8034877Z     +  created_by: parquet-cpp-arrow version 10.0.0-SNAPSHOT
2022-08-03T17:03:42.8035225Z        num_columns: 2
2022-08-03T17:03:42.8035477Z        num_rows: 3
2022-08-03T17:03:42.8035752Z        num_row_groups: 1
2022-08-03T17:03:42.8036047Z        format_version: 2.6
2022-08-03T17:03:42.8036407Z     -  serialized_size: 561
2022-08-03T17:03:42.8036708Z     +  serialized_size: 562
2022-08-03T17:03:42.8037076Z 
2022-08-03T17:03:42.8037502Z /opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/parquet/__init__.py:3415: DocTestFailure

wesm avatar Aug 03 '22 17:08 wesm

Hmm... I assume it's because of the serialized_size difference? It's weird that it would differ by one byte, but perhaps that's due to a different Thrift version. We should probably simply relax this test.

pitrou avatar Aug 03 '22 17:08 pitrou

I think the reason is the change from 9.0.0 to 10.0.0 which added one byte, here is the doctest

    >>> pq.read_metadata('example.parquet')
    <pyarrow._parquet.FileMetaData object at ...>
      created_by: parquet-cpp-arrow version ...
      num_columns: 2
      num_rows: 3
      num_row_groups: 1
      format_version: 2.6
      serialized_size: 561

wesm avatar Aug 03 '22 17:08 wesm

I patched this in https://github.com/apache/arrow/pull/13790

wesm avatar Aug 03 '22 19:08 wesm

I think this can be merged once the build is green if the changes look okay.

wesm avatar Aug 03 '22 21:08 wesm

InputType and OutputType instances are only created at kernel registration, if I'm not mistaken?

pitrou avatar Aug 04 '22 12:08 pitrou

But those Input/OutputType instances are stored in the KernelSignature, so those references need to stay valid.

lidavidm avatar Aug 04 '22 13:08 lidavidm

But those Input/OutputType instances are stored in the KernelSignature, so those references need to stay valid.

My question was more along the lines of: what is the point of micro-optimizing a number of shared_ptr copies that only happen at startup?

pitrou avatar Aug 04 '22 13:08 pitrou

@pitrou this isn’t a micro-optimization imho — when 1% of your binary size is concerned with inline shared_ptr code in this narrow part of the library that speaks to a design problem. The kernel registry population is still very bloated and should be able to be streamlined further to yield smaller binary size. The use of shared pointers also causes extra overhead in kernel dispatching and input type checking, but we would need to write some benchmarks to quantify that better

it might be a useful exercise at some point to analyze the disassembly of libarrow.so to understand the total contribution of inline shared_ptr code to the binary size. There are a lot of places where we inline constructors for example where there is probably no need

wesm avatar Aug 04 '22 14:08 wesm

@pitrou this isn’t a micro-optimization imho — when 1% of your binary size is concerned with inline shared_ptr code in this narrow part of the library that speaks to a design problem.

Is this looking at the text segment, or the entire binary size? If the latter, that might mean debug symbols or something... (those can be stripped for distribution purposes)

pitrou avatar Aug 04 '22 14:08 pitrou

The use of shared pointers also causes extra overhead in kernel dispatching and input type checking, but we would need to write some benchmarks to quantify that better

I would like to see benchmarks indeed, because as long as you don't copy them the overhead should be minimal.

pitrou avatar Aug 04 '22 14:08 pitrou

I took the PR and rebased it internally to then compare the sizes:

  • master:
$ size build-release/release/libarrow.so
   text	   data	    bss	    dec	    hex	filename
23471557	 317416	2568225	26357198	1922dce	build-release/release/libarrow.so
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 37289616 août   4 16:48 build-release/release/libarrow.so.1000.0.0
$ strip build-release/release/libarrow.so.1000.0.0 
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 23794552 août   4 16:48 build-release/release/libarrow.so.1000.0.0
  • PR:
$ size build-release/release/libarrow.so
   text	   data	    bss	    dec	    hex	filename
23314548	 314736	2568801	26198085	18fc045	build-release/release/libarrow.so
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 37097032 août   4 16:49 build-release/release/libarrow.so.1000.0.0
$ strip build-release/release/libarrow.so.1000.0.0 
$ ls -la build-release/release/libarrow.so.1000.0.0 
-rwxrwxr-x 1 antoine antoine 23634808 août   4 16:50 build-release/release/libarrow.so.1000.0.0

So it's a 0.7% saving here in both text segment size and stripped library size.

pitrou avatar Aug 04 '22 14:08 pitrou

Any objections to merging this?

wesm avatar Aug 05 '22 20:08 wesm

If there's no sizable performance improvement then I'm against merging this. Robustness should be a guiding principle and replacing shared pointers with raw pointers goes against it.

pitrou avatar Aug 05 '22 21:08 pitrou

I do not think the implementation is less robust — the only change is that ephemeral DataType instances cannot be used to instantiate InputType without creating a TypeMatcher that wraps and retains a shared_ptr for the cases where it is truly needed. If the developer does create a KernelSignature / InputType with an ephemeral shared_ptr<DataType>, the signature would be unusable so any unit test which attempts to use the kernel would segfault. The probability of an Arrow user ever facing a bug related to this essentially zero — it would amount to us merging untested kernels code (because tests would immediately reveal the programming error).

It also comes down to a cost philosophy in this code:

  • Pay the cost of using shared_ptr always — both in generated code (because most shared_ptr operations are inlined by the compiler) and in performance
  • Pay the cost only when it's needed

It turns out that in this code, that retaining a shared_ptr<DataType> for kernel function signatures and type validation is rarely needed. Why pay the cost for > 95% of cases (or whatever the number is?) when the shared_ptr is only actually needed in < 5% of cases? I recall that we've already seen shared_ptr contention show up unfavorably in performance profiles (e.g. ARROW-16161).

I will take the time when I can to write some benchmarks (for me, the present-and-future savings in generated code is evidence enough) that will hopefully provide some additional evidence to support this. In the meantime I hope that the rebase conflicts will not be too annoying to keep up with.

wesm avatar Aug 05 '22 21:08 wesm

I've done a little exploration locally and concluded that shared_ptr<DataType> versus const DataType* has no impact on the performance of Function::DispatchExact (at least on recent clang), which confirms the general guidance that dereferencing shared pointers is no different than a raw pointer — it's the creation, copying, and destruction of shared pointers that's expensive, especially in a multithreaded setting.

The last thing I'm interested in is the "bootup time" of the function registry, so I'm going to compare the before and after performance of that and report it here. If there is no significant difference, I will refactor this PR to restore shared pointers to InputType and OutputType but rather avoid allowing shared pointer copies to get inlined in these functions, so hopefully that will yield the same kind of reduction in binary bloat. There are some other beneficial changes here so I'd like to get those in — I'll report back when I have that extra performance data and have done the refactoring.

There's probably other places where we're either passing a shared pointer where there is no need or where we inline non-performance-sensitive-constructors which copy shared pointers where there is similarly little need, so let's be an eye out for htat.

wesm avatar Aug 08 '22 16:08 wesm

I had some personal stuff come up in August that got in the way of completing this work -- I will pick this up hopefully sometime this month (September) and hopefully reach a point that is satisfactory to all.

wesm avatar Sep 08 '22 18:09 wesm

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

amol- avatar Mar 30 '23 17:03 amol-