folly icon indicating copy to clipboard operation
folly copied to clipboard

Use folly::IsRelocatable in folly::small_vector

Open Quuxplusone opened this issue 2 years ago • 1 comments

The relocateInlineTrivial operation here is exactly "move-construct plus destroy," which is the same operation optimized into memcpy by FBVector::make_window and FBVector::reserve. So we can optimize it into memcpy here for the same set of types, namely, trivially relocatable types.

The existing kShouldCopyInlineTrivial is also used in the copy constructor, where we're doing "copy-construct, do not destroy." I believe the place I changed in this patch is the only place where we can use true relocation.

FWIW, I don't understand why this optimization can be disabled depending on the value of hardware_constructive_interference_size; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial std::copy. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

Quuxplusone avatar Feb 14 '23 16:02 Quuxplusone

@Orvid has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Feb 17 '23 19:02 facebook-github-bot

@Orvid one-year ping! :) I kinda forgot about this PR, but it's still applicable as far as I'm concerned. Please let me know if there's anything I can do to move it along.

Quuxplusone avatar Feb 11 '24 17:02 Quuxplusone

I've just made a couple of updates internally (basically just inlining things since these are only used in one place) to address feedback from reviewers. This should be good to go once the diff is approved internally.

Orvid avatar Feb 12 '24 21:02 Orvid

@Orvid merged this pull request in facebook/folly@b6762ac6b9a29f66e966a1da07f5e7b3ba3551be.

facebook-github-bot avatar Feb 14 '24 21:02 facebook-github-bot

FWIW, I don't understand why this optimization can be disabled depending on the value of hardware_constructive_interference_size; it seems to me that no matter what your cache line size is, memcpy will always be superior to any non-trivial std::copy. Either both approaches will hit cache effects (and memcpy will be faster), or neither approach will hit cache effects (and memcpy will be faster). But that's tangential to this patch's purpose, so I didn't mess with it.

Reducing the scope of the optimization actually had an impact on internal production workloads. The problem is not the memcpy but the fact that we're copying the whole inline storage, regardless of how many elements there are in it. This is a speculative optimization that can backfire when you have a large inline size that crosses multiple cache lines but frequently copy/move vectors that have only a few elements set; in this case you're doing extra cache line loads and stores that you wouldn't be doing with a sized loop.

ot avatar Feb 20 '24 18:02 ot

@ot: Ah, I either misread the old code or had a brain fart. We're not comparing "non-trivial copy size() elements" against "memcpy size() elements"; we're comparing "trivial copy (i.e. memcpy) size() elements" (saves cache lines, can't fully unroll) against "memcpy capacity() elements" (costs cache lines, can fully unroll).

In that case/sense, I think the simpler patch that @Orvid merged ( b6762ac6b9a29f66e966a1da07f5e7b3ba3551be ) is actually worse than the patch I initially submitted. IIUC, the status quo ante was

  • small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<int*, 1000>'s move-constructor (calls uninitialized_copy which) memcpys only 8*size() bytes
  • small_vector<unique_ptr<int>, 4>'s move-constructor calls uninitialized_copy which does something non-trivial on 8*size() bytes
  • small_vector<unique_ptr<int>, 1000>'s move-constructor calls uninitialized_copy which does something non-trivial on 8*size() bytes

Then, IIUC, b6762ac6b9a29f66e966a1da07f5e7b3ba3551be changed the behavior to this:

  • small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<int*, 1000>'s move-constructor invariably memcpys 8000 bytes (oops!)
  • small_vector<unique_ptr<int>, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<unique_ptr<int>, 1000>'s move-constructor invariably memcpys 8000 bytes (oops!)

What we actually want (and what I think my #1934 actually did!) is

  • small_vector<int*, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<int*, 1000>'s move-constructor (calls uninitialized_copy which) memcpys only 8*size() bytes
  • small_vector<unique_ptr<int>, 4>'s move-constructor invariably memcpys 32 bytes
  • small_vector<unique_ptr<int>, 1000>'s move-constructor (calls the moral equivalent of P1144 uninitialized_relocate which) memcpys only 8*size() bytes

Does that make sense? What's the next step here?

Quuxplusone avatar Feb 20 '24 19:02 Quuxplusone

If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct? And we need the memcpy instead of std::copy because while for trivial objects we can expect the compiler to lower the copy to a memcpy, we cannot reasonably expect that for general relocatable bytes.

I think that the best way would have been to just change the definition of kShouldCopyInlineTrivial (and switch to memcpy accordingly). This has the additional benefit of making the optimization consistent for all the places where it is used, while currently the move constructor is a special case.

I'll submit a patch.

ot avatar Feb 20 '24 20:02 ot

If I understand correctly the intent of the patch, it is just to extend the optimization to relocatable but non-trivially copyable types, correct?

Yes, but only in this place, which uses relocation. The constant kShouldCopyInlineTrivial is used in several other places (e.g. in the copy constructor) where it would be very incorrect to start memcpying things just because they happen to be trivially relocatable. Basically, the archetypical "trivially relocatable" types are unique_ptr and shared_ptr: they're safe to relocate with memcpy in this place, but they're totally not safe to copy with memcpy in those other places.

I would first recommend taking my original patch as submitted; or, silver-medal idea, your PR might try splitting out the is_trivially_copyable part from kShouldCopyInlineTrivial. Instead of these three pieces:

  static constexpr bool kShouldCopyInlineTrivial =
      is_trivially_copyable_v<Value> &&
      sizeof(InlineStorageType) <= hardware_constructive_interference_size / 2;

  small_vector(small_vector const& o) {
    if (kShouldCopyInlineTrivial && !o.isExtern()) {
      copyInlineTrivial<Value>(o);
      return;
    }

      if constexpr (IsRelocatable<Value>::value) {
        // Copy the entire inline storage, instead of just size() values, to

the first step should be to get to these three pieces:

  static constexpr bool kWholeBufferIsSmall =
      sizeof(InlineStorageType) <= hardware_constructive_interference_size / 2;

  small_vector(small_vector const& o) {
    if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall && !o.isExtern()) {
      copyInlineTrivial<Value>(o);
      return;
    }

      if constexpr (IsRelocatable<Value>::value && kWholeBufferIsSmall) {
        // Copy the entire inline storage, instead of just size() values, to

Quuxplusone avatar Feb 20 '24 21:02 Quuxplusone

Yes, but only in this place, which uses relocation

Move-assignment should benefit from this too, why not use it there as well?

ot avatar Feb 20 '24 21:02 ot

Yes, but only in this place, which uses relocation

Move-assignment should benefit from this too, why not use it there as well?

One could do something there, but it would be "not obvious." Basically you'd have to replace this current code:

        if (kShouldCopyInlineTrivial) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else {
          const size_t oSize = o.size();
          detail::partiallyUninitializedCopy(
              std::make_move_iterator(o.u.buffer()),
              oSize,
              this->u.buffer(),
              size());
          this->setSize(oSize);
          o.clear();
        }

with something like this:

        if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else if (isRelocatable<Value>::value && this->size() < o.size()) {
          // Move-assigning [X Y Z _ _] = [A B C D E]. First, move-assign the common prefix:
          std::move(o.u.buffer(), o.u.buffer() + this->size(), this->u.buffer());
          // Now we have [A B C _ _] on the left and [a b c D E] on the right. Relocate the suffix:
          std::memcpy(this->u.buffer() + this->size(), o.u.buffer() + this->size(), (oSize - this->size()) * kSizeOfValue);
          // Now we have [A B C D E] on the left and [a b c _ _] on the right. Update the sizes and we're done.
          o.setSize(this->size());
          this->setSize(oSize);
        } else {
          const size_t oSize = o.size();
          detail::partiallyUninitializedCopy(
              std::make_move_iterator(o.u.buffer()),
              oSize,
              this->u.buffer(),
              size());
          this->setSize(oSize);
          o.clear();
        }

(where _ means "no object," uppercase means "object with a value," and lowercase means "moved-from, but not destroyed, object").

My point is, yes you could do that, but that's significantly more complicated than the original point of my #1934. If you want to scope-creep it to also do that, sure, but it's kind of a whole separate thing from either "#1934 as originally designed" or "fixing-forward to eliminate b6762ac6b9a29f66e966a1da07f5e7b3ba3551be's large memcpy."

Quuxplusone avatar Feb 20 '24 22:02 Quuxplusone

...Oh, or you could do move-assignment by "destroy and relocate-into", like this:

        if (is_trivially_copyable_v<Value> && kWholeBufferIsSmall) {
          copyInlineTrivial<Value>(o);
          o.resetSizePolicy();
        } else if (isRelocatable<Value>::value) {
          // [X Y Z _ _] on the left, [A B C D _] on the right
          std::destroy(this->u.buffer(), this->u.buffer() + this->size());
          // [_ _ _ _ _] on the left, [A B C D _] on the right
          std::memcpy(this->u.buffer(), o.u.buffer(), (kWholeBufferIsSmall ? MaxInline : oSize) * kSizeOfValue);
          // [A B C D _] on the left, [_ _ _ _ _] on the right
          o.setSize(0);  // n.b.: NOT o.clear()!
          this->setSize(oSize);
        } else {
          [...]
        }

I see now that this is actually what AMC does. I have nothing against that, just that again it's kind of unrelated to the original intent of this PR.

Quuxplusone avatar Feb 20 '24 22:02 Quuxplusone

it's kind of unrelated to the original intent of this PR

I'm not sure what the original intent was, but I think it is better if move-assigning into an empty container behaves the same way as move-construction. The additional complexity is just a couple of lines, so it seems like a no brainer to me.

I landed the change in https://github.com/facebook/folly/commit/2459e172518a7b5d55292b0d8b741ff14def87ba

ot avatar Feb 22 '24 00:02 ot

Cool, 2459e172518a7b5d55292b0d8b741ff14def87ba LGTM! (At least I see nothing wrong with it in a five-minute glance.) FWIW, I like your refactor into moveInlineStorageRelocatable that pushes down the branch on "how much do we copy?" to the lower more hidden level. Presumably the next step could be to introduce an analogous copyInlineStorageTrivial to replace/subsume the existing copyWholeInlineStorageTrivial. (I suppose now I'm questioning the choice of the verb "move" there, instead of relocateInlineStorageTrivial. :)) Anyway — thanks!

Quuxplusone avatar Feb 22 '24 01:02 Quuxplusone

Presumably the next step could be to introduce an analogous copyInlineStorageTrivial to replace/subsume the existing copyWholeInlineStorageTrivial.

I don't think there would be much benefit in doing that, the special case is really about when it is beneficial to copy the whole buffer because it turns into a small number of full-word MOVs, instead of a loop of unpredictable length. The general case already lowers to a memcpy when the value type is trivial, with any reasonable compiler.

ot avatar Feb 23 '24 14:02 ot