<chrono>: Windows x64 ABI: bad performance with wrapped data like chrono::seconds

Open bernd5 opened this issue 5 years ago • 15 comments

I observed that if I return, for example, a std::chrono::seconds object from a non-inlined function, my code becomes 5 times slower than when using long long directly (x64 compilation on Windows).

The reason for this is the Windows x64 ABI. See: https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019

According to this spec, any value whose type has a base class or a custom constructor is returned via the stack rather than in a register. I would really like to use these wrapper types and others, but I can't accept such a big performance hit.

Is there any way to explicitly tell the compiler to return such simple values via register (exactly like the underlying data)?

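To reproduce the issue, here is some simple code:

#include <chrono>
#include <iostream>

// Returns a wrapped value: on Windows x64 this goes through a hidden return buffer.
__declspec(noinline)
std::chrono::seconds Foo() {
    return std::chrono::seconds{ 42 };
}

// Returns the underlying type directly: the value comes back in a register.
__declspec(noinline)
long long Foo2() {
    return 42;
}

int main()
{
    auto f = Foo();
    auto f2 = Foo2();
    // count() reads the stored value without the reinterpret_cast.
    std::cout << "Hello World!\n" << f.count() << f2;
}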

The code results in the following assembly:

Foo:

00007FF737C31000  mov         qword ptr [rcx],2Ah  
00007FF737C31007  mov         rax,rcx

Instead of Foo2:

00007FF737C31010  mov         eax,2Ah

For my project I use only a single compiler and don't care about ABI compatibility across compilers.

vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.

bernd5 avatar Feb 12 '20 16:02 bernd5

@bernd5 IIRC you can use __vectorcall for this if your types qualify as Homogeneous Vector Aggregates. The documentation on HVAs is a bit more vague than I'd like, though, so I've always tested it using godbolt (or similar). It works just fine with the vector types we use in our codebase, even those that have base classes (CRTP for some special-casing).

marzer avatar Feb 12 '20 18:02 marzer

Having said that, I don't think __vectorcall helps at all for class constructors or aggregate initializers.

marzer avatar Feb 12 '20 18:02 marzer

I'm not aware of anything the library could do differently here, but we should investigate this for the binary-incompatible vNext release.

StephanTLavavej avatar Feb 13 '20 03:02 StephanTLavavej

I'm not sure the calling convention can be changed for this, as AFAIK it would affect existing COM and C APIs returning structs (DirectX has a couple of them). Introducing a new calling convention might work, but then it'd be x86 all over again.

However, the calling convention specifies that (emphasis mine)

To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. It must also have no user-defined constructor, destructor, or copy assignment operator; no private or protected non-static data members; no non-static data members of reference type; no base classes; no virtual functions; and no data members that do not also meet these requirements.

So this can be worked around by making the data member public: https://godbolt.org/z/wS_gNv
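A minimal sketch of that workaround, assuming a hypothetical strong type `Seconds` (not an STL type) and a hypothetical `SKETCH_NOINLINE` macro. Going by the rule quoted above, a type shaped like this should qualify for return in RAX, but checking the disassembly is the only way to be sure:

```cpp
#include <type_traits>

// Hypothetical strong type: one public data member, no constructors,
// no base classes, no virtual functions.
struct Seconds {
    long long count;
};

// Properties the quoted rule relies on (size of 64 bits, plain aggregate).
static_assert(std::is_aggregate_v<Seconds>);
static_assert(std::is_trivially_copyable_v<Seconds>);
static_assert(sizeof(Seconds) == sizeof(long long));

// Hypothetical helper macro: keep the call out of line so the return
// path stays visible in the generated assembly.
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#elif defined(__GNUC__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

SKETCH_NOINLINE Seconds make_seconds(long long n) {
    return Seconds{n};  // aggregate initialization, no constructor involved
}
```

The trade-off is that the type loses its invariants: anyone can write to `count`, and no converting constructors can be added.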

sylveon avatar Feb 13 '20 04:02 sylveon

Wow, thanks! I didn't know that the access control mattered. (I think this also means we need to be extremely careful about changing the access control of data members.)

StephanTLavavej avatar Feb 13 '20 04:02 StephanTLavavej

That is indeed good to know. I've been wondering how low cost such "strong types" are in practice for quite some time. This doesn't help for std::chrono::duration though right? Because those types need to have user-defined constructors for conversion to/from other types anyway.

@bernd5: How much impact does this have on your actual code though? I mean, if a function is called often enough that the call overhead matters for overall performance, it would probably be a good candidate for inlining regardless of the return type.

MikeGitb avatar Feb 13 '20 08:02 MikeGitb

I observed this in a recursive function called with runtime data, which can't be inlined (and was not inlined). The overall impact is enormous: with pure native types this function runs 5 times faster. This is not really surprising to me, because register access is much faster than memory access (even with CPU caches).

With C++03 aggregates (no base classes, no constructors, and no private/protected instance data), return values can be passed via the RAX register, which would work in this case.

But even if we accept these strong restrictions and define a C++03 aggregate for a double value, we lose performance. The value is transferred via the RAX register and not XMM0 (although the copy from RAX to XMM0 is fast).
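For illustration, a sketch of the kind of aggregate meant here, using the hypothetical names `Meters` and `scale`. As described above, on Windows x64 the result is reported to come back in RAX rather than XMM0; the code itself does not depend on that:

```cpp
#include <type_traits>

// Hypothetical C++03-style aggregate wrapping a double.
struct Meters {
    double value;
};

static_assert(std::is_aggregate_v<Meters>);
static_assert(sizeof(Meters) == sizeof(double));

// Returns the wrapper by value. Which register carries the result is
// decided by the platform ABI, not by anything written here.
inline Meters scale(Meters m, double factor) {
    return Meters{m.value * factor};
}
```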

Because Linux uses different calling conventions, there seems to be no performance loss there. On godbolt you can see that the wrapped code compiles to exactly the same assembly: GodBolt. This even seems to be true for gcc on Windows.

So in general this behaviour is very compiler / Windows specific. I can understand that there may be many people who don't care so much about performance but really need ABI compatibility / stability. So my suggestion is to introduce a new declspec, and maybe a flag, to allow better assembly generation:

__declspec(treat_as(long long))

bernd5 avatar Feb 13 '20 11:02 bernd5

After some more thought, I think the "return via register" rule could be changed to simply include all trivially copyable types. Types returned via a C or COM API should already be trivially copyable, because C/COM has no concept of a copy constructor, and types which match the existing rule are obviously trivially copyable.
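A rough sketch of what such a relaxed rule would test, written as a hypothetical trait `fits_proposed_rule`. It assumes the size limits of the existing rule stay in place; it is not something the compiler actually uses:

```cpp
#include <chrono>
#include <string>
#include <type_traits>

// Hypothetical trait: trivially copyable and exactly 1, 2, 4, or 8 bytes.
template <class T>
inline constexpr bool fits_proposed_rule =
    std::is_trivially_copyable_v<T> &&
    (sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 || sizeof(T) == 8);

// std::chrono::seconds has constructors and a private member, yet it is
// trivially copyable and, on common implementations, 8 bytes wide.
static_assert(fits_proposed_rule<std::chrono::seconds>);

// std::string is not trivially copyable, so it stays excluded.
static_assert(!fits_proposed_rule<std::string>);
```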

The only exception I see involves COM. A COM vtable is made of function pointers, and function pointers are expected to call free-standing functions. As a result, code that uses the vtable manually (such as C code using COM) cannot call COM functions returning structs: it will look for the return value in the register, but C++ member functions always use the stack for returning the value (https://godbolt.org/z/y1KsxE). There is a MIDL change incoming that would make the generated C function pointer prototypes account for this. So such a calling convention change could be used for normal class members, but not for virtual class members, as it would break COM consumers and producers (which are supposed to be independent from the C++ ABI).

However, this is something that the compiler team should look at, not the standard library team.

sylveon avatar Feb 13 '20 13:02 sylveon

Could you forward this issue to the compiler team?

bernd5 avatar Feb 14 '20 11:02 bernd5

The compiler team is aware of this issue. We'll keep this issue open for vNext, to investigate working around this to some extent. Thanks!

(Marking as decision needed because we need to figure out exactly what the compiler will be doing with the Core Language ABI, and how much we want to make data members public.)

StephanTLavavej avatar Feb 19 '20 21:02 StephanTLavavej

This sounds like it might be a case for evangelizing something like [[trivial_abi]] to the compiler team (https://quuxplusone.github.io/blog/2018/05/02/trivial-abi-101/). The use case Arthur outlines is almost exactly this situation here.
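For reference, this is roughly how the attribute is spelled in Clang today. The class `Ticks` and the macro `SKETCH_TRIVIAL_ABI` are hypothetical, MSVC is assumed not to provide an equivalent, and whether the attribute would change how such a value is returned under the Windows x64 ABI is not established here:

```cpp
// Hypothetical macro: expand to the Clang attribute where available,
// and to nothing on other compilers.
#if defined(__clang__)
#define SKETCH_TRIVIAL_ABI [[clang::trivial_abi]]
#else
#define SKETCH_TRIVIAL_ABI
#endif

// Hypothetical wrapper with a user-provided constructor and a private
// member, i.e. the shape that the current MSVC rule excludes.
class SKETCH_TRIVIAL_ABI Ticks {
public:
    explicit constexpr Ticks(long long n) : n_(n) {}
    constexpr long long count() const { return n_; }

private:
    long long n_;
};
```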

seanmiddleditch avatar Feb 20 '20 18:02 seanmiddleditch

Removing decision needed as vNext supersedes it (we'll need to decide every major change when we start work on vNext anyways).

StephanTLavavej avatar Nov 09 '22 22:11 StephanTLavavej

std::thread::id could also be a type that is suitable for this optimization

AlexGuteniev avatar Apr 18 '23 08:04 AlexGuteniev

reference_wrapper

Inspired by #4036

AlexGuteniev avatar Sep 19 '23 04:09 AlexGuteniev

The optimization cannot be implemented from the library side without involving the compiler. Even adding constructors, except defaulted ones, breaks pod-ness.
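A small sketch of that last point using standard type traits. The names `Plain`, `Defaulted`, `WithCtor`, and `is_pod_like` are hypothetical, and the traits only approximate the ABI's own notion of POD:

```cpp
#include <type_traits>

// Hypothetical approximation of "POD-ness": trivial and standard layout.
template <class T>
inline constexpr bool is_pod_like =
    std::is_trivially_copyable_v<T> &&
    std::is_trivially_default_constructible_v<T> &&
    std::is_standard_layout_v<T>;

struct Plain {
    long long v;
};

struct Defaulted {
    long long v;
    Defaulted() = default;  // defaulted on first declaration: still trivial
};

struct WithCtor {
    long long v;
    explicit WithCtor(long long x) : v(x) {}  // user-provided constructor
};

static_assert(is_pod_like<Plain>);
static_assert(is_pod_like<Defaulted>);
static_assert(!is_pod_like<WithCtor>);  // no trivial default constructor anymore
```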

AlexGuteniev avatar Nov 10 '25 11:11 AlexGuteniev