simdcsv icon indicating copy to clipboard operation
simdcsv copied to clipboard

bad msvc performance

Open reder2000 opened this issue 3 years ago • 10 comments

I've proposed a PR (#10) that allows to compile on MSVC.

However the performance is awful (0.65GB/sec instead of 11GB/sec on the same machine, with MSVC Clang)

If anybody is able to look into that.

reder2000 avatar Jul 01 '22 16:07 reder2000

Side note: it has nothing to do with __builtin_prefetch. MSVC Clang accepts it, does not change results.

reder2000 avatar Jul 01 '22 17:07 reder2000

Incidental findings: defining #define really_inline __declspec(noinline) MSVC perfomance jumps to 5.6GB/sec. Much better.

reder2000 avatar Jul 01 '22 17:07 reder2000

The use case for __declspec(noinline) are functions that are small (so likely inlined) but rarely used. It does not make intuitive sense that it would help performance.

In simdjson, we have

 #define simdjson_really_inline __forceinline

In simdcsv, we have...

#ifdef _MSC_VER


#define really_inline inline
#define never_inline __declspec(noinline)

lemire avatar Jul 01 '22 17:07 lemire

Yes, discovered this by chance. MSVC results are counter-intuitive, something's wrong in their inlining.

__forceinline and inline give same results.

I tried noinline to get a split of function time while profiling...

reder2000 avatar Jul 01 '22 17:07 reder2000

it is find_quote_mask + cmp_mask_against_input that I need to un-inline

reder2000 avatar Jul 01 '22 18:07 reder2000

something's wrong in their inlining.

Though we have not worked much on simdcsv, we have worked much with simdjson and Visual Studio and I have found that, indeed, something is very wrong with the code generation. At first, we thought that Visual Studio would ignore use and not inline, but it does.

The one strange thing I have spotted in the generated code is that Visual Studio seems to sometimes get confused and copy all the registers to stack and then immediately bring them back from stack. I reported the issue to Microsoft and we even chatted with some of the people from the core Visual Studio compiler team. Unfortunately, I do not think that they investigated the issue, and I ended up giving up and moving on (after reporting the issue in details).

You can see that I did a lot of work...

https://github.com/simdjson/simdjson/issues/847#issuecomment-657858235

You can try manually inlining the functions... it should work. To be clear, manually inlining the functions give better results then relying on the compiler, under visual studio.

@reder2000 Are you able to look at the produce assembly? It could be fun to compare before/after inlining. Just do an assembly dump. I predict that you will see the registering/deregistering issue.

Note that if you use ClangCL, the problem goes away.

References

https://github.com/simdjson/simdjson/issues/847#issuecomment-653225952

lemire avatar Jul 01 '22 18:07 lemire

it is find_quote_mask + cmp_mask_against_input that I need to un-inline

Or try manually inlining them!

lemire avatar Jul 01 '22 18:07 lemire

I've been playing a while with it. It turns out that MSVC performance is worst with /Ob2 than /Ob1. With /Ob1 performance is 50% of Clang-CL at /Ob2 (and same as CLang-CL at /Ob1) I tried replacing function with macros, one gets the /Ob2 (really bad) performance whatever the flag, so too much inlining is causing MSVC to act weirdly. Those people seem to have the same kind of problems: https://github.com/xtensor-stack/xsimd/issues/617

reder2000 avatar Jul 02 '22 13:07 reder2000

Those people seem to have the same kind of problems: https://github.com/xtensor-stack/xsimd/issues/617

I think their problem is too little inlining.

lemire avatar Jul 02 '22 16:07 lemire

Maybe I read that too quickly.

I'll submit the issue to MSFT. /Ob2 worst than /Ob1 is not good news. Even though this is a corner case.

reder2000 avatar Jul 02 '22 21:07 reder2000