Various optimizations
Hi, basically I was just messing around with some benchmarking, found some interesting things, and made some changes. They are mostly about removing branches, plus a SIMD version of the CRLF search that doesn't need to walk the whole payload twice in the worst case.
Below are the benchmark results
OS X (El Capitan, MacBook Pro 13" early 2015, 3.1 GHz Intel Core i7)
Clang
| time (s, master) | req/s (master) | time (s, new) | req/s (new) | speedup (%) |
|---|---|---|---|---|
| 5.761632 | 867809.687500 | 5.104409 | 979545.312500 | 11.4069 |
| 5.769115 | 866684.062500 | 5.104001 | 979623.625000 | 11.5289 |
| 5.803461 | 861554.875000 | 5.086198 | 983052.625000 | 12.3592 |
| 5.779534 | 865121.687500 | 5.082990 | 983672.937500 | 12.0519 |
| 5.762670 | 867653.375000 | 5.092542 | 981827.875000 | 11.6288 |
GCC 4.9
| time (s, master) | req/s (master) | time (s, new) | req/s (new) | speedup (%) |
|---|---|---|---|---|
| 4.297466 | 1163476.375000 | 3.787979 | 1319965.000000 | 11.8555 |
| 4.159397 | 1202097.250000 | 3.774015 | 1324849.000000 | 9.26533 |
| 4.230322 | 1181943.125000 | 3.788900 | 1319644.250000 | 10.4347 |
| 4.214539 | 1186369.375000 | 3.772407 | 1325413.750000 | 10.4906 |
| 4.162938 | 1201074.750000 | 3.763118 | 1328685.375000 | 9.6042 |
Ubuntu 14.04 (Intel Core i7-6700 @ 3.40 GHz)
| time (s, master) | req/s (master) | time (s, new) | req/s (new) | speedup (%) |
|---|---|---|---|---|
| 3.186016 | 1569358.000000 | 2.815905 | 1775628.000000 | 11.6167 |
| 3.186449 | 1569144.875000 | 2.846761 | 1756382.125000 | 10.6604 |
| 3.190864 | 1566973.625000 | 2.877822 | 1737425.125000 | 9.81057 |
| 3.183230 | 1570731.625000 | 2.813425 | 1777193.250000 | 11.6173 |
| 3.190520 | 1567142.625000 | 2.849443 | 1754728.875000 | 10.6903 |
I am not sure how to ignore whitespace when viewing the diff; the changes are actually very minimal.
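For anyone skimming the thread, the SIMD idea above can be sketched roughly like this (a minimal SSE2 sketch; the function name and structure are illustrative, not the PR's exact code):

```c
/* Minimal sketch of the SSE2 idea: compare 16 bytes at a time against CR
 * and LF, OR the two results, and turn the byte-wise matches into a bit
 * mask with _mm_movemask_epi8; count-trailing-zeros then gives the index
 * of the first match. Requires SSE2 and a GCC/Clang-style __builtin_ctz. */
#include <emmintrin.h>
#include <stddef.h>

/* Return a pointer to the first CR or LF in [p, p + len), or p + len. */
static const char* find_crlf_sketch(const char* p, size_t len) {
  const char* end = p + len;
  const __m128i vCR = _mm_set1_epi8('\r');
  const __m128i vLF = _mm_set1_epi8('\n');

  while (p + 16 <= end) {
    __m128i chunk = _mm_loadu_si128((const __m128i*)p);
    __m128i hits = _mm_or_si128(_mm_cmpeq_epi8(chunk, vCR),
                                _mm_cmpeq_epi8(chunk, vLF));
    int mask = _mm_movemask_epi8(hits); /* bit i set => byte i matched */
    if (mask != 0)
      return p + __builtin_ctz(mask);  /* index of first matching byte */
    p += 16;
  }
  /* Scalar tail for the final (len % 16) bytes. */
  for (; p < end; p++)
    if (*p == '\r' || *p == '\n')
      return p;
  return end;
}
```

Unlike a dual-memchr fallback, this touches each byte once even when no CR or LF is present.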
@shekhei you can use https://github.com/nodejs/http-parser/pull/342/files?w=1 to see the diff without whitespace changes.
BTW, it looks like the style in this file is to use lower-case for labels, not upper-case as you have for `HEADER_FIELD_BEGIN`.
@shekhei thank you for doing this! I've experimented with vector intrinsics some time before, but for whatever reason I had at the moment decided to not proceed further.
Is most of the speedup achieved by using vector intrinsics? How much do the other changes contribute?
The fallthroughs and the `h_general` jump accounted for roughly 20-30%. The vector part really depends on the size of the headers and on the builtin `memchr` variants.
I was really surprised by the GCC/Clang difference though: compiled with GCC 6, it's almost twice as fast as the Clang build on OS X.
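The fallthrough part can be illustrated with a toy state machine (purely hypothetical states, not the parser's real ones): ordinary header bytes fall through into a tight general-case loop instead of re-dispatching through the big switch for every character.

```c
#include <stddef.h>

enum mini_state { s_field_start, s_field_general };

/* Count header-field-name characters up to a ':'. Illustrative only. */
static size_t count_field_chars(const char* p, size_t len) {
  size_t i = 0, n = 0;
  enum mini_state state = s_field_start;
  while (i < len) {
    switch (state) {
      case s_field_start:
        state = s_field_general;
        /* fall through -- no extra trip around the switch */
      case s_field_general:
        /* tight inner loop: consume a run of ordinary bytes without
         * re-entering the switch for each one */
        while (i < len && p[i] != ':') {
          n++;
          i++;
        }
        return n; /* stopped at ':' or at end of input */
    }
  }
  return n;
}
```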
@craig65535 you are right, let me change them to lower case and use snake_case for function names.
hmm.. any news on this pr?
hmm... :P
hmm... any news?
Eager to see this landed. I get about a 21% speedup on an i7-7700HQ on Windows.
Just tested a wider version (256 bits, requiring AVX2) and got another 8% on top of that. The SSE2-only version needs no dispatching, which keeps things simple. Happy to help with an AVX2 implementation (using either preprocessor directives or runtime dispatching) if there's interest.
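For reference, the 256-bit variant could look roughly like this (a sketch, not the code actually tested above; the function name is mine, and a real build would guard it behind `-mavx2` or runtime dispatch):

```c
/* Same CR/LF compare-and-movemask idea as the SSE2 path, but 32 bytes
 * per step via AVX2. The target attribute lets GCC/Clang compile this
 * function with AVX2 enabled without passing -mavx2 globally. */
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static uint32_t get_crlf_mask_avx2(const char* p) {
  const __m256i vCR = _mm256_set1_epi8('\r');
  const __m256i vLF = _mm256_set1_epi8('\n');
  __m256i chunk = _mm256_loadu_si256((const __m256i*)p);
  __m256i hits = _mm256_or_si256(_mm256_cmpeq_epi8(chunk, vCR),
                                 _mm256_cmpeq_epi8(chunk, vLF));
  /* Bit i of the result is set iff byte i of the 32-byte block is CR/LF. */
  return (uint32_t)_mm256_movemask_epi8(hits);
}
```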
Ok, it seems my whole branch has gone cranky after not being updated for so long; let me clean it up.
Ok, latest changes made
@indutny What do you think of landing this awesome PR? It would be nice to have this core part of Node.js sped up!
This PR is awesome indeed!
I'll try to find time to review it this week. Sorry for not doing it sooner!
@bnoordhuis @mscdex your help would be very appreciated.
It would be interesting to see how much of a difference there would be in using NEON on ARM for the parts that use intrinsics. I might try that this weekend.
Clang doesn't define __GNUC__, yet it does support emmintrin.h. We need to remove the __GNUC__ checks and just leave __SSE2__ so this code is efficient with Clang as well.
Clang does define __GNUC__ actually. (Have you found some version that doesn't?)
Well, here it wasn’t compiling that part and it was because of that __GNUC__ check. I'll dig deeper.
Did you have -march=[something with at least SSE2] or -msse2?
Ok I just added an implementation that uses NEON and I see ~13% improvement when testing on an Odroid C2:
Before:
Benchmark result:
Took 29.264072 seconds to run
170857.968750 req/sec
After:
Benchmark result:
Took 25.583895 seconds to run
195435.453125 req/sec
FWIW here is my NEON patch on top of this PR as it currently stands:
--- http_parser.c.orig 2017-09-23 16:50:15.330221304 -0400
+++ http_parser.c 2017-09-24 02:01:58.228557556 -0400
@@ -443,7 +443,7 @@
} \
ch = *++p; \
++parser->nread; \
- } else \
+ }
/**
@@ -485,19 +485,17 @@
const char* find_crlf(const char* p, const char* data, size_t len);
-#if defined(_MSC_VER) // SSE2 is baseline
-#include <intrin.h>
-#define __builtin_ctz _tzcnt_u32
-#elif defined(__SSE2__) && defined(__GNUC__)
-#include <emmintrin.h>
-#endif
-
#ifndef USE_INTRISICS_CRLF
-# define USE_INTRISICS_CRLF defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__))
+# define USE_INTRISICS_CRLF 1
#endif
-#if USE_INTRISICS_CRLF
-
+#if USE_INTRISICS_CRLF && (defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__)))
+# if defined(_MSC_VER) // SSE2 is baseline
+# include <intrin.h>
+# define __builtin_ctz _tzcnt_u32
+# elif defined(__SSE2__) && defined(__GNUC__)
+# include <emmintrin.h>
+# endif
int32_t get_crlf_mask(const char* p) {
/* [ c, 0, 0, 0, 0, 0 .. 0 ] */
const __m128i vCR = _mm_set1_epi8(0x0a);
@@ -512,7 +510,37 @@
v2 = _mm_or_si128(v1, v2);
return _mm_movemask_epi8(v2);
}
+#elif USE_INTRISICS_CRLF && defined(__ARM_NEON)
+# include <arm_neon.h>
+uint16_t get_crlf_mask(const char* p) {
+ static const uint8_t __attribute__ ((aligned (16))) _powers_array[16] =
+ { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
+ uint8x16_t _powers = vld1q_u8(_powers_array);
+
+  /* CR (0x0d) broadcast to all 16 lanes */
+ const uint8x16_t vCR = vdupq_n_u8(0x0d);
+
+  /* LF (0x0a) broadcast to all 16 lanes */
+ const uint8x16_t vLF = vdupq_n_u8(0x0a);
+
+ uint8x16_t v1, v2;
+  v1 = vld1q_u8((const uint8_t*)p); /* unaligned-safe load */
+ v2 = vceqq_u8(vCR, v1);
+ v1 = vceqq_u8(vLF, v1);
+ v2 = vorrq_u8(v1, v2);
+
+  /* Compute the mask from the input */
+ uint64x2_t mask = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(v2, _powers))));
+
+  /* Get the resulting bytes */
+ uint16_t output;
+ vst1q_lane_u8((uint8_t*)&output + 0, (uint8x16_t)mask, 0);
+ vst1q_lane_u8((uint8_t*)&output + 1, (uint8x16_t)mask, 8);
+ return output;
+}
+#endif
+#if USE_INTRISICS_CRLF && ((defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__))) || defined(__ARM_NEON))
const char* find_crlf(const char* p, const char* data, size_t len) {
const char* lastp = MIN(data + len, HTTP_MAX_HEADER_SIZE + p);
@@ -525,7 +553,7 @@
result = (result >> alignment) << alignment;
if (!result) {
- while(!result && lastp >= p+32) {
+ while (!result && lastp >= p + 32) {
p += 32;
result = get_crlf_mask(p + 16) << 16 | get_crlf_mask(p);
}
@@ -539,9 +567,7 @@
}
return p;
}
-
#else
-
const char* find_crlf(const char* p, const char* data, size_t len) {
const char* p_cr;
const char* p_lf;
@@ -563,7 +589,6 @@
}
return p;
}
-
#endif
/* Our URL parser.
UPDATE: Patch updated. It turns out initializing a static SIMD array once vs. always creating it in the crlf-searching function makes no difference, so I've removed that extra bit of code. I've also factored out the common find_crlf() used by both SIMD-based implementations.
@Kronuz actually I have tested it on Clang too and it works; maybe you can share which version you are using?
I've updated my patch a bit.
I'm also wondering if the non-SIMD implementation might benefit from a SWAR implementation over the current dual memchr() solution?
@mscdex I did try optimizing by packing bytes into 64-bit words and comparing those, but I don't think the improvement was significant. Given that memchr is always optimized for the specific platform, if my memory serves me right, it started losing out rather quickly.
I might be wrong though.
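For context, the SWAR approach being discussed is usually built on the classic "has a zero byte" trick: XOR a 64-bit word with a broadcast CR (or LF) byte, then one subtract/AND pair tests 8 bytes at once. A minimal sketch (illustrative only; the PR kept the dual-memchr fallback instead):

```c
#include <stdint.h>
#include <string.h>

/* Broadcast a byte value into all 8 lanes of a 64-bit word. */
#define BROADCAST(c) (0x0101010101010101ULL * (uint8_t)(c))

/* Nonzero iff any byte of v is zero (exact, no false positives). */
static uint64_t has_zero_byte(uint64_t v) {
  return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Nonzero iff the 8 bytes at p contain a CR or LF. */
static int word_has_crlf(const char* p) {
  uint64_t v;
  memcpy(&v, p, sizeof(v)); /* unaligned-safe load */
  return has_zero_byte(v ^ BROADCAST('\r')) ||
         has_zero_byte(v ^ BROADCAST('\n'));
}
```

As noted above, platform-tuned memchr implementations can still win, since they use wider vectors and hand-tuned loops.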
Made some changes to ensure find_crlf never reads before or after the requested bounds; re-ran the benchmark and there is almost no drop in performance.
hmm... I added some tests a while ago. Any comments, or anything else I should change?
@shekhei It looks like there was a build error during the last CI run.
Ah, it seems GCC is stricter about it. I've made changes; let's see how the build goes.
Ok, made some changes according to the review comments.
Still green. LGTM if the suggestion made by @bnoordhuis about HTTP_MAX_HEADER_SIZE + p is not a blocker.
hmmmmm.... any news?