Various optimizations
Hi, basically I was just messing around with some benchmarking, found some interesting things, and made some changes. They are mostly about removing branches, plus a SIMD version of the CRLF search that doesn't need to walk the whole payload twice in the worst case.
Below are the benchmark results
OS X (El Capitan, MacBook Pro 13" early 2015, 3.1 GHz Intel Core i7)
Clang
| time (s, master) | req/s (master) | time (s, new) | req/s (new) | speedup (%) |
|---|---|---|---|---|
| 5.761632 | 867809.687500 | 5.104409 | 979545.312500 | 11.4069 |
| 5.769115 | 866684.062500 | 5.104001 | 979623.625000 | 11.5289 |
| 5.803461 | 861554.875000 | 5.086198 | 983052.625000 | 12.3592 |
| 5.779534 | 865121.687500 | 5.082990 | 983672.937500 | 12.0519 |
| 5.762670 | 867653.375000 | 5.092542 | 981827.875000 | 11.6288 |
GCC 4.9
| time (s, master) | req/s (master) | time (s, new) | req/s (new) | speedup (%) |
|---|---|---|---|---|
| 4.297466 | 1163476.375000 | 3.787979 | 1319965.000000 | 11.8555 |
| 4.159397 | 1202097.250000 | 3.774015 | 1324849.000000 | 9.26533 |
| 4.230322 | 1181943.125000 | 3.788900 | 1319644.250000 | 10.4347 |
| 4.214539 | 1186369.375000 | 3.772407 | 1325413.750000 | 10.4906 |
| 4.162938 | 1201074.750000 | 3.763118 | 1328685.375000 | 9.6042 |
Ubuntu 14.04 (Intel Core i7-6700 @ 3.40 GHz)
| time (s, master) | req/s (master) | time (s, new) | req/s (new) | speedup (%) |
|---|---|---|---|---|
| 3.186016 | 1569358.000000 | 2.815905 | 1775628.000000 | 11.6167 |
| 3.186449 | 1569144.875000 | 2.846761 | 1756382.125000 | 10.6604 |
| 3.190864 | 1566973.625000 | 2.877822 | 1737425.125000 | 9.81057 |
| 3.183230 | 1570731.625000 | 2.813425 | 1777193.250000 | 11.6173 |
| 3.190520 | 1567142.625000 | 2.849443 | 1754728.875000 | 10.6903 |
I am not sure how to ignore whitespace when viewing the diff; the changes are actually very minimal.
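For anyone skimming the thread, the SIMD idea above can be sketched roughly like this (a minimal SSE2 sketch; the function name and structure are illustrative, not the PR's exact code):

```c
/* Minimal sketch of the SSE2 idea: compare 16 bytes at a time against CR
 * and LF, OR the two results, and turn the byte-wise matches into a bit
 * mask with _mm_movemask_epi8; count-trailing-zeros then gives the index
 * of the first match. Requires SSE2 and a GCC/Clang-style __builtin_ctz. */
#include <emmintrin.h>
#include <stddef.h>

/* Return a pointer to the first CR or LF in [p, p + len), or p + len. */
static const char* find_crlf_sketch(const char* p, size_t len) {
  const char* end = p + len;
  const __m128i vCR = _mm_set1_epi8('\r');
  const __m128i vLF = _mm_set1_epi8('\n');

  while (p + 16 <= end) {
    __m128i chunk = _mm_loadu_si128((const __m128i*)p);
    __m128i hits = _mm_or_si128(_mm_cmpeq_epi8(chunk, vCR),
                                _mm_cmpeq_epi8(chunk, vLF));
    int mask = _mm_movemask_epi8(hits); /* bit i set => byte i matched */
    if (mask != 0)
      return p + __builtin_ctz(mask);  /* index of first matching byte */
    p += 16;
  }
  /* Scalar tail for the final (len % 16) bytes. */
  for (; p < end; p++)
    if (*p == '\r' || *p == '\n')
      return p;
  return end;
}
```

Unlike a dual-memchr fallback, this touches each byte once even when no CR or LF is present.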
@shekhei you can use https://github.com/nodejs/http-parser/pull/342/files?w=1 to see the diff without whitespace changes.
BTW, it looks like the style in this file is to use lower-case for labels, not upper-case as you have for `HEADER_FIELD_BEGIN`.
@shekhei thank you for doing this! I've experimented with vector intrinsics some time before, but for whatever reason I had at the moment decided to not proceed further.
Is most of the speedup achieved by using vector intrinsics? How much do the other changes contribute?
The fallthroughs and the `h_general` jump accounted for roughly 20-30%. The vector part really depends on the size of the headers and on the builtin `memchr` variants.
I was really surprised by the GCC/Clang difference though: compiled with GCC 6, it's almost twice as fast as the Clang build on OS X.
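The fallthrough part can be illustrated with a toy state machine (purely hypothetical states, not the parser's real ones): ordinary header bytes fall through into a tight general-case loop instead of re-dispatching through the big switch for every character.

```c
#include <stddef.h>

enum mini_state { s_field_start, s_field_general };

/* Count header-field-name characters up to a ':'. Illustrative only. */
static size_t count_field_chars(const char* p, size_t len) {
  size_t i = 0, n = 0;
  enum mini_state state = s_field_start;
  while (i < len) {
    switch (state) {
      case s_field_start:
        state = s_field_general;
        /* fall through -- no extra trip around the switch */
      case s_field_general:
        /* tight inner loop: consume a run of ordinary bytes without
         * re-entering the switch for each one */
        while (i < len && p[i] != ':') {
          n++;
          i++;
        }
        return n; /* stopped at ':' or at end of input */
    }
  }
  return n;
}
```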
@craig65535 you are right, let me change them to lower case and use snake_case for function names.
hmm.. any news on this pr?
hmm... :P
hmm... any news?
Eager to see this landed. I get about a 21% speedup on an i7-7700HQ on Windows.
Just tested a wider version (256 bits, requiring AVX2) and got another 8% on top of that. The SSE2-only version needs no dispatching, which keeps things simple. Happy to help with an AVX2 implementation (using either preprocessor directives or runtime dispatching) if there's interest.
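For reference, the 256-bit variant could look roughly like this (a sketch, not the code actually tested above; the function name is mine, and a real build would guard it behind `-mavx2` or runtime dispatch):

```c
/* Same CR/LF compare-and-movemask idea as the SSE2 path, but 32 bytes
 * per step via AVX2. The target attribute lets GCC/Clang compile this
 * function with AVX2 enabled without passing -mavx2 globally. */
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static uint32_t get_crlf_mask_avx2(const char* p) {
  const __m256i vCR = _mm256_set1_epi8('\r');
  const __m256i vLF = _mm256_set1_epi8('\n');
  __m256i chunk = _mm256_loadu_si256((const __m256i*)p);
  __m256i hits = _mm256_or_si256(_mm256_cmpeq_epi8(chunk, vCR),
                                 _mm256_cmpeq_epi8(chunk, vLF));
  /* Bit i of the result is set iff byte i of the 32-byte block is CR/LF. */
  return (uint32_t)_mm256_movemask_epi8(hits);
}
```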
Ok, it seems my whole branch has gone cranky after not being updated for so long; let me clean it up.
Ok, latest changes made
@indutny What do you think of landing this awesome PR? It would be nice to have this core part of Node.js sped up!
This PR is awesome indeed!
I'll try to find time to review it this week. Sorry for not doing it sooner!
@bnoordhuis @mscdex your help would be very appreciated.
It would be interesting to see how much of a difference there would be in using NEON on ARM for the parts that use intrinsics. I might try that this weekend.
Clang doesn't define __GNUC__, yet it does support emmintrin.h. We need to remove the __GNUC__ checks and just leave __SSE2__ so this code is efficient with Clang as well.
Clang does define __GNUC__ actually. (Have you found some version that doesn't?)
Well, here it wasn’t compiling that part and it was because of that __GNUC__ check. I'll dig deeper.
Did you have -march=[something with at least SSE2] or -msse2?
Ok I just added an implementation that uses NEON and I see ~13% improvement when testing on an Odroid C2:
Before:
Benchmark result:
Took 29.264072 seconds to run
170857.968750 req/sec
After:
Benchmark result:
Took 25.583895 seconds to run
195435.453125 req/sec
FWIW here is my NEON patch on top of this PR as it currently stands:
--- http_parser.c.orig 2017-09-23 16:50:15.330221304 -0400
+++ http_parser.c 2017-09-24 02:01:58.228557556 -0400
@@ -443,7 +443,7 @@
} \
ch = *++p; \
++parser->nread; \
- } else \
+ }
/**
@@ -485,19 +485,17 @@
const char* find_crlf(const char* p, const char* data, size_t len);
-#if defined(_MSC_VER) // SSE2 is baseline
-#include <intrin.h>
-#define __builtin_ctz _tzcnt_u32
-#elif defined(__SSE2__) && defined(__GNUC__)
-#include <emmintrin.h>
-#endif
-
#ifndef USE_INTRISICS_CRLF
-# define USE_INTRISICS_CRLF defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__))
+# define USE_INTRISICS_CRLF 1
#endif
-#if USE_INTRISICS_CRLF
-
+#if USE_INTRISICS_CRLF && (defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__)))
+# if defined(_MSC_VER) // SSE2 is baseline
+# include <intrin.h>
+# define __builtin_ctz _tzcnt_u32
+# elif defined(__SSE2__) && defined(__GNUC__)
+# include <emmintrin.h>
+# endif
int32_t get_crlf_mask(const char* p) {
/* [ c, 0, 0, 0, 0, 0 .. 0 ] */
const __m128i vCR = _mm_set1_epi8(0x0a);
@@ -512,7 +510,37 @@
v2 = _mm_or_si128(v1, v2);
return _mm_movemask_epi8(v2);
}
+#elif USE_INTRISICS_CRLF && defined(__ARM_NEON)
+# include <arm_neon.h>
+uint16_t get_crlf_mask(const char* p) {
+ static const uint8_t __attribute__ ((aligned (16))) _powers_array[16] =
+ { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
+ uint8x16_t _powers = vld1q_u8(_powers_array);
+
+  /* CR (0x0d) broadcast to all 16 lanes */
+ const uint8x16_t vCR = vdupq_n_u8(0x0d);
+
+  /* LF (0x0a) broadcast to all 16 lanes */
+ const uint8x16_t vLF = vdupq_n_u8(0x0a);
+
+ uint8x16_t v1, v2;
+  v1 = vld1q_u8((const uint8_t*)p); /* unaligned-safe load */
+ v2 = vceqq_u8(vCR, v1);
+ v1 = vceqq_u8(vLF, v1);
+ v2 = vorrq_u8(v1, v2);
+
+  /* Compute the mask from the input */
+ uint64x2_t mask = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(v2, _powers))));
+
+  /* Get the resulting bytes */
+ uint16_t output;
+ vst1q_lane_u8((uint8_t*)&output + 0, (uint8x16_t)mask, 0);
+ vst1q_lane_u8((uint8_t*)&output + 1, (uint8x16_t)mask, 8);
+ return output;
+}
+#endif
+#if USE_INTRISICS_CRLF && ((defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__))) || defined(__ARM_NEON))
const char* find_crlf(const char* p, const char* data, size_t len) {
const char* lastp = MIN(data + len, HTTP_MAX_HEADER_SIZE + p);
@@ -525,7 +553,7 @@
result = (result >> alignment) << alignment;
if (!result) {
- while(!result && lastp >= p+32) {
+ while (!result && lastp >= p + 32) {
p += 32;
result = get_crlf_mask(p + 16) << 16 | get_crlf_mask(p);
}
@@ -539,9 +567,7 @@
}
return p;
}
-
#else
-
const char* find_crlf(const char* p, const char* data, size_t len) {
const char* p_cr;
const char* p_lf;
@@ -563,7 +589,6 @@
}
return p;
}
-
#endif
/* Our URL parser.
UPDATE: Patch updated. It turns out initializing a static SIMD array once vs. always creating it in the crlf-searching function makes no difference, so I've removed that extra bit of code. I've also factored out the common find_crlf() used by both SIMD-based implementations.
@Kronuz actually I have tested it on Clang too and it works; maybe you can share which version you are using?
I've updated my patch a bit.
I'm also wondering if the non-SIMD implementation might benefit from a SWAR implementation over the current dual memchr() solution?
@mscdex I did try optimizing by packing bytes into 64-bit words and comparing those, but I don't think the improvement was significant. Given that memchr is always optimized for the specific platform, if my memory serves me right, it started losing out rather quickly.
I might be wrong though.
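For context, the SWAR approach being discussed is usually built on the classic "has a zero byte" trick: XOR a 64-bit word with a broadcast CR (or LF) byte, then one subtract/AND pair tests 8 bytes at once. A minimal sketch (illustrative only; the PR kept the dual-memchr fallback instead):

```c
#include <stdint.h>
#include <string.h>

/* Broadcast a byte value into all 8 lanes of a 64-bit word. */
#define BROADCAST(c) (0x0101010101010101ULL * (uint8_t)(c))

/* Nonzero iff any byte of v is zero (exact, no false positives). */
static uint64_t has_zero_byte(uint64_t v) {
  return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Nonzero iff the 8 bytes at p contain a CR or LF. */
static int word_has_crlf(const char* p) {
  uint64_t v;
  memcpy(&v, p, sizeof(v)); /* unaligned-safe load */
  return has_zero_byte(v ^ BROADCAST('\r')) ||
         has_zero_byte(v ^ BROADCAST('\n'));
}
```

As noted above, platform-tuned memchr implementations can still win, since they use wider vectors and hand-tuned loops.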
Made some changes to ensure find_crlf never reads before or after the requested bounds; re-ran the benchmark and there is almost no drop in performance.
hmm... I added some tests a while ago. Any comments, or anything else I should change?
@shekhei It looks like there was a build error during the last CI run.
Ah, it seems GCC is stricter about it. I've made changes; let's see how the build goes.
Ok, made some changes according to the review comments.
Still green. LGTM if the suggestion made by @bnoordhuis about HTTP_MAX_HEADER_SIZE + p is not a blocker.
hmmmmm.... any news?