
Various optimizations

Open shekhei opened this issue 9 years ago • 35 comments

Hi, basically I was just messing around and did some benchmarking, found some interesting things, and made some changes. They are mostly about removing some of the branches, plus a SIMD version of finding CRLF that doesn't need to walk the whole payload twice (worst case).
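To give a rough idea of the single-pass approach, here is a minimal sketch (assuming SSE2 and GCC/Clang builtins; simplified relative to the actual patch, with illustrative names and no alignment or max-header-size handling):

#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical sketch only: one 16-byte compare finds both '\r' and '\n'
 * in the same pass, instead of scanning once per character. */
static const char* find_crlf_sketch(const char* p, const char* end) {
  while (p + 16 <= end) {
    __m128i v = _mm_loadu_si128((const __m128i*)p);
    __m128i m = _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8('\r')),
                             _mm_cmpeq_epi8(v, _mm_set1_epi8('\n')));
    int mask = _mm_movemask_epi8(m);
    if (mask != 0)
      return p + __builtin_ctz(mask);  /* first CR/LF within the block */
    p += 16;
  }
  for (; p < end; ++p)                 /* scalar tail */
    if (*p == '\r' || *p == '\n')
      return p;
  return end;
}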

Below are the benchmark results

OSX (el capitan, mbp 13" early 2015, 3.1 GHz Intel Core i7)

Clang

time (s, master) req/s (master) time (s, new) req/s (new) speedup (%)
5.761632 867809.687500 5.104409 979545.312500 11.4069
5.769115 866684.062500 5.104001 979623.625000 11.5289
5.803461 861554.875000 5.086198 983052.625000 12.3592
5.779534 865121.687500 5.082990 983672.937500 12.0519
5.762670 867653.375000 5.092542 981827.875000 11.6288

GCC 4.9

time (s, master) req/s (master) time (s, new) req/s (new) speedup (%)
4.297466 1163476.375000 3.787979 1319965.000000 11.8555
4.159397 1202097.250000 3.774015 1324849.000000 9.26533
4.230322 1181943.125000 3.788900 1319644.250000 10.4347
4.214539 1186369.375000 3.772407 1325413.750000 10.4906
4.162938 1201074.750000 3.763118 1328685.375000 9.6042

ubuntu 14.04 ( Intel Core i7-6700 @ 3.40GHz )

time (s, master) req/s (master) time (s, new) req/s (new) speedup (%)
3.186016 1569358.000000 2.815905 1775628.000000 11.6167
3.186449 1569144.875000 2.846761 1756382.125000 10.6604
3.190864 1566973.625000 2.877822 1737425.125000 9.81057
3.183230 1570731.625000 2.813425 1777193.250000 11.6173
3.190520 1567142.625000 2.849443 1754728.875000 10.6903

shekhei avatar Oct 19 '16 12:10 shekhei

I am not sure how to ignore whitespace when viewing the diff; the changes are actually very minimal.

shekhei avatar Oct 19 '16 12:10 shekhei

@shekhei you can use https://github.com/nodejs/http-parser/pull/342/files?w=1 to see the diff without whitespace changes.

BTW it looks like the style in this file is to use lower-case for labels, not upper-case as you have for HEADER_FIELD_BEGIN

craig65535 avatar Oct 19 '16 20:10 craig65535

@shekhei thank you for doing this! I've experimented with vector intrinsics some time before, but for whatever reason I had at the moment decided to not proceed further.

Is most of the speedup achieved by using vector intrinsics? How much do the other changes contribute?

indutny avatar Oct 19 '16 23:10 indutny

The fallthroughs and the h_general jump accounted for roughly 20-30%. The vector stuff really depends on the size of the headers and on the builtin memchr variants.
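As a rough illustration of what cutting the per-byte dispatch buys, a hypothetical sketch (not the actual http_parser state machine; the function name and terminator set are made up for illustration):

#include <stddef.h>

/* Hypothetical sketch only: once the parser is in the "general"
 * header-field state, consume bytes in a tight loop instead of
 * re-entering the big per-character switch for every byte. */
static const char* scan_header_field(const char* p, const char* end) {
  while (p < end && *p != ':' && *p != '\r' && *p != '\n')
    ++p;  /* hot loop: no per-byte state dispatch */
  return p;
}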

I was really surprised about the GCC and Clang difference though: if I use GCC 6 to compile it, my god, it's almost twice as fast as the Clang build on OS X.

shekhei avatar Oct 20 '16 00:10 shekhei

@craig65535 you are right, let me change them to lower case and use snake_case for function names.

shekhei avatar Oct 20 '16 02:10 shekhei

hmm.. any news on this pr?

shekhei avatar Dec 21 '16 06:12 shekhei

hmm... :P

shekhei avatar Jan 24 '17 14:01 shekhei

hmm... any news?

shekhei avatar Aug 22 '17 15:08 shekhei

Eager to see this landed. I get about a 21% speedup on an i7-7700HQ on Windows.

Just tested a wider version (256 bits, requiring AVX2) and got another 8% on top of that. The SSE2-only version needs no dispatching, though, which keeps it simple. Happy to help with an AVX2 implementation (using either preprocessor directives or runtime dispatching) if there's interest.
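For reference, a 256-bit analogue of get_crlf_mask() could look roughly like this (a sketch assuming AVX2 and something like -mavx2 at compile time; the name and exact shape are illustrative, not from this PR):

#include <stdint.h>
#include <immintrin.h>

/* Hypothetical AVX2 variant: returns one mask bit per byte that is CR or LF
 * within a 32-byte block. Same idea as the SSE2 version, just widened. */
static uint32_t get_crlf_mask_avx2(const char* p) {
  const __m256i vCR = _mm256_set1_epi8('\r');
  const __m256i vLF = _mm256_set1_epi8('\n');
  __m256i v = _mm256_loadu_si256((const __m256i*)p);
  __m256i m = _mm256_or_si256(_mm256_cmpeq_epi8(v, vCR),
                              _mm256_cmpeq_epi8(v, vLF));
  return (uint32_t)_mm256_movemask_epi8(m);
}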

zbjornson avatar Sep 16 '17 21:09 zbjornson

Ok, it seems like my whole branch has gone stale after not being updated for so long; let me clean it up.

shekhei avatar Sep 19 '17 01:09 shekhei

Ok, latest changes made

shekhei avatar Sep 19 '17 01:09 shekhei

@indutny What do you think of landing this awesome PR? Will be nice to have this core part of node.js sped up!

zbjornson avatar Sep 22 '17 17:09 zbjornson

This PR is awesome indeed!

I'll try to find time to review it this week. Sorry for not doing it sooner!

@bnoordhuis @mscdex your help would be very appreciated.

indutny avatar Sep 22 '17 18:09 indutny

It would be interesting to see how much of a difference there would be in using NEON on ARM for the parts that use intrinsics. I might try that this weekend.

mscdex avatar Sep 22 '17 20:09 mscdex

Clang doesn't define __GNUC__, yet it does support emmintrin.h. Please, we need to remove the __GNUC__ checks and just check __SSE2__ so that this code is also efficient with Clang.

Kronuz avatar Sep 23 '17 15:09 Kronuz

Clang does define __GNUC__ actually. (Have you found some version that doesn't?)

zbjornson avatar Sep 23 '17 15:09 zbjornson

Well, here it wasn’t compiling that part and it was because of that __GNUC__ check. I'll dig deeper.

Kronuz avatar Sep 23 '17 17:09 Kronuz

Did you have -march=[something with at least SSE2] or -msse2?

zbjornson avatar Sep 23 '17 17:09 zbjornson

Ok I just added an implementation that uses NEON and I see ~13% improvement when testing on an Odroid C2:

Before:

Benchmark result:
Took 29.264072 seconds to run
170857.968750 req/sec

After:

Benchmark result:
Took 25.583895 seconds to run
195435.453125 req/sec

mscdex avatar Sep 23 '17 20:09 mscdex

FWIW here is my NEON patch on top of this PR as it currently stands:

--- http_parser.c.orig	2017-09-23 16:50:15.330221304 -0400
+++ http_parser.c	2017-09-24 02:01:58.228557556 -0400
@@ -443,7 +443,7 @@
     }                                                         \
     ch = *++p;                                                \
     ++parser->nread;                                          \
-  } else                                                      \
+  }
 
 
 /**
@@ -485,19 +485,17 @@
 
 const char* find_crlf(const char* p, const char* data, size_t len);
 
-#if defined(_MSC_VER) // SSE2 is baseline
-#include <intrin.h>
-#define __builtin_ctz _tzcnt_u32
-#elif defined(__SSE2__) && defined(__GNUC__)
-#include <emmintrin.h>
-#endif
-
 #ifndef USE_INTRISICS_CRLF
-# define USE_INTRISICS_CRLF defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__))
+# define USE_INTRISICS_CRLF 1
 #endif
 
-#if USE_INTRISICS_CRLF
-
+#if USE_INTRISICS_CRLF && (defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__)))
+# if defined(_MSC_VER) // SSE2 is baseline
+#  include <intrin.h>
+#  define __builtin_ctz _tzcnt_u32
+# elif defined(__SSE2__) && defined(__GNUC__)
+#  include <emmintrin.h>
+# endif
 int32_t get_crlf_mask(const char* p) {
   /* [ c, 0, 0, 0, 0, 0 .. 0 ] */
   const __m128i vCR = _mm_set1_epi8(0x0a);
@@ -512,7 +510,37 @@
   v2 = _mm_or_si128(v1, v2);
   return _mm_movemask_epi8(v2);
 }
+#elif USE_INTRISICS_CRLF && defined(__ARM_NEON)
+# include <arm_neon.h>
+uint16_t get_crlf_mask(const char* p) {
+  static const uint8_t __attribute__ ((aligned (16))) _powers_array[16] =
+    { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
+  uint8x16_t _powers = vld1q_u8(_powers_array);
+
+  /* [ c, 0, 0, 0, 0, 0 .. 0 ] */
+  const uint8x16_t vCR = vdupq_n_u8(0x0d);
+
+  /* [ c, 0, 0, 0, 0, 0 .. 0 ] */
+  const uint8x16_t vLF = vdupq_n_u8(0x0a);
+
+  uint8x16_t v1, v2;
+  v1 = *(uint8x16_t*)p;
+  v2 = vceqq_u8(vCR, v1);
+  v1 = vceqq_u8(vLF, v1);
+  v2 = vorrq_u8(v1, v2);
+
+  // Compute the mask from the input
+  uint64x2_t mask = vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(v2, _powers))));
+
+  // Get the resulting bytes
+  uint16_t output;
+  vst1q_lane_u8((uint8_t*)&output + 0, (uint8x16_t)mask, 0);
+  vst1q_lane_u8((uint8_t*)&output + 1, (uint8x16_t)mask, 8);
 
+  return output;
+}
+#endif
+#if USE_INTRISICS_CRLF && ((defined(_MSC_VER) || (defined(__SSE2__) && defined(__GNUC__))) || defined(__ARM_NEON))
 const char* find_crlf(const char* p, const char* data, size_t len) {
   const char* lastp = MIN(data + len, HTTP_MAX_HEADER_SIZE + p);
 
@@ -525,7 +553,7 @@
   result = (result >> alignment) << alignment;
 
   if (!result) {
-    while(!result && lastp >= p+32) {
+    while (!result && lastp >= p + 32) {
       p += 32;
       result = get_crlf_mask(p + 16) << 16 | get_crlf_mask(p);
     }
@@ -539,9 +567,7 @@
   }
   return p;
 }
-
 #else
-
 const char* find_crlf(const char* p, const char* data, size_t len) {
   const char* p_cr;
   const char* p_lf;
@@ -563,7 +589,6 @@
   }
   return p;
 }
-
 #endif
 
 /* Our URL parser.

UPDATE: Patch updated. It turns out initializing a static SIMD array once vs. always creating it in the crlf-searching function makes no difference, so I've removed that extra bit of code. I've also factored out the common find_crlf() used by both SIMD-based implementations.

mscdex avatar Sep 23 '17 20:09 mscdex

@Kronuz actually I have tested it on clang too, it works, maybe you can share which version you are using?

shekhei avatar Sep 24 '17 01:09 shekhei

I've updated my patch a bit.

I'm also wondering if the non-SIMD implementation might benefit from a SWAR implementation over the current dual memchr() solution?
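Something along the lines of the classic zero-byte bit trick, e.g. (a rough sketch, untested against this PR; the helper name is made up):

#include <stdint.h>
#include <string.h>

/* Hypothetical SWAR sketch: report whether any of the 8 bytes at p is
 * '\r' or '\n', using the "has zero byte" trick on a 64-bit word. */
static uint64_t has_crlf8(const char* p) {
  const uint64_t ones = 0x0101010101010101ULL;
  const uint64_t high = 0x8080808080808080ULL;
  uint64_t v;
  memcpy(&v, p, sizeof(v));                    /* unaligned-safe load */
  uint64_t xcr = v ^ (ones * (uint64_t)'\r');  /* CR bytes become zero */
  uint64_t xlf = v ^ (ones * (uint64_t)'\n');  /* LF bytes become zero */
  return ((xcr - ones) & ~xcr & high) |        /* nonzero iff any zero byte */
         ((xlf - ones) & ~xlf & high);
}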

mscdex avatar Sep 24 '17 06:09 mscdex

@mscdex I did try optimizing by packing into 64-bit words and then comparing them, but I don't think the improvements were significant. Also, given that memchr is always optimized for the specific platform, if my memory serves me right it started losing out rather quickly.

I might be wrong though.

shekhei avatar Sep 24 '17 11:09 shekhei

Made some changes to ensure find_crlf doesn't go before or after the requested bounds; did a bench, and there's almost no drop in performance.

shekhei avatar Sep 25 '17 16:09 shekhei

hmm... I added some tests a while ago, any comments? Or anything else that I should change?

shekhei avatar Oct 18 '17 13:10 shekhei

@shekhei It looks like there was a build error during the last CI run.

mscdex avatar Oct 18 '17 14:10 mscdex

Ah, it seems GCC is stricter about it; I have modified it, let's see how the build goes.

shekhei avatar Oct 19 '17 02:10 shekhei

Ok, made some changes according to the comments.

shekhei avatar Oct 19 '17 13:10 shekhei

Still green. LGTM if the suggestion made by @bnoordhuis about HTTP_MAX_HEADER_SIZE + p is not a blocker.

mscdex avatar Oct 19 '17 14:10 mscdex

hmmmmm.... any news?

shekhei avatar Jan 08 '18 03:01 shekhei