bytestring icon indicating copy to clipboard operation
bytestring copied to clipboard

Avoid per-byte loop in cstring{,Utf8} builders

Open vdukhovni opened this issue 2 years ago • 18 comments

Copy chunks of the input to the output buffer with 'memcpy', up to the shorter of the available buffer space and the "null-free" portion of the remaining string. For the UTF8 version, encoded NUL bytes are located via strstr(3).

vdukhovni avatar Jan 13 '23 04:01 vdukhovni

The emulated CI build failures are spurious/systemic, not related to the PR.

If I add a couple of new benchmarks that use somewhat longer string literals in builders:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
         , benchB' "UTF-8 String"  () $ \() -> P.cstringUtf8 "hello world\0"#
         , benchB' "String (naive)" "hello world!" fromString
         , benchB' "String"        () $ \() -> P.cstring "hello world!"#
+        , benchB' "AsciiLit64"   () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+        , benchB' "Utf8Lit64"   () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
         ]
 
       , bgroup "Encoding wrappers"

The relevant benchmark results (GHC 9.4.5) are:

$ cabal run bytestring-bench -- --baseline baseline-lit-9.4.csv --csv new-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.43s)
        278  ns ±  19 ns, 66% less than baseline
      Utf8Lit64:  OK (1.72s)
        356  ns ±  23 ns, 58% less than baseline

All 2 tests passed (3.19s)

The baseline master branch run was:

$ cabal run bytestring-bench -- --csv baseline-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.07s)
        832  ns ±  79 ns
      Utf8Lit64:  OK (1.06s)
        846  ns ±  75 ns

All 2 tests passed (2.16s)

vdukhovni avatar Jan 13 '23 06:01 vdukhovni

Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0.

I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass.

Your cstring_step does more or less the same thing as byteStringCopyStep in Builder.Internal.

I will take a closer look later.

clyring avatar Jan 13 '23 13:01 clyring

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

chessai avatar Jan 15 '23 17:01 chessai

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

Sure. Done. I do hope we won't forget to squash before merging...

vdukhovni avatar Jan 15 '23 19:01 vdukhovni

If there's anything further I need to do, please let me know...

vdukhovni avatar Jan 23 '23 06:01 vdukhovni

@vdukhovni please rebase to trigger updated CI jobs.

Bodigrim avatar Feb 08 '23 23:02 Bodigrim

@vdukhovni please rebase to trigger updated CI jobs.

Done.

vdukhovni avatar Feb 09 '23 04:02 vdukhovni

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

clyring avatar Sep 27 '23 01:09 clyring

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

It's basically ready, right. There were just some cosmetic issues that perhaps a maintainer could tweak to suite their preference and I can review the result? Does that work?

vdukhovni avatar Sep 27 '23 02:09 vdukhovni

Perhaps I can get this over the line. What remains to be done?

vdukhovni avatar Oct 11 '23 02:10 vdukhovni

This PR is languishing. Where do we go from here?

vdukhovni avatar Jan 21 '24 00:01 vdukhovni

The main questions I had were the ones I raised in this round of review. I've just started to look into them myself since I'd really like this patch to land eventually.

Another idea that has since occurred to me is that since 0xC0 never occurs in valid UTF-8 (since it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.

clyring avatar Jan 21 '24 00:01 clyring

Another idea that has since occurred to me is that since 0xC0 never occurs in valid UTF-8 (since it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.

Perhaps, though one might expect that an optimised C-library strstr already does the equivalent of memchr to find the start of a potential match... And indeed that's what happens in the glibc implementation

vdukhovni avatar Jan 21 '24 16:01 vdukhovni

Heads-up: I'll probably push some updates and finishing touches to this branch later today or tomorrow.

clyring avatar Feb 11 '24 16:02 clyring

The small-builder benchmarks were set up in a terrible way that made using them to investigate performance very difficult. My recent push should hopefully fix that.

clyring avatar Feb 15 '24 04:02 clyring

Here's what the benchmarks currently look like on my machine with ghc-9.8.1:

Baseline (3ce03463fc26c4b8d96b1f50e4acec0cfe0e4d1b):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.7 ns ± 810 ps
      toLazyByteString mempty:        OK
        452  ns ±  24 ns
      empty (10000 times):            OK
        126  μs ± 4.0 μs
      ensureFree 8:                   OK
        16.6 ns ± 624 ps
      intHost 1:                      OK
        25.7 ns ± 1.2 ns
      UTF-8 String (12B, naive):      OK
        101  ns ± 1.6 ns
      UTF-8 String (12B):             OK
        104  ns ±  35 ns
      UTF-8 String (64B, naive):      OK
        311  ns ±  16 ns
      UTF-8 String (64B):             OK
        356  ns ± 9.7 ns
      UTF-8 String (64B, half nulls): OK
        501  ns ±  15 ns
      UTF-8 String (64B, all nulls):  OK
        335  ns ±  12 ns
      String (12B, naive):            OK
        122  ns ± 2.5 ns
      String (12B):                   OK
        86.8 ns ± 3.1 ns
      String (64B, naive):            OK
        327  ns ±  17 ns
      String (64B):                   OK
        279  ns ±  11 ns

Topic (26030097e5d66077dd078b2dc8ffcf0e2075d805):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.5 ns ± 830 ps,       same as baseline
      toLazyByteString mempty:        OK
        452  ns ±  27 ns,       same as baseline
      empty (10000 times):            OK
        133  μs ± 5.8 μs,  5% more than baseline
      ensureFree 8:                   OK
        16.9 ns ± 844 ps,       same as baseline
      intHost 1:                      OK
        25.5 ns ± 818 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        489  ns ±  29 ns, 383% more than baseline
      UTF-8 String (12B):             OK
        61.2 ns ± 1.5 ns, 41% less than baseline
      UTF-8 String (64B, naive):      OK
        2.36 μs ±  82 ns, 657% more than baseline
      UTF-8 String (64B):             OK
        61.4 ns ± 3.0 ns, 82% less than baseline
      UTF-8 String (64B, half nulls): OK
        563  ns ±  21 ns, 12% more than baseline
      UTF-8 String (64B, all nulls):  OK
        765  ns ±  24 ns, 128% more than baseline
      String (12B, naive):            OK
        499  ns ±  12 ns, 310% more than baseline
      String (12B):                   OK
        24.7 ns ± 3.3 ns, 71% less than baseline
      String (64B, naive):            OK
        2.37 μs ± 108 ns, 623% more than baseline
      String (64B):                   OK
        23.3 ns ± 1.3 ns, 91% less than baseline

Lots of big changes. Some are expected:

  • The ASCII memcpy implementation is much faster except perhaps on trivially small strings: ~70% less run-time on "hello world!" and ~90% less run-time on the 64-byte case.
  • For modified UTF-8 strings without many embedded nulls, the new implementation is much faster. But it suffers when there are many embedded nulls, roughly breaking even at half-nulls and taking ~2.3x as long when there are only nulls. (Perhaps it would make sense to directly check if the first byte is 0xC0 before calling the C search function, to reduce this regression's magnitude a little.)

But there's also a nasty surprise:

  • The benchmarks for the "naive" case (where rewriting to the CString/Add# versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new {-# NOINLINE stringUtf8 #-} the primMapListBounded in stringUtf8 only gets one argument and primMapListBounded needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of primMapListBounded and see if that fixes this.

I also wanted to see how the memchr implementation compares with the strstr implementation. It seems they're about the same. Here are the memchr numbers, with strstr as the "baseline":

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.4 ns ± 456 ps,       same as baseline
      toLazyByteString mempty:        OK
        445  ns ±  21 ns,       same as baseline
      empty (10000 times):            OK
        124  μs ± 6.3 μs,  6% less than baseline
      ensureFree 8:                   OK
        17.7 ns ± 658 ps,       same as baseline
      intHost 1:                      OK
        25.8 ns ± 484 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        490  ns ±  11 ns,       same as baseline
      UTF-8 String (12B):             OK
        62.7 ns ± 1.6 ns,       same as baseline
      UTF-8 String (64B, naive):      OK
        2.40 μs ±  84 ns,       same as baseline
      UTF-8 String (64B):             OK
        64.4 ns ± 1.7 ns,       same as baseline
      UTF-8 String (64B, half nulls): OK
        616  ns ±  24 ns,  9% more than baseline
      UTF-8 String (64B, all nulls):  OK
        839  ns ± 349 ns,       same as baseline
      String (12B, naive):            OK
        500  ns ±  16 ns,       same as baseline
      String (12B):                   OK
        26.3 ns ± 356 ps,       same as baseline
      String (64B, naive):            OK
        2.28 μs ±  68 ns,       same as baseline
      String (64B):                   OK
        22.1 ns ± 752 ps,       same as baseline

clyring avatar Feb 15 '24 14:02 clyring

* The benchmarks for the "naive" case (where rewriting to the `CString`/`Add#` versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new `{-# NOINLINE stringUtf8 #-}` the `primMapListBounded` in `stringUtf8` only gets one argument and `primMapListBounded` needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of `primMapListBounded` and see if that fixes this.

I have confirmed that reducing the syntactic arity of primMapListBounded fixes this regression.

clyring avatar Feb 15 '24 14:02 clyring