bytestring
Avoid per-byte loop in cstring{,Utf8} builders
Copy chunks of the input to the output buffer with `memcpy`, up to the shorter of the available buffer space and the NUL-free portion of the remaining string. For the UTF-8 version, encoded NUL bytes are located via strstr(3).
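To sketch the ASCII case (illustration only: the helper name `copyChunk` and the plain-pointer interface are invented here, while the real code is written against the builder's internal buffer machinery), one step of that chunked copy could look like:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
-- Simplified sketch: copy the NUL-free prefix of a NUL-terminated string
-- into a destination buffer with one memcpy, bounded by the space available.
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))
import Foreign.Ptr (Ptr, castPtr, plusPtr)

foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

foreign import ccall unsafe "string.h memcpy"
  c_memcpy :: Ptr Word8 -> Ptr Word8 -> CSize -> IO (Ptr Word8)

-- | Copy as many bytes as both fit in the buffer and precede the next NUL,
-- returning the advanced source and destination pointers.
copyChunk :: CString -> Ptr Word8 -> Int -> IO (CString, Ptr Word8)
copyChunk src dst avail = do
  nullFree <- c_strlen src                    -- bytes before the next NUL
  let n = min avail (fromIntegral nullFree)   -- don't overrun the buffer
  _ <- c_memcpy dst (castPtr src) (fromIntegral n)
  pure (src `plusPtr` n, dst `plusPtr` n)
```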
The emulated CI build failures are spurious/systemic, not related to the PR.
If I add a couple of new benchmarks that use somewhat longer string literals in builders:
```diff
--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
, benchB' "UTF-8 String" () $ \() -> P.cstringUtf8 "hello world\0"#
, benchB' "String (naive)" "hello world!" fromString
, benchB' "String" () $ \() -> P.cstring "hello world!"#
+ , benchB' "AsciiLit64" () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+ , benchB' "Utf8Lit64" () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
]
, bgroup "Encoding wrappers"
The relevant benchmark results (GHC 9.4.5) are:
$ cabal run bytestring-bench -- --baseline baseline-lit-9.4.csv --csv new-lit-9.4.csv -p '/Lit64/'
Up to date
All
Data.ByteString.Builder
Small payload
AsciiLit64: OK (1.43s)
278 ns ± 19 ns, 66% less than baseline
Utf8Lit64: OK (1.72s)
356 ns ± 23 ns, 58% less than baseline
All 2 tests passed (3.19s)
The baseline master branch run was:
$ cabal run bytestring-bench -- --csv baseline-lit-9.4.csv -p '/Lit64/'
Up to date
All
Data.ByteString.Builder
Small payload
AsciiLit64: OK (1.07s)
832 ns ± 79 ns
Utf8Lit64: OK (1.06s)
846 ns ± 75 ns
All 2 tests passed (2.16s)
Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0.
I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass.
Your cstring_step does more or less the same thing as byteStringCopyStep in Builder.Internal.
I will take a closer look later.
nitpick: could `Ptr "\xc0\x80"#` be some top-level constant? It's used in two places and is kind of a "magic" string.
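For illustration, such a constant could look roughly like this (the name `encodedNul` is hypothetical, not the identifier used in the PR):

```haskell
{-# LANGUAGE MagicHash #-}
import Data.Word (Word8)
import GHC.Ptr (Ptr (..))

-- Hypothetical name: a single home for the "magic" modified-UTF-8
-- encoding of NUL (0xC0 0x80) used by the search code.
encodedNul :: Ptr Word8
encodedNul = Ptr "\xc0\x80"#
```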
Sure. Done. I do hope we won't forget to squash before merging...
If there's anything further I need to do, please let me know...
@vdukhovni please rebase to trigger updated CI jobs.
Done.
ping @vdukhovni
Do you plan to come back to this patch? Would you like to pass this off to a maintainer?
It's basically ready, right? There were just some cosmetic issues that perhaps a maintainer could tweak to suit their preference, and then I can review the result. Does that work?
Perhaps I can get this over the line. What remains to be done?
This PR is languishing. Where do we go from here?
The main questions I had were the ones I raised in this round of review. I've just started to look into them myself since I'd really like this patch to land eventually.
Another idea that has since occurred to me: because 0xC0 never occurs in valid UTF-8 (it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.
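For illustration, such a `memchr`-based scan could look roughly like this (a sketch only: `findEncodedNul` and its explicit length parameter are assumptions for the example, not code from this PR):

```haskell
import Data.Word (Word8)
import Foreign.C.Types (CInt (..), CSize (..))
import Foreign.Ptr (Ptr, minusPtr, nullPtr, plusPtr)
import Foreign.Storable (peekByteOff)

foreign import ccall unsafe "string.h memchr"
  c_memchr :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)

-- | Offset of the next encoded NUL (0xC0 0x80) within @len@ bytes, if any.
-- Valid UTF-8 never contains 0xC0, so every memchr hit is a likely match.
findEncodedNul :: Ptr Word8 -> Int -> IO (Maybe Int)
findEncodedNul p len
  | len < 2   = pure Nothing
  | otherwise = do
      q <- c_memchr p 0xC0 (fromIntegral (len - 1))
      if q == nullPtr
        then pure Nothing
        else do
          next <- peekByteOff q 1
          if next == (0x80 :: Word8)
            then pure (Just (q `minusPtr` p))
            else findEncodedNul (q `plusPtr` 1) (len - (q `minusPtr` p) - 1)
```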
Perhaps, though one might expect that an optimised C-library strstr already does the equivalent of memchr to find the start of a potential match... And indeed that's what happens in the glibc implementation.
Heads-up: I'll probably push some updates and finishing touches to this branch later today or tomorrow.
The small-builder benchmarks were set up in a terrible way that made using them to investigate performance very difficult. My recent push should hopefully fix that.
Here's what the benchmarks currently look like on my machine with ghc-9.8.1:
Baseline (3ce03463fc26c4b8d96b1f50e4acec0cfe0e4d1b):
All
Data.ByteString.Builder
Small payload
mempty: OK
15.7 ns ± 810 ps
toLazyByteString mempty: OK
452 ns ± 24 ns
empty (10000 times): OK
126 μs ± 4.0 μs
ensureFree 8: OK
16.6 ns ± 624 ps
intHost 1: OK
25.7 ns ± 1.2 ns
UTF-8 String (12B, naive): OK
101 ns ± 1.6 ns
UTF-8 String (12B): OK
104 ns ± 35 ns
UTF-8 String (64B, naive): OK
311 ns ± 16 ns
UTF-8 String (64B): OK
356 ns ± 9.7 ns
UTF-8 String (64B, half nulls): OK
501 ns ± 15 ns
UTF-8 String (64B, all nulls): OK
335 ns ± 12 ns
String (12B, naive): OK
122 ns ± 2.5 ns
String (12B): OK
86.8 ns ± 3.1 ns
String (64B, naive): OK
327 ns ± 17 ns
String (64B): OK
279 ns ± 11 ns
Topic (26030097e5d66077dd078b2dc8ffcf0e2075d805):
All
Data.ByteString.Builder
Small payload
mempty: OK
15.5 ns ± 830 ps, same as baseline
toLazyByteString mempty: OK
452 ns ± 27 ns, same as baseline
empty (10000 times): OK
133 μs ± 5.8 μs, 5% more than baseline
ensureFree 8: OK
16.9 ns ± 844 ps, same as baseline
intHost 1: OK
25.5 ns ± 818 ps, same as baseline
UTF-8 String (12B, naive): OK
489 ns ± 29 ns, 383% more than baseline
UTF-8 String (12B): OK
61.2 ns ± 1.5 ns, 41% less than baseline
UTF-8 String (64B, naive): OK
2.36 μs ± 82 ns, 657% more than baseline
UTF-8 String (64B): OK
61.4 ns ± 3.0 ns, 82% less than baseline
UTF-8 String (64B, half nulls): OK
563 ns ± 21 ns, 12% more than baseline
UTF-8 String (64B, all nulls): OK
765 ns ± 24 ns, 128% more than baseline
String (12B, naive): OK
499 ns ± 12 ns, 310% more than baseline
String (12B): OK
24.7 ns ± 3.3 ns, 71% less than baseline
String (64B, naive): OK
2.37 μs ± 108 ns, 623% more than baseline
String (64B): OK
23.3 ns ± 1.3 ns, 91% less than baseline
Lots of big changes. Some are expected:
- The ASCII `memcpy` implementation is much faster except perhaps on trivially small strings: ~70% less run-time on "hello world!" and ~90% less run-time on the 64-byte case.
- For modified UTF-8 strings without many embedded nulls, the new implementation is much faster. But it suffers when there are many embedded nulls, roughly breaking even at half-nulls and taking ~2.3x as long when there are only nulls. (Perhaps it would make sense to directly check if the first byte is 0xC0 before calling the C search function, to reduce this regression's magnitude a little.)
But there's also a nasty surprise:
- The benchmarks for the "naive" case (where rewriting to the `CString`/`Addr#` versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new `{-# NOINLINE stringUtf8 #-}` the `primMapListBounded` in `stringUtf8` only gets one argument and `primMapListBounded` needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of `primMapListBounded` and see if that fixes this.
I also wanted to see how the memchr implementation compares with the strstr implementation. It seems they're about the same. Here are the memchr numbers, with strstr as the "baseline":
All
Data.ByteString.Builder
Small payload
mempty: OK
15.4 ns ± 456 ps, same as baseline
toLazyByteString mempty: OK
445 ns ± 21 ns, same as baseline
empty (10000 times): OK
124 μs ± 6.3 μs, 6% less than baseline
ensureFree 8: OK
17.7 ns ± 658 ps, same as baseline
intHost 1: OK
25.8 ns ± 484 ps, same as baseline
UTF-8 String (12B, naive): OK
490 ns ± 11 ns, same as baseline
UTF-8 String (12B): OK
62.7 ns ± 1.6 ns, same as baseline
UTF-8 String (64B, naive): OK
2.40 μs ± 84 ns, same as baseline
UTF-8 String (64B): OK
64.4 ns ± 1.7 ns, same as baseline
UTF-8 String (64B, half nulls): OK
616 ns ± 24 ns, 9% more than baseline
UTF-8 String (64B, all nulls): OK
839 ns ± 349 ns, same as baseline
String (12B, naive): OK
500 ns ± 16 ns, same as baseline
String (12B): OK
26.3 ns ± 356 ps, same as baseline
String (64B, naive): OK
2.28 μs ± 68 ns, same as baseline
String (64B): OK
22.1 ns ± 752 ps, same as baseline
> * The benchmarks for the "naive" case (where rewriting to the `CString`/`Addr#` versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new `{-# NOINLINE stringUtf8 #-}` the `primMapListBounded` in `stringUtf8` only gets one argument and `primMapListBounded` needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of `primMapListBounded` and see if that fixes this.
I have confirmed that reducing the syntactic arity of primMapListBounded fixes this regression.
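For anyone following along, "reducing the syntactic arity" refers to a change of roughly the following shape (an illustrative toy, not the actual `primMapListBounded` definition):

```haskell
import Data.ByteString.Builder (Builder)

-- With two syntactic arguments, the INLINE unfolding only fires at call
-- sites that supply both; a partial application inside a NOINLINE wrapper
-- (like stringUtf8 applied only to the encoder) stays an out-of-line call.
{-# INLINE mapListTwoArgs #-}
mapListTwoArgs :: (a -> Builder) -> [a] -> Builder
mapListTwoArgs f xs = foldr (\x b -> f x <> b) mempty xs

-- With one syntactic argument, the unfolding can fire as soon as the
-- encoder alone is supplied, so the partial application still inlines
-- and gets specialised to the concrete encoder.
{-# INLINE mapListOneArg #-}
mapListOneArg :: (a -> Builder) -> [a] -> Builder
mapListOneArg f = foldr (\x b -> f x <> b) mempty
```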