bytestring icon indicating copy to clipboard operation
bytestring copied to clipboard

Ideas for better bytestring builder

Open fumieval opened this issue 4 years ago • 17 comments

As https://github.com/haskell-perf/strict-bytestring-builders pointed out, current Data.ByteString.Builder tends to be slow especially when primitives are small.

Some alternatives have good benchmark results (cf https://www.reddit.com/r/haskell/comments/e6yg76/mason_faster_bytestring_builder/ ,https://github.com/fumieval/mason#performance, http://hackage.haskell.org/package/fast-builder); we may be able to take some tricks from those.

It also makes sense to expand the API so that

  • Can build strict ByteString directly
  • return the length after hPutBuilder

I think it's important that the internal API are exposed. At the moment it's difficult to send the contents of Builder over a socket due to BuildSignal being opaque.

fumieval avatar Dec 18 '19 08:12 fumieval

@fumieval I've seen your work and I've been meaning to reach out to you about it; it does look promising but I'd like to better understand its trade-offs; is it really strictly superior in all cases, or can you point out pathological cases where the current builder would be significantly (which may not necessarily be blockers) at a disadvantage?

Is it possible to switch to your implementation while being 100% API compatible with the old impl? (i.e. same type-sigs; no observable semantic differences )

hvr avatar Dec 19 '19 07:12 hvr

bytestring's Builder tends to be slow when working with small primitives (e.g. Word8). From my small benchmark:

bytestring                               mean 247.6 ns  ( +- 12.06 ns  )
bytestring-builder                       mean 188.7 ns  ( +- 34.65 ns  )
bytestring-builder/toStrict              mean 166.8 ns  ( +- 3.603 ns  )
fast-builder/lazy                        mean 67.43 ns  ( +- 20.60 ns  )
fast-builder/strict                      mean 41.29 ns  ( +- 2.150 ns  )
mason/lazy                               mean 378.2 ns  ( +- 5.022 ns  )
mason/strict                             mean 30.44 ns  ( +- 916.5 ps  )

AFAIK both fast-builder and mason have 100% compatible API. However, fast-builder has a [caveat] (http://hackage.haskell.org/package/fast-builder-0.1.2.0/docs/Data-ByteString-FastBuilder.html#v:toLazyByteString) about the performance of toLazyByteString and mason's toLazyByteString, implemented in a similar way, is still poor in some cases.

fumieval avatar Dec 21 '19 03:12 fumieval

Hrm; about the performance penalty of toLazyByteString; do you know why it regresses? which cases specifically suffer most? (i.e. are those pathological, or are those cases actually relevant in real-world cases?)

hvr avatar Jan 12 '20 09:01 hvr

Because toLazyByteString forks another thread to run the builder action, and the caller thread retrieves the chunks it yields via MVar. fast-builder avoids this problem by delaying forking until it reaches 32K bytes.

fumieval avatar Jan 12 '20 11:01 fumieval

@fumieval so are you saying that the difference/penalty is mostly noticeable for small-ish serializations (i.e. in the <= 32KiB range)?

hvr avatar Jan 12 '20 12:01 hvr

Yes, currently it has a big (constant?) overhead. If the whole process takes more than 100μs, I think mason gets faster than others (as shown in https://github.com/fumieval/mason#performance) in most cases.

fumieval avatar Jan 13 '20 00:01 fumieval

@fumieval what if the caller could provide a size-hint of the expected output buffer size (i.e. an integer argument)? would it then be possible to mitigate the constant-overhead and get on par with the current builders performance?

hvr avatar Jan 13 '20 07:01 hvr

I think the problem really is the overhead of thread creation, not about the buffer size. fast-builder implements a rather complicated trick to avoid this issue: http://hackage.haskell.org/package/fast-builder-0.1.2.0/docs/src/Data.ByteString.FastBuilder.Internal.html#toLazyByteStringWith

fumieval avatar Jan 16 '20 11:01 fumieval

I wasn't able to replace Builder with Mason. https://github.com/fumieval/mason/issues/2

Looks like fast builder doesn't provide options to wrap lazy string into builder.

yaitskov avatar May 20 '20 06:05 yaitskov

A new approach based on delimited continuations seems promising: https://github.com/ghc-proposals/ghc-proposals/pull/313#issuecomment-648114538

fumieval avatar Jun 23 '20 22:06 fumieval

@fumieval haven't looked at a lot of your work (which is rather broad btw, thank you), but have you considered something along the lines of builder or small-bytearray-builder. These could be re-worked to use Ptr rather easily, and I'm interested to see how they stack up.

Additionally, @andrewthad has done a lot of thinking about builders and may have some interesting input here.

chessai avatar Jun 25 '20 05:06 chessai

In bytebuild, I use GC-managed ByteArray# for everything instead of Addr#. Also, I don't use CPS. In bytes-builder-shootout, which uses backpack for perf benchmarking, I have observed that for encoding (recursive) tree-like structures, bytebuild outperforms bytestring, but fast-builder outperforms them both.

andrewthad avatar Jun 25 '20 12:06 andrewthad

@chessai small-bytearray-builder's design looks very interesting, but both builder and small-bytearray-builder produce fully strict byte arrays only, while we need streaming behaviour for toLazyByteString and hPutBuilder.

@andrewthad I didn't know about bytes-builder-shootout, that looks useful! If mason works as intended (primitives are well fused and inlined), producing strict ByteString should be faster than fast-builder. I'll try adding mason to bytes-builder-shootout

fumieval avatar Jun 25 '20 12:06 fumieval

I added mason to bytes-builder-shootout. the latest release of mason wasn't very good in this benchmark:

treeToHex-2000/small-bytearray-builder   mean 43.27 μs  ( +- 167.3 ns  )
treeToHex-2000/fast-builder              mean 34.22 μs  ( +- 95.85 ns  )
treeToHex-2000/bytestring                mean 60.76 μs  ( +- 3.829 μs  )
treeToHex-2000/mason                     mean 60.34 μs  ( +- 182.0 ns  )
treeToHex-9000/small-bytearray-builder   mean 190.1 μs  ( +- 761.5 ns  )
treeToHex-9000/bytestring                mean 289.9 μs  ( +- 1.700 μs  )
treeToHex-9000/mason                     mean 267.9 μs  ( +- 1.023 μs  )
short-text-tree-1000/small-bytearray-builder mean 27.30 μs  ( +- 142.7 ns  )
short-text-tree-1000/fast-builder        mean 37.36 μs  ( +- 639.8 ns  )
short-text-tree-1000/bytestring          mean 37.17 μs  ( +- 1.763 μs  )
short-text-tree-1000/mason               mean 29.42 μs  ( +- 422.5 ns  )
byte-tree-2000/small-bytearray-builder   mean 30.64 μs  ( +- 131.9 ns  )
byte-tree-2000/fast-builder              mean 28.82 μs  ( +- 114.9 ns  )
byte-tree-2000/bytestring                mean 53.86 μs  ( +- 314.2 ns  )
byte-tree-2000/mason                     mean 40.45 μs  ( +- 1.252 μs  )

I investigated the code and it turns out that the lack of worker-wrapper transformation (which isn't necessary in fast-builder and small-bytearray-builder) is slowing it down. Flattening the internal structure drastically improved the speed.

treeToHex-2000/mason                     mean 37.96 μs  ( +- 306.3 ns  )
treeToHex-9000/mason                     mean 167.6 μs  ( +- 719.6 ns  )
short-text-tree-1000/mason               mean 27.48 μs  ( +- 151.2 ns  )
byte-tree-2000/mason                     mean 32.12 μs  ( +- 170.2 ns  )

fumieval avatar Jun 28 '20 06:06 fumieval

@fumieval is there anything actionable here?

Bodigrim avatar Mar 16 '22 20:03 Bodigrim

Unfortunately, mason's performance regressed a lot since GHC 9.2, and I didn't manage to figure out why. I decided to put the project on hold until delimited continuation primops arrives.

fumieval avatar Mar 17 '22 03:03 fumieval

@fumieval Delimited Continuations has merged https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7942

Any new progress?

BebeSparkelSparkel avatar Jan 15 '24 19:01 BebeSparkelSparkel