wire: Optimize writes to bytes.Buffer
This special-cases writes to bytes.Buffer, which is the writer type that WriteMessageN always writes to. Special-casing this type enables several optimizations:
First, pulling temporary short buffers from the binary freelist can be skipped entirely; instead, the binary encoding of integers can be appended directly to the buffer's existing capacity. This avoids the synchronization cost of adding and removing buffers from the free list, and for applications which only ever write wire messages with WriteMessageN, the allocation and ongoing garbage collection scanning cost for these buffers can be avoided completely.
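As a rough illustration of the first point, the following is a minimal sketch and not the code in this PR. The helper names putUint32 and encodeUint32 are hypothetical, and the sketch assumes Go 1.21 or later for bytes.Buffer.AvailableBuffer. The fast path appends the little-endian encoding into the buffer's spare capacity, while the generic io.Writer path still needs a scratch buffer.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// putUint32 is a hypothetical helper (not the dcrd API) that writes v in
// little-endian order, special-casing *bytes.Buffer.
func putUint32(w io.Writer, v uint32) error {
	if buf, ok := w.(*bytes.Buffer); ok {
		// Fast path: append the encoding to the buffer's spare capacity.
		// AvailableBuffer returns an empty slice backed by that capacity,
		// intended to be appended to and passed to an immediate Write.
		b := binary.LittleEndian.AppendUint32(buf.AvailableBuffer(), v)
		_, err := buf.Write(b)
		return err
	}

	// Generic path: a scratch buffer is needed. In the existing code this
	// role is played by buffers taken from the free list.
	var scratch [4]byte
	binary.LittleEndian.PutUint32(scratch[:], v)
	_, err := w.Write(scratch[:])
	return err
}

// encodeUint32 is a small wrapper, for demonstration only, that returns the
// bytes produced by the fast path as a string.
func encodeUint32(v uint32) string {
	var buf bytes.Buffer
	if err := putUint32(&buf, v); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Printf("%x\n", encodeUint32(0x01020304)) // prints 04030201
}
```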
Second, special casing the buffer type in WriteVarString saves us from creating a temporary heap copy of a string, as the buffer's WriteString method can be used instead.
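The second point can be sketched in the same way. This is a simplified illustration under stated assumptions, not the actual WriteVarString: the hypothetical writeVarString below only handles strings whose length fits in a single-byte length prefix (shorter than 0xfd) and rejects anything longer.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// writeVarString is a hypothetical, simplified stand-in for WriteVarString.
// Assumption: only single-byte length prefixes are supported in this sketch.
func writeVarString(w io.Writer, s string) error {
	if len(s) >= 0xfd {
		return errors.New("sketch only handles strings shorter than 0xfd bytes")
	}

	if buf, ok := w.(*bytes.Buffer); ok {
		// Fast path: WriteString copies the string contents straight into
		// the buffer, so no temporary []byte copy of s is created.
		buf.WriteByte(byte(len(s)))
		buf.WriteString(s)
		return nil
	}

	// Generic path: io.Writer only accepts []byte, so the string must be
	// converted, which creates a temporary copy.
	if _, err := w.Write([]byte{byte(len(s))}); err != nil {
		return err
	}
	_, err := w.Write([]byte(s))
	return err
}

// encodeVarString is a demonstration wrapper that returns the encoded bytes
// produced by the fast path as a string.
func encodeVarString(s string) string {
	var buf bytes.Buffer
	if err := writeVarString(&buf, s); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", encodeVarString("dcrd"))
}
```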
Third, special casing the buffer allows WriteMessageN to calculate the serialize size and grow its buffer up front, so all remaining appends for writing blocks and transactions will not have to reallocate the buffer's backing allocation. This same optimization can be applied to other messages in the future.
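A minimal sketch of the third point follows. The sizer interface, the fakeTx type, and the growForMessage and reallocated helpers are all hypothetical names for illustration, and the 24-byte header size is an assumption made for this example. The idea is to grow the buffer once to the header size plus the serialize size, after which the subsequent writes fit in the existing capacity.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// headerSize is an assumed message header size used only for this sketch.
const headerSize = 24

// sizer is a hypothetical interface for messages that can report their
// serialize size before being encoded.
type sizer interface {
	SerializeSize() int
}

// fakeTx is a stand-in message type for demonstration purposes.
type fakeTx struct {
	payload []byte
}

func (t fakeTx) SerializeSize() int { return len(t.payload) }

// growForMessage grows the buffer once, up front, when the writer is a
// *bytes.Buffer and the message can report its serialize size.
func growForMessage(w io.Writer, msg any) {
	buf, ok := w.(*bytes.Buffer)
	if !ok {
		return
	}
	if s, ok := msg.(sizer); ok {
		buf.Grow(headerSize + s.SerializeSize())
	}
}

// reallocated reports whether the buffer's capacity changed while writing a
// header and payload after the up-front grow.
func reallocated(payloadLen int) bool {
	tx := fakeTx{payload: make([]byte, payloadLen)}

	var buf bytes.Buffer
	growForMessage(&buf, tx)
	before := buf.Cap()

	buf.Write(make([]byte, headerSize)) // placeholder header bytes
	buf.Write(tx.payload)

	return buf.Cap() != before
}

func main() {
	fmt.Println(reallocated(200)) // prints false
}
```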
Making this a draft for now. This only optimizes the common.go code, but these Put* methods are also used frequently directly in the BtcEncode implementations.
This is the relative improvement from current master branch, on the benchmark which uses WriteMessageN to serialize the wire encoding (with header) of the genesis block coinbase tx.
$ benchstat old.txt new.txt
goos: openbsd
goarch: amd64
pkg: github.com/decred/dcrd/wire
cpu: AMD Ryzen 7 5800X3D 8-Core Processor
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessageN-8   2.782µ ± 0%   1.633µ ± 0%  -41.32% (p=0.000 n=10)

                │  old.txt   │              new.txt              │
                │    B/op    │    B/op     vs base               │
WriteMessageN-8   592.0 ± 0%   384.0 ± 0%  -35.14% (p=0.000 n=10)

                │  old.txt   │              new.txt              │
                │ allocs/op  │ allocs/op   vs base               │
WriteMessageN-8   7.000 ± 0%   5.000 ± 0%  -28.57% (p=0.000 n=10)
I'm debating whether to remove the buffer grows to the tx/block serialize size from BtcEncode and instead only do this from WriteMessageN. The reason is that there are existing callers that already calculate the serialize size to create an appropriately sized buffer, and it makes no sense to calculate this twice.
Now rebased over #3584. Below is the performance improvement with both PRs, relative to master:
$ benchstat old.txt new.txt
goos: openbsd
goarch: amd64
pkg: github.com/decred/dcrd/wire
cpu: AMD Ryzen 7 5800X3D 8-Core Processor
                │   old.txt   │              new.txt               │
                │   sec/op    │   sec/op     vs base               │
WriteMessageN-8   2.784µ ± 2%   1.652µ ± 1%  -40.67% (p=0.000 n=10)

                │  old.txt   │              new.txt              │
                │    B/op    │    B/op     vs base               │
WriteMessageN-8   592.0 ± 0%   328.0 ± 0%  -44.59% (p=0.000 n=10)

                │  old.txt   │              new.txt              │
                │ allocs/op  │ allocs/op   vs base               │
WriteMessageN-8   7.000 ± 0%   3.000 ± 0%  -57.14% (p=0.000 n=10)