merkletree
more efficient ReaderRoot algorithm
Benchmarks:
BenchmarkReader64_1k-4        100000       15137 ns/op       272 B/op       7 allocs/op
BenchmarkReader64_4MB-4           20    63088089 ns/op       656 B/op      19 allocs/op
BenchmarkReader4k_4MB-4          100    11590328 ns/op      4496 B/op      13 allocs/op
BenchmarkReaderTree64_1k-4    100000       19231 ns/op      3984 B/op      84 allocs/op
BenchmarkReaderTree64_4MB-4       20    78517088 ns/op  14681616 B/op  327686 allocs/op
BenchmarkReaderTree4k_4MB-4      100    12885906 ns/op   4362960 B/op    5125 allocs/op
As it turns out, allocating memory is indeed really fast! In the 64_4MB benchmark, the new algorithm eliminated 327667 allocations totaling ~15MB, but was only 15ms faster.
I ran a real-world test in which I downloaded a 475 MB file from Sia. Without the change, 1.5 GB was allocated for Merkle root calculations. After the change, that figure dropped to 4 KB.
For the benchmarks, are some of those from the old code, or are they all from the new code?
I think the best place to focus our time if we are worried about performance is:
- shrink the Reed-Solomon encoding + decoding to ~4 KB at a time (maybe play around and compare against benchmarks)
- put a lock on it so that only one proc can be doing it at a time (capping peak allocation)
- do 4 KB sub-shards in multiple goroutines for more parallelism (maxprocs should be fine; peak usage will still only be a few MB instead of hundreds of MB), while keeping the locks so that only one uploader or downloader is allowed to be using the RS stuff at a time
I also don't like how much complexity has been added to a function that used to be 4 lines of code, especially if the gains are very tiny. It's gone from being obviously correct to starting out with a non-terminating for loop, keeping a cache of slices for nodes, and using a combination algorithm that needs a filter.
I don't think it's worth the tradeoff.
BenchmarkReader is the new algo, BenchmarkReaderTree is the old algo.
It's true that it is more complex, but the old algo wasn't really just 4 lines of code; much of the work happens inside ReadAll (which uses a non-terminating for loop, btw). If you incorporate the code in t.Push, I'd say it's at least as complex as the new algo, if not more. That feels a bit unfair, since the t.Push code is reused elsewhere, but I'd still argue that the new algo is easier to understand in isolation. It's entirely self-contained (no side-effects) and there isn't any proof-related code woven in.
Mostly I just feel like it's embarrassing that a merkletree library would allocate 3x the amount of memory that it's hashing. Here's a real-world example of calculating the root of a 1.3GB file:
BenchmarkReader64_1GB-4       25189475379 ns/op        1096 B/op         30 allocs/op
BenchmarkReaderTree64_1GB-4   31696072018 ns/op  4620552440 B/op  103137296 allocs/op
The old algo allocates 4 million times as much memory and takes 7 seconds longer. Is 7 seconds a big deal when you're transferring 1GB over the network? Not right now, but on a gigabit connection you can transfer 1GB in 8 seconds.
I agree that where Sia is concerned, optimizing Merkle roots isn't the top priority. Sia already feels quite responsive except for a few known actions: uploading/downloading, unlocking the wallet, and shutting down. The biggest (performance) issue right now is that it hogs too much RAM, to the point of being killed by the OS. I'll work on implementing your suggestions there soon.
7 out of 31 seconds is not a lot until it's the lowest-hanging fruit. Optimizing the most correctness-critical code in the codebase is not what we should be focused on. I spent something like 40 programmer-hours verifying that the old code was correct; this code doesn't have anywhere near that much hardening.
I don't want to merge anything into this repo without giving it a similar amount of attention. Embarrassing memory management or not, it's correct crypto (at least, with high confidence) and that's not embarrassing.
TestReaderMatch is intended to catch any discrepancies between the old algo and the new algo. That way, we can tie the correctness of the new algo to the well-established correctness of the old algo. It's not exhaustive, but it gives fairly high confidence.
I also fuzzed the new algo. Apparently the old ReaderRoot was never fuzzed, because fuzzing quickly turned up a bug: if the segment size is zero, ReaderRoot hangs. I fixed this behavior and added a regression test.
Since this isn't our top priority, I'm fine with putting it on hold for a while.
I guess we did not fuzz ReaderRoot. Tying together the correctness of the old testing and the new testing is a good first step.