Enabling trace_api_plugin slows down replays
With trace_api_plugin enabled, replay speed drops by two orders of magnitude, and CPU and I/O utilization drop to near zero.
Same behavior is seen on 2.2-rc1, 2.1, and 2.0.6.
Hardware info:
Running in Docker using Ubuntu 18.04; the host system is a Hetzner AX41-NVME running Debian 10.
Same here. We are using Ubuntu 18.04; with the trace API enabled, cleos get info takes about 10 seconds to return a response.
1 [||||||||||||||||||||||||100.0%] Tasks: 33, 25 thr; 4 running
2 [||||||||||||||||||||||||100.0%] Load average: 3.06 3.17 3.16
3 [||||||||||||||| 44.4%] Uptime: 12 days, 09:11:17
4 [||||||||||||||||||||| 65.6%] Mem[||||||||||||||||||||488M/7.79G]
I've tracked it down to https://github.com/EOSIO/eos/blob/master/plugins/trace_api_plugin/include/eosio/trace_api/store_provider.hpp#L71
Calling sync after every write during replays is what slows it all down to a crawl
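For anyone skimming, the shape of that code path is roughly the following. This is a simplified sketch of the append-then-sync pattern, not the actual store_provider.hpp implementation (the class and function names are made up, and the fc::cfile API is approximated with plain POSIX calls); it only shows why an fsync after every append bounds replay speed by disk flush latency.

```cpp
// Simplified sketch of the append-then-sync pattern (not the actual
// store_provider.hpp code; names are made up and fc::cfile is approximated
// with plain POSIX calls). It shows why an fsync() after every append
// bounds replay speed by disk flush latency.
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

class append_store {
public:
   explicit append_store(const std::string& path)
      : fd_(::open(path.c_str(), O_CREAT | O_WRONLY | O_APPEND, 0644)) {
      if (fd_ < 0) throw std::runtime_error("open failed");
   }
   ~append_store() { if (fd_ >= 0) ::close(fd_); }

   // Append one serialized trace entry and immediately flush it to stable
   // storage. During replay this runs for every block, so throughput is
   // capped by how many fsyncs per second the disk can absorb.
   void append_and_sync(const void* data, size_t len) {
      if (::write(fd_, data, len) != static_cast<ssize_t>(len))
         throw std::runtime_error("write failed");
      if (::fsync(fd_) != 0)          // <- the call that dominates replay time
         throw std::runtime_error("fsync failed");
   }

private:
   int fd_;
};

int main() {
   append_store store("trace_slice.log");           // hypothetical file name
   const char entry[] = "serialized block trace";   // dummy payload
   store.append_and_sync(entry, sizeof(entry));
}
```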
I wonder if Docker is causing that to be slow. Maybe you could test IOPS inside your container and see what it looks like.
The sync would make sense as the cause of the slowdown, and Docker could be adding to that.
The non-container trace APIs we are running today write to ZFS volumes, so I'll bet we didn't experience this issue before, since ZFS immediately acknowledges the write back to the nodeos and then queues its own sync in the background.
since ZFS immediately acknowledges the write back to the nodeos and then queues its own sync in the background
This sounds frightening. A filesystem ignoring sync requests sounds broken. Is that really how ZFS operates?
It can be configured to operate that way, yeah. You gain massive performance benefits since the filesystem uses RAM as a buffer for all writes and then queues them out to actually be written in a more intelligent manner (decent explainer here in the 2nd section about writes). You lose some level of fault tolerance if you go for the maximum-performance route. This isn't a huge issue in our experience with nodeos though, since if nodeos gets interrupted due to a power outage or hardware failure, there's a high likelihood that the state is trashed anyway, requiring a replay of some kind.
From what I can tell, no other part of nodeos calls fsync after every write. Can this really be expected to be performant without a filesystem that can be configured to lie to applications about it?
In replay with trace_api_plugin enabled it does 2 syncs per block. Testing with the script from this article, my NVMe disk gets ~700 syncs/s (same numbers on bare metal and in Docker, so that does not seem to be the limiting factor).
That puts the upper bound on blocks that can be processed with the plugin enabled at ~350 per second. A minimal way to reproduce the syncs/s measurement is sketched after the numbers below.
Debug build of v2.1.0:
replayed 28471 blocks in 493 seconds, 17.35102550052686965 ms/block
Debug build of v2.1.0 with .sync() call commented out:
replayed 465896 blocks in 305 seconds, 0.65595844557250027 ms/block
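In case it helps others reproduce the syncs/s figure without hunting down the script from the article, here is a minimal C++ equivalent (my own sketch, not that script): it appends a small record and fsyncs in a loop for five seconds, then prints the rate. Dividing the rate by the 2 syncs per block gives the ~350 blocks/s ceiling above.

```cpp
// Minimal fsync-throughput probe (a sketch, not the script referenced above).
// Appends a small record and fsyncs in a loop, then prints syncs/second and
// the implied blocks/second at 2 syncs per block.
#include <chrono>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
   int fd = ::open("sync_probe.dat", O_CREAT | O_WRONLY | O_APPEND, 0644);
   if (fd < 0) { std::perror("open"); return 1; }

   const char record[64] = {};   // small dummy payload
   const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);

   long syncs = 0;
   while (std::chrono::steady_clock::now() < deadline) {
      if (::write(fd, record, sizeof(record)) != (ssize_t)sizeof(record)) { std::perror("write"); break; }
      if (::fsync(fd) != 0) { std::perror("fsync"); break; }
      ++syncs;
   }
   ::close(fd);

   std::printf("%ld syncs in 5s -> %.1f syncs/s -> ~%.1f blocks/s at 2 syncs/block\n",
               syncs, syncs / 5.0, syncs / 10.0);
   return 0;
}
```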
Agreed, it does seem rather surprising that trace_api_plugin tries to ensure durability here whereas neither the block log nor the ship log do, I believe. But this trace_api_plugin durability seems to be very deliberate: it smartly syncs the log before syncing the index (though I suspect it's still not safe in pessimistic scenarios, because it appears the metadata_log_entry could span multiple disk sectors). The deliberateness does make me wonder if the behavior was very much part of the intended design. I would encourage a closer inspection of the design before ripping it out. Even a seemingly safe optimization such as skipping fsyncing during catch-up may be at odds with any intended durability behavior.
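For readers who haven't looked at the file, "sync the log before the index" is the classic crash-consistency ordering sketched below. This is an illustration of the general technique with made-up names, not the plugin's code: if a crash lands between the two fsyncs, the index simply has no entry pointing at the possibly torn record. Note the caveat above still applies, since an individual record that spans multiple sectors can itself tear.

```cpp
// Illustration of the "sync the data log before the index" ordering
// (general technique with made-up names, not the plugin's implementation).
#include <cstdint>
#include <fcntl.h>
#include <stdexcept>
#include <unistd.h>

static void write_and_sync(int fd, const void* data, size_t len) {
   if (::write(fd, data, len) != static_cast<ssize_t>(len))
      throw std::runtime_error("write failed");
   if (::fsync(fd) != 0)
      throw std::runtime_error("fsync failed");
}

// Two fsyncs per appended entry: the data must be durable before the index
// points at it, otherwise recovery could follow the index into garbage.
void append_entry(int log_fd, int index_fd, const void* entry, size_t len) {
   const off_t offset = ::lseek(log_fd, 0, SEEK_END);   // offset the index will record
   write_and_sync(log_fd, entry, len);                  // 1. data log first
   const uint64_t pos = static_cast<uint64_t>(offset);
   write_and_sync(index_fd, &pos, sizeof(pos));         // 2. then the index
}

int main() {
   const int log_fd   = ::open("trace.log", O_CREAT | O_RDWR, 0644);
   const int index_fd = ::open("trace.index", O_CREAT | O_WRONLY | O_APPEND, 0644);
   if (log_fd < 0 || index_fd < 0) return 1;
   const char entry[] = "block trace payload";
   append_entry(log_fd, index_fd, entry, sizeof(entry));
   ::close(log_fd); ::close(index_fd);
}
```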
trace_api_plugin supports multi-threaded reads from the logs. The sync, if I remember correctly, was to provide safe reading from other threads.
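If that is the motivation, the usual shape of that coordination is something like the sketch below: the writer only publishes the new end offset after the appended bytes are safely written, so reader threads never look past it. This is my own illustration of the general pattern with made-up names, not the plugin's code.

```cpp
// Illustration of writer/reader coordination over an append-only log
// (general pattern with made-up names, not the trace_api_plugin code):
// the writer flushes/syncs the appended bytes and only then publishes the
// new end offset; reader threads never read past the published offset.
#include <atomic>
#include <cstdint>
#include <cstdio>

class shared_log_cursor {
public:
   // Writer thread: call after the bytes up to new_end are safely written.
   void publish(uint64_t new_end) { end_.store(new_end, std::memory_order_release); }

   // Reader threads: everything below this offset is complete and readable.
   uint64_t readable_end() const { return end_.load(std::memory_order_acquire); }

private:
   std::atomic<uint64_t> end_{0};
};

int main() {
   shared_log_cursor cursor;
   cursor.publish(128);   // writer: after syncing a 128-byte append at offset 0
   std::printf("readers may consume up to offset %llu\n",
               static_cast<unsigned long long>(cursor.readable_end()));
}
```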