neon
Revisit bucket values for STORAGE_TIME histogram
I decided to run pgbench init with scale 600 against the staging environment and check the durations of typical storage operations.
On the graphs I see that layer flush sometimes takes longer than the highest bucket boundary. We follow the default buckets from the Prometheus library: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0.
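As an illustration (a self-contained sketch, not Neon's actual code), here is how an observation is assigned to one of these buckets. It shows why a flush longer than 10 seconds lands in the implicit open-ended +Inf bucket, where its actual duration can no longer be distinguished from any other slow flush:

```python
import math

# Default Prometheus histogram bucket upper bounds, in seconds.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

def bucket_for(value, buckets=DEFAULT_BUCKETS):
    """Return the upper bound of the first bucket containing `value`,
    or +inf if it exceeds every explicit bound (the implicit +Inf bucket)."""
    for bound in buckets:
        if value <= bound:
            return bound
    return math.inf

print(bucket_for(3.2))   # a 3.2 s flush lands in the 5.0 s bucket
print(bucket_for(14.0))  # a 14 s flush overflows into the +Inf bucket
```

Once observations pile up in +Inf, quantile estimates over the histogram (e.g. via `histogram_quantile`) are clamped to the highest finite bound, which is what makes the skew visible on the graphs.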
The query I used to check the values.
Maybe it is OK to have some skew in these values, and we should make the flush faster instead :) Is it normal that a flush can take more than 10 seconds? @hlinnaka
This is the "layer flush" metric? Yeah, it can take a long time if the I/O subsystem is busy. If we see that regularly in production, though, it's probably time to scale out the pageservers.
In a quick chat we discussed this; the follow-up step is to check what actually takes the time. Maybe layer flush waits on some locks, or something else.
@LizardWizzard we now have STORAGE_TIME both globally and per timeline; is this still relevant?
We now have larger buckets too, so this is no longer relevant.
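For reference, wider bucket sets like the ones adopted here are typically generated with an exponential helper rather than listed by hand. A hedged sketch of that idea (the function name and parameters mirror Prometheus client conventions but are illustrative, not Neon's actual configuration):

```python
def exponential_buckets(start, factor, count):
    """Produce `count` histogram upper bounds starting at `start`,
    each `factor` times the previous one, in the style of the
    Prometheus clients' exponential bucket helpers."""
    assert start > 0 and factor > 1 and count >= 1
    buckets = []
    bound = start
    for _ in range(count):
        buckets.append(bound)
        bound *= factor
    return buckets

# For example, 11 buckets from 0.1 s doubling up past 100 s,
# which comfortably covers flushes slower than 10 s:
print(exponential_buckets(0.1, 2.0, 11))
```

The trade-off is resolution versus range: a larger `factor` covers slow outliers with few buckets but makes each bucket coarser.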