Revisit bucket values for STORAGE_TIME histogram

Open LizardWizzard opened this issue 3 years ago • 2 comments

I decided to run pgbench init with scale 600 against the staging environment and check the durations of typical storage operations.

On the graphs I see that layer flush sometimes takes longer than the highest bucket boundary. We are using the default buckets from the Prometheus library: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0 (seconds).
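To illustrate why this matters, here is a minimal sketch (plain Python, not the pageserver code) of how a histogram with these bounds classifies an observation. Prometheus buckets are "less than or equal" bounds, so anything above 10 seconds is only counted in the implicit `+Inf` bucket. The function name and the sample durations are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left

# Default bucket upper bounds quoted above, in seconds.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def bucket_label(duration_s, bounds=DEFAULT_BUCKETS):
    """Return the upper bound ("le") of the smallest bucket containing duration_s.

    Values above the last bound fall only into the implicit "+Inf" bucket.
    """
    # bisect_left gives the first bound >= duration_s, matching "le" semantics.
    i = bisect_left(bounds, duration_s)
    return str(bounds[i]) if i < len(bounds) else "+Inf"


if __name__ == "__main__":
    # Hypothetical flush durations in seconds, not measured values.
    for d in [0.8, 3.2, 12.0, 45.0]:
        print(f"{d:>5}s -> le={bucket_label(d)}")
```

With these bounds a 12-second flush and a 45-second flush end up in the same `+Inf` bucket, so the histogram cannot tell them apart and quantile estimates are capped at the highest finite boundary.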

The query I used to check the values.

Maybe it is ok to have some skew in these values, and we need to make it faster instead :) Is it normal that flush can take more than 10 seconds? @hlinnaka

Jul 21 '22 13:07 LizardWizzard

This is the "layer flush" metric? Yeah, it can take a long time if the I/O subsystem is busy. If we see that regularly in production, though, it's probably time to scale out the pageservers.

Jul 22 '22 09:07 hlinnaka

We discussed this in a quick chat. The follow-up step is to check what actually takes the time: maybe layer flush waits on some locks, or on something else.

Jul 22 '22 10:07 LizardWizzard

@LizardWizzard we now have STORAGE_TIME both globally and per timeline. Is this still relevant?

Jul 19 '23 07:07 shanyp

We now have larger buckets too, so this is no longer relevant.

Jul 21 '23 13:07 LizardWizzard
