
Revisit bucket values for STORAGE_TIME histogram

Open LizardWizzard opened this issue 3 years ago • 2 comments

I decided to run pgbench init with scale 600 against the staging environment to check the durations of typical storage operations.

On the graphs I see that layer flush sometimes takes longer than the highest bucket boundary. We use the default buckets from the Prometheus library: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0
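To make the concern concrete, here is a minimal sketch of how observations are assigned to buckets. It is not the pageserver code and uses only the standard library. The flush durations and the extra 30.0 and 60.0 boundaries are made up for illustration; `bucket_counts` is a hypothetical helper.

```python
from bisect import bisect_left

# Default bucket upper bounds (seconds) quoted above.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def bucket_counts(observations, bounds):
    """Return per-bucket (non-cumulative) counts; the last slot is the +Inf overflow."""
    counts = [0] * (len(bounds) + 1)
    for value in observations:
        # A bucket holds values less than or equal to its upper bound,
        # so bisect_left picks the first bound >= value.
        counts[bisect_left(bounds, value)] += 1
    return counts


# Hypothetical flush durations in seconds, invented for this example.
flush_seconds = [0.3, 4.2, 12.0, 27.5]

# With the default bounds, everything slower than 10 s falls into +Inf.
default_counts = bucket_counts(flush_seconds, DEFAULT_BUCKETS)

# Assumed extra boundaries (30 s, 60 s), chosen only to show the effect.
extended_counts = bucket_counts(flush_seconds, DEFAULT_BUCKETS + [30.0, 60.0])

print("overflow with default buckets:", default_counts[-1])
print("overflow with extended buckets:", extended_counts[-1])
```

With the default bounds, the two slow flushes are indistinguishable: both are counted only in the overflow bucket, so any quantile estimate above 10 s is a guess. With larger boundaries they land in a finite bucket.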

The query I used to check the values.

Maybe it is ok to have some skew in these values, and we need to make it faster instead :) Is it normal that flush can take more than 10 seconds? @hlinnaka

LizardWizzard avatar Jul 21 '22 13:07 LizardWizzard

This is the "layer flush" metric? Yeah, it can take a long time if the I/O subsystem is busy. If we see that regularly in production, though, it's probably time to scale out the pageservers.

hlinnaka avatar Jul 22 '22 09:07 hlinnaka

We discussed this in a quick chat. The follow-up step is to check what actually takes the time; maybe layer flush waits on some locks, or on something else.

LizardWizzard avatar Jul 22 '22 10:07 LizardWizzard

@LizardWizzard we now have STORAGE_TIME both globally and per timeline, is this still relevant?

shanyp avatar Jul 19 '23 07:07 shanyp

We now have larger buckets too, so this is no longer relevant.

LizardWizzard avatar Jul 21 '23 13:07 LizardWizzard