Epic: implement disk size tracking for tenants and timelines

Open stepashka opened this issue 2 years ago • 6 comments

we're already limiting and reporting the logical disk size per timeline ('branch')

we need to start tracking this number per tenant to decide how to define and implement the limits. we also need to understand the real costs attached to the physical disk space for a tenant

In general, size tracking in the presence of branches is an open question: branches can share history, so it is not obvious how to present this information. To start with, let's gather the physical size in a simple way, per tenant/timeline directory: the physical size is simply the sum of the sizes of all files located in that directory.

Implementation steps:

  • [x] make timeline_detail endpoint return sum of all file sizes inside timeline directory
  • [x] add a metric with the same value, it can be updated from gc thread for example
  • [x] add tenant physical size to tenant status endpoint
  • [ ] add corresponding metrics
  • [ ] #1218

Timeline size can be cached in a repository map or in tenant_mgr inside the tenant's .local_timelines, to avoid computing the value twice (once for metrics and once for HTTP queries). Tenant physical size is the sum of all timeline sizes plus the size of the tenant config file, so it can be calculated on demand.
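A minimal sketch of the "sum of everything in the directory" approach, in Python for illustration only (the pageserver itself is not written in Python). The directory layout `<tenant_dir>/timelines/<timeline_id>/`, the config file name `config`, and both function names are assumptions made for this example, not taken from the actual code:

```python
import os
from pathlib import Path


def timeline_physical_size(timeline_dir: Path) -> int:
    """Sum the sizes of all files under one timeline directory (recursively)."""
    total = 0
    for root, _dirs, files in os.walk(timeline_dir):
        for name in files:
            total += (Path(root) / name).stat().st_size
    return total


def tenant_physical_size(tenant_dir: Path, config_name: str = "config") -> int:
    """Sum of all timeline sizes plus the tenant config file size.

    Assumed (hypothetical) layout: <tenant_dir>/timelines/<timeline_id>/...
    and a config file named `config_name` directly under <tenant_dir>.
    """
    total = 0
    timelines_dir = tenant_dir / "timelines"
    if timelines_dir.is_dir():
        for entry in timelines_dir.iterdir():
            if entry.is_dir():
                total += timeline_physical_size(entry)
    config = tenant_dir / config_name
    if config.is_file():
        total += config.stat().st_size
    return total
```

Because the tenant value is derived from the per-timeline values, caching only the timeline sizes is enough to serve both the metric and the HTTP endpoint.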

stepashka avatar Jun 07 '22 14:06 stepashka

we're already limiting disk size limit per timeline ('branch')

just curious - is there any design-doc/ or write-up, that you can point me to understand better?

phoenix24 avatar Jun 08 '22 10:06 phoenix24

we're already limiting disk size limit per timeline ('branch')

just curious - is there any design-doc/ or write-up, that you can point me to understand better?

@phoenix24 , is this helpful?

stepashka avatar Jul 05 '22 08:07 stepashka

@kelvich to provide proposal about physical sizes

kelvich avatar Jul 05 '22 14:07 kelvich

Implementation steps:

  • [x] make timeline_detail endpoint return sum of all file sizes inside timeline directory
  • [x] add a metric with the same value, it can be updated from gc thread for example
  • [x] add tenant physical size to tenant status endpoint
  • [ ] add corresponding metrics
  • [ ] #1218

Copied from the list of action items in the issue desc,

add corresponding metrics

I guess this refers to a metric for the tenant's physical size, which is the sum of its timelines' physical sizes. We do have a metric for a single timeline's physical size: pageserver_current_physical_size{tenant_id="...",timeline_id="..."} .... I think Prometheus should provide a way to create a metric as the sum of pageserver_current_physical_size over a specific tenant_id. This way, we don't need to implement a separate metric for the tenant's physical size.
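On the Prometheus side this kind of aggregation is normally written as a query, e.g. `sum by (tenant_id) (pageserver_current_physical_size)`. The sketch below does the same summation client-side in Python, assuming the metric appears in the usual text exposition format; the function name `tenant_sizes` is hypothetical, and the parser is deliberately simplified (it ignores escaped characters in label values and optional timestamps):

```python
import re
from collections import defaultdict
from typing import Dict

# Matches one sample line of the per-timeline metric named in this thread.
SAMPLE_RE = re.compile(
    r'^pageserver_current_physical_size\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def tenant_sizes(metrics_text: str) -> Dict[str, float]:
    """Sum per-timeline physical sizes into one total per tenant_id."""
    totals: Dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        tenant = labels.get("tenant_id")
        if tenant is not None:
            totals[tenant] += float(match.group("value"))
    return dict(totals)
```

Note that either form yields the sum of timeline sizes only; the tenant config file mentioned in the issue description would not be included.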

Regarding #1218, I'm not so sure about the status of this issue. We do have APIs to report logical size for pageserver and with #2126 and #2173 added, we should also have metrics/APIs to report physical size. Is it safe to mark that issue as completed?

About the next steps, a possible list of action items:

  • [ ] update the console code to display the tenant's/timeline's physical size
  • [ ] determine the tenant's physical size limit and add monitoring to notify users when they may exceed it

cc @kelvich @stepashka @LizardWizzard

aome510 avatar Jul 28 '22 18:07 aome510

@ololobus , i've heard that we have the new APIs now! can we expose this data in the console admin?

stepashka avatar Aug 11 '22 13:08 stepashka

i've heard that we have the new APIs now! can we expose this data in the console admin?

If you mean the timeline physical size, then yes, we can add it on the project page. Feel free to create an issue in the cloud repo and put it on me. I can either do it on my own, or wait for someone else's capacity. Also sounds like a good first issue to do

ololobus avatar Aug 11 '22 14:08 ololobus

I've added the item:

  • [ ] add timeline logical size prometheus metrics

I will create an issue for the console team for adding the size metric there

stepashka avatar Aug 12 '22 13:08 stepashka

@kelvich is going to write up a proposal for tracking some other size-related metrics. among the ideas were:

  • track the PITR-data size for each timeline
  • track the physical size excluding the PITR data for each timeline (how much space would be used if PITR is tuned to 0)

stepashka avatar Aug 12 '22 14:08 stepashka

We do have a metric for a single timeline's physical size: pageserver_current_physical_size{tenant_id="...",timeline_id="..."} .... I think Prometheus should provide a way to create a metric as the sum of pageserver_current_physical_size over a specific tenant_id. This way, we don't need to implement a separate metric for the tenant's physical size.

we can calculate the tenant size outside the pageserver IF the sizes of all of the tenant's existing timelines are reported by the pageserver (or not reported) simultaneously (in the metrics endpoint and the pageserver API). are they, @hlinnaka ? (otherwise the sum will not represent the actual tenant size at any specific time)
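To illustrate the concern, here is a small Python sketch of an external consumer that only trusts the sum when a single snapshot contains exactly the timelines it expects. The function name and the idea that the consumer knows the expected timeline set (for example from a timeline list call) are assumptions for this example:

```python
from typing import Dict, Iterable, Optional


def tenant_size_from_snapshot(
    timeline_sizes: Dict[str, int],
    expected_timelines: Iterable[str],
) -> Optional[int]:
    """Return the tenant size only if the snapshot covers every timeline.

    `timeline_sizes` maps timeline_id -> physical size, all taken from ONE
    scrape or API response. If a timeline is missing or unexpected, the sum
    would not describe the tenant at any single point in time, so return None.
    """
    if set(timeline_sizes) != set(expected_timelines):
        return None
    return sum(timeline_sizes.values())
```

This only detects missing or extra timelines; it cannot tell whether the individual values within one snapshot were measured at the same moment, which is the part that the pageserver would have to guarantee.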

stepashka avatar Aug 12 '22 14:08 stepashka

add corresponding metrics

What does this mean? Can we close this?

hlinnaka avatar Aug 22 '22 15:08 hlinnaka