neon
Epic: implement disk size tracking for tenants and timelines
We already limit and report the logical disk size per timeline ('branch').
We need to start tracking this number per tenant so we can decide how to define and implement the limits. We also need to understand the real costs attached to the physical disk space used by a tenant.
In general, size tracking in the presence of branches is an open question: branches can share history, and it is not trivial how to present this info. To start with, let's gather the physical size in a simple way, per tenant/timeline directory: the physical size is simply the sum of the sizes of all files located in the tenant/timeline directory.
Implementation steps:
- [x] make timeline_detail endpoint return sum of all file sizes inside timeline directory
- [x] add a metric with the same value, it can be updated from gc thread for example
- [x] add tenant physical size to tenant status endpoint
- [ ] add corresponding metrics
- [ ] #1218
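To make the "sum of everything in the directory" definition concrete, here is a minimal sketch in Python (not the pageserver's actual code). The function name `directory_physical_size` is hypothetical, and the choice to count symlinks by their own size and to skip files that vanish mid-scan is an assumption, not something the issue specifies.

```python
import os


def directory_physical_size(path: str) -> int:
    """Sum the sizes (in bytes) of all non-directory entries under `path`.

    Hypothetical helper: illustrates the "physical size = sum of file
    sizes in the timeline directory" definition, nothing more.
    """
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                # lstat does not follow symlinks, so a symlink contributes
                # only its own size rather than the size of its target.
                total += os.lstat(file_path).st_size
            except FileNotFoundError:
                # Assumption: a file may disappear between listing and stat
                # (for example, if something deletes it concurrently).
                continue
    return total
```

A nonexistent directory yields 0 here, because `os.walk` ignores listing errors by default.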
Timeline size can be cached in a repository map or in tenant_mgr inside the tenant's .local_timelines, to avoid computing the value twice (once for metrics and once to answer HTTP queries). Tenant physical size is the sum of all timeline sizes plus the size of the tenant config file, so it can be calculated on demand.
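A rough sketch of that caching idea, in Python for brevity. `TenantSizeCache` and its methods are hypothetical names, not anything in tenant_mgr; the point is only that timeline sizes are stored once and the tenant total is derived from them when asked.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TenantSizeCache:
    """Hypothetical per-tenant cache of timeline physical sizes, in bytes."""

    config_file_size: int = 0
    timeline_sizes: Dict[str, int] = field(default_factory=dict)

    def set_timeline_size(self, timeline_id: str, size: int) -> None:
        # Called by whoever measures the directory (e.g. a background task),
        # so metrics and HTTP handlers can both read the cached value.
        self.timeline_sizes[timeline_id] = size

    def remove_timeline(self, timeline_id: str) -> None:
        # Forget a deleted timeline; a missing id is not an error.
        self.timeline_sizes.pop(timeline_id, None)

    def timeline_size(self, timeline_id: str) -> int:
        # Unknown timelines report 0 in this sketch.
        return self.timeline_sizes.get(timeline_id, 0)

    def tenant_physical_size(self) -> int:
        # Computed on demand: all timeline sizes plus the tenant config file.
        return sum(self.timeline_sizes.values()) + self.config_file_size
```

Deriving the tenant total on each call avoids keeping a second counter in sync with the per-timeline values.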
we're already limiting disk size limit per timeline ('branch')
just curious - is there any design-doc/ or write-up, that you can point me to understand better?
@phoenix24 , is this helpful?
@kelvich to provide proposal about physical sizes
Implementation steps:
- [x] make timeline_detail endpoint return sum of all file sizes inside timeline directory
- [x] add a metric with the same value, it can be updated from gc thread for example
- [x] add tenant physical size to tenant status endpoint
- [ ] add corresponding metrics
- [ ] #1218
Copied from the list of action items in the issue desc,
add corresponding metrics
I guess this refers to a metric for the tenant's physical size, which is the sum of its timelines' physical sizes. We do have a metric for a single timeline's physical size: `pageserver_current_physical_size{tenant_id="...",timeline_id="..."} ...`. I think Prometheus should provide a way to create a metric as a summation of `pageserver_current_physical_size` for a specific `tenant_id`. This way, we don't need to implement a separate metric for the tenant's physical size.
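On the Prometheus side this is an ordinary aggregation, e.g. `sum by (tenant_id) (pageserver_current_physical_size)`. To show what that sum amounts to, here is a small Python sketch that does the same thing over the text of a metrics scrape. The function name `tenant_physical_sizes` is made up, and the parsing is deliberately simplistic (it assumes label values contain no `}` or escaped quotes).

```python
import re
from collections import defaultdict
from typing import Dict

# Matches sample lines such as:
#   pageserver_current_physical_size{tenant_id="a",timeline_id="b"} 123
_SAMPLE = re.compile(
    r'^pageserver_current_physical_size\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def tenant_physical_sizes(metrics_text: str) -> Dict[str, float]:
    """Sum the per-timeline physical size metric per tenant_id."""
    totals: Dict[str, float] = defaultdict(float)
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if match is None:
            # Comments (# HELP / # TYPE) and other metrics are skipped.
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        tenant_id = labels.get("tenant_id")
        if tenant_id is None:
            continue
        totals[tenant_id] += float(match.group("value"))
    return dict(totals)
```

The result is only as consistent as the scrape it was computed from: all timelines of a tenant need to be present in the same scrape for the sum to mean anything.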
Regarding #1218, I'm not so sure about the status of this issue. We do have APIs to report logical size for pageserver and with #2126 and #2173 added, we should also have metrics/APIs to report physical size. Is it safe to mark that issue as completed?
About the next steps, a possible list of action items:
- [ ] update the console code to display the tenant/timeline's physical size
- [ ] determine the tenant's physical size limit and add monitoring to notify users about potential limit excess
cc @kelvich @stepashka @LizardWizzard
@ololobus , i've heard that we have the new APIs now! can we expose this data in the console admin?
i've heard that we have the new APIs now! can we expose this data in the console admin?
If you mean the timeline physical size, then yes, we can add it on the project page. Feel free to create an issue in the `cloud` repo and put it on me. I can either do it on my own or wait for someone else's capacity. It also sounds like a good first issue.
I've added the item:
- [ ] add timeline logical size prometheus metrics
I will create an issue for the console team for adding the size metric there
@kelvich is going to write up a proposal for tracking some other size-related metrics. Among the ideas were:
- track the PITR-data size for each timeline
- track the physical size excluding the PITR data for each timeline (how much space would be used if PITR is tuned to 0)
We do have a metric for single timeline's physical size: `pageserver_current_physical_size{tenant_id="...",timeline_id="..."} ...`. I think prometheus should provide a way to create a metric as a summation of `pageserver_current_physical_size` with a specific `tenant_id`. This way, we don't need to implement a separate metric for tenant's physical size.
We can calculate the tenant size outside the pageserver only if the sizes of all existing timelines of the tenant are reported by the pageserver (or not reported) simultaneously, both in the metrics endpoint and in the pageserver API. Are they, @hlinnaka? Otherwise the sum will not represent the actual tenant size at any specific point in time.
add corresponding metrics
What does this mean? Can we close this?