# Epic: improved eviction
This change will improve the pageserver's ability to accommodate larger tenants, especially those with append-heavy workloads.
### Tasks
- [ ] https://github.com/neondatabase/neon/issues/4745
- [ ] https://github.com/neondatabase/neon/pull/6132
- [ ] https://github.com/neondatabase/neon/issues/5304
- [x] Enable in prod on one region https://github.com/neondatabase/aws/pull/930
- [ ] #6490
- [ ] #6491
- [ ] 10min hang together with `layer_delete` during disk usage based eviction
- [ ] #6634
- [ ] #6598
- [ ] https://github.com/neondatabase/aws/pull/976
- [ ] Time-based eviction is not needed when operating with a disk usage target (e.g. target usage 70%, critical usage 85%). We only need space-driven eviction, but perhaps with two modes: one where it trims ruthlessly, and another where it aims to avoid trimming things which are more recent than the latest compaction.
- [ ] https://github.com/neondatabase/neon/issues/6835
- [ ] For thrashing alerting, we will need to define a new measurement based on the time between evicting a layer and re-downloading it. A threshold of approximately 20-60 minutes would be used, based on `disk_size / disk_bandwidth`, where a totally streaming workload would be expected to fit in cache over that timescale (John's heuristic for thrashing; see the worked example after this list).
- [ ] fast redownloads could log the access history or at least highlight "fastness" on the log line for easier finding
- [ ] consider again not requiring read access to Timeline::layers for evictable layers query
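As a rough worked example of the `disk_size / disk_bandwidth` heuristic from the thrashing-alerting item above (the numbers are illustrative assumptions, not measurements): with a 2 TiB layer disk and an effective redownload bandwidth of about 1 GiB/s, `2048 GiB / 1 GiB/s ≈ 2048 s ≈ 34 min`, which falls inside the proposed 20-60 minute window. A purely streaming workload would take at least that long to cycle through the whole disk, so redownloads faster than this suggest thrashing rather than normal turnover.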
Moved to in-progress, as I am opening PRs this week and testing in staging.
Did not get to start yet, but hopeful for this week.
Discussed this just now, trying to summarize.
My open question had been the "advertised resident set size", and what it would mean to not do time-threshold-based eviction. The latter is easier, and becomes obvious once you look at our `redownloaded_at` histogram: we simply would not do it, and we wouldn't have any of those very fast redownloads.
I pointed out that if we no longer do threshold-based eviction, we might not settle on anything resembling a "resident set size", as it would be guided by the current `EvictionOrder` and the frequency of executing the more tame version or the more critical version. For a better estimate, @jcsp had been thinking of `MAX(timelines.logical_size)` with a fudge factor, or just the plain synthetic size. This would be used to advertise whether we can accept more tenants.
A post-call comparison of synthetic size sums against `sum(max(tenant.timelines.logical_size))`, on a single region where there are mostly 12k tenants per pageserver and with the special pageservers removed, gives ratios in the range 1.32 to 3.08, meaning the suggested fudge factor of 2 might work. Removing threshold-based eviction does not matter for this ratio.
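A minimal sketch of the discussed capacity heuristic, assuming a hypothetical `TenantSizes` input type (the real pageserver types and field names differ):

```rust
/// Hypothetical per-tenant input: logical sizes of its timelines, in bytes.
struct TenantSizes {
    timeline_logical_sizes: Vec<u64>,
}

/// Estimate the disk footprint to advertise for "can we accept more tenants":
/// sum of each tenant's largest timeline logical size, times a fudge factor.
/// The suggested factor of 2 comes from the observed 1.32..3.08 ratio above.
fn estimated_resident_size(tenants: &[TenantSizes], fudge_factor: u64) -> u64 {
    tenants
        .iter()
        .map(|t| t.timeline_logical_sizes.iter().copied().max().unwrap_or(0))
        .sum::<u64>()
        * fudge_factor
}
```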
Related to the very fast redownloads, there are two guesses as to the reason, which I investigated after the call:
- synthetic size calculation, where the point at which we calculate the latest logical size moves forward, could get unlucky (in relation to imitation not "catching" this)
  - one piece of evidence: `last_record_lsn` increase -- no `check_availability` operations
  - one piece of evidence: same tenant as above
- the availability check ends up redownloading those, because we don't directly imitate basebackup -- but we don't imitate basebackup because it is thought to be covered
  - couldn't find any evidence, with the really small sample size of 130 on-demand downloads
These guesses are somewhat related, however: a `check_availability` might produce WAL (at least it did at one point), so it might cause the synthetic size's logical size point to move forward.
One action point, soon to be recorded in the task list: the logging for redownloads needs to be improved, as I only added it as a histogram and these searches are really expensive.
Got to testing #5304 today due to unrelated staging problems. I need to go over the actual results on ps-7.us-east-2.aws.neon.build.
Assuming the results are sane, the next steps are:
- clean up the summary messages (semi-revert #6384, keep the `select_victims` refactoring)
- introduce a per-timeline eviction task mode which does not evict but only imitates accesses (sketched after this list)
- perhaps introduce a second mode (or don't) for disk usage based eviction
  - staging: we restart quite often, so pageserver in-memory state is reset often
  - production: we restart much more rarely, so perhaps there is no real need
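A minimal sketch of what such a per-timeline policy could look like as a configuration enum; the variant and field names here are illustrative and may not match the actual pageserver configuration exactly:

```rust
use std::time::Duration;

/// Illustrative per-timeline eviction task policy (names are assumptions).
#[derive(Debug, Clone)]
enum EvictionPolicy {
    /// The per-timeline eviction task does nothing.
    NoEviction,
    /// Imitate the internal accesses (logical size, synthetic size inputs),
    /// then evict layers not accessed within `threshold`.
    LayerAccessThreshold { period: Duration, threshold: Duration },
    /// Only imitate the internal accesses; never evict from this task.
    /// Actual eviction is left to disk usage based eviction.
    OnlyImitiate { period: Duration, threshold: Duration },
}

impl EvictionPolicy {
    /// Whether the periodic task should actually evict anything.
    fn evicts(&self) -> bool {
        matches!(self, EvictionPolicy::LayerAccessThreshold { .. })
    }
}
```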
Post-discussion afterthought: if we do disk usage based eviction before all imitations are complete, should the eviction be Lsn based or random...?
After further discussion with @jcsp and some review of the testing results, the next steps were refined:
- testing on staging without per timeline eviction task to make sure huge layer counts are not noticeable for disk usage based eviction
- enable on one production region which has high disk usage right now (50%)
Next:
- Implement the imitate-only task so that we can disable time based eviction.
- Engage the CP team to agree on a new API for exposing a size heuristic, to unblock moving to disk-only (no time-based) eviction
- Enable relative eviction in prod configs
#6491 and #6598 are ready-ish to go but I forgot the config updates from last week.
Discussion about the pageserver owning the "is it good for the next tenant" decision has barely started.
Next steps:
- Define the interface for CP for utilization
- Avoid taking tenant locks when collecting layers to evict.
The PR list has these open:
- enable relative eviction in prod -- we should merge it
- imitation-only eviction task policy -- reuses the metrics, but we shouldn't have anything different configured per tenant
- the RwLock contention fix needs its review refreshed
Next steps:
- write up an issue on the new endpoint (next bullet)
- implement the endpoint for querying how good a fit the pageserver thinks it is for the next tenant (a rough payload sketch follows below)
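A rough sketch of what the response body for that endpoint could look like; the path, field names, and scoring are placeholders for discussion, not an agreed API:

```rust
use serde::Serialize;

/// Hypothetical response for e.g. `GET /v1/utilization` on the pageserver
/// (path and fields are assumptions, not the agreed interface).
#[derive(Serialize)]
struct PageserverUtilization {
    /// Bytes currently used on the layer disk.
    disk_usage_bytes: u64,
    /// Total capacity of the layer disk.
    disk_capacity_bytes: u64,
    /// Estimated resident footprint of already-attached tenants, e.g. the
    /// fudge-factored max-logical-size sum discussed earlier in this epic.
    estimated_resident_bytes: u64,
    /// Single scalar the control plane can sort pageservers by when placing
    /// the next tenant; lower means a better candidate.
    utilization_score: u64,
}
```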
This week: testing out the imitation-only policy on staging and deciding whether we need to complicate eviction candidate discovery (#6224). With imitation only, we will finally run with a large number of layers all of the time, and disk usage based eviction will run often.
Alternatives to #6224:
- evict earlier non-hit layers after creating image layers
Before testing this week:
- the task list has a logging improvement
- a metric improvement for understanding how bad the current layer discovery is
- could also do the low-hanging-fruit optimizations there
Extra notes:
- The `try_lock` change was reverted for lack of evidence that it was the underlying cause
- So the ~10 minute hang is still probably in there: expect to see a reproduction in staging testing
New dashboard for the metrics added in #6131: https://neonprod.grafana.net/d/adecaputaszcwd/disk-usage-based-eviction?orgId=1 -- so far there have not been any disk usage based evictions on staging.
Work this week:
- staging shows a performance issue with `struct Layer` or disk usage based eviction collection
- further testing in staging together with the OnlyImitiate policy
- we will likely roll out continuous disk usage based eviction to a single pageserver in prod, in a region which has great tenant imbalance
Last week:
- the performance issue was identified on staging, and #7030 was created
- trouble creating even 10 GB pgbench databases, due to the primary key query repeatedly being interrupted by a SIGHUP (https://github.com/neondatabase/cloud/issues/11023)
This week:
- split up #7030, get reviews throughout the week
- migrate more tenants onto pageserver-1.eu-west-1
Note to self: this is about hangs in disk usage based eviction while collecting layers.
Latest troubles in staging have provided good ground for disk usage based eviction runs (pageserver-[01].eu-west-1), listing the examined outliers after #6131:
2024-03-14T12:04:55.501895Z INFO disk_usage_eviction_task:iteration{iteration_no=1093}: collection took longer than threshold tenant_id=9d984098974b482e25f8b85560f9bba3 shard_id=0000 elapsed_ms=15367
Participated in 9 downloads.
2024-03-14T12:15:44.980494Z INFO disk_usage_eviction_task:iteration{iteration_no=1155}: collection took longer than threshold tenant_id=a992f0c69c3d69b7338586750ba3f9c1 shard_id=0000 elapsed_ms=12523
Participated in 1 download.
2024-03-14T12:18:45.162630Z INFO disk_usage_eviction_task:iteration{iteration_no=1168}: collection took longer than threshold tenant_id=7affec0a9fdf9da5b3638894a84cb9cc shard_id=0000 elapsed_ms=13364
Participated in 1 download.
2024-03-14T12:18:59.848429Z INFO disk_usage_eviction_task:iteration{iteration_no=1168}: collection took longer than threshold tenant_id=a776112dba9d2adbb7a7746b6533125d shard_id=0000 elapsed_ms=10176
Participated in 2 downloads.
2024-03-14T12:19:27.135951Z INFO disk_usage_eviction_task:iteration{iteration_no=1168}: collection took longer than threshold tenant_id=f231e5ac37f956babb1cc98dcfb088ce shard_id=0000 elapsed_ms=17911
Participated in 1 download.
Before and after #7030:
2024-03-21T02:41:07.618648Z INFO disk_usage_eviction_task:iteration{iteration_no=1362}: collection completed elapsed_ms=4969 total_layers=83690
2024-03-21T03:53:43.072165Z INFO disk_usage_eviction_task:iteration{iteration_no=400}: collection completed elapsed_ms=135 total_layers=83695
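That is roughly a 37× reduction in collection time (4969 ms down to 135 ms) at an essentially unchanged layer count (~83.7k).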
The set of PRs culminating with #7030 also removed the "10min hang" previously observed. Later, more evidence came in that it was caused by waiting for a download. For other fixed cases, see: https://github.com/neondatabase/neon/issues/6028#issuecomment-1976443012
The `pageserver_layer_downloaded_after` metric is still not being used for alerting, because many cases in staging cause redownloads very soon after evicting. In production, the old mtime-based thrashing alert has been downgraded to a warning. It is not known why we get into this situation.
Log analysis is still too time-consuming to spot any patterns. #7030 preliminaries also included fixes for updating this metric. The best guess so far is that we get unlucky with:
1. evict a layer
2. initiate layer accesses right after

However, in the short time between (1) and (2), the PITR could have advanced just enough to warrant a new synthetic size calculation, for example.
The utilization endpoint work has just not been started.