risingwave
Explore the causes of anomalous meta misses
https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&from=1719327583739&to=1719373395832&var-namespace=reglngvty-20240625-150224&var-instance=&var-pod=All&var-component=All&var-table=All
In a recent test (running all nexmark 10k), I found that the system experienced anomalous meta cache misses. The extra latency caused by these misses drastically reduced throughput and resulted in higher barrier latency.
- [Figure: latency]
Several characteristics indicate that meta misses are occurring:
- object num and sst_meta_size
- Most meta cache refills are successful
In the test, the meta cache is only 1.2 GB, so we believe meta cache misses are possible even when refills do not fail. We assume the following scenario:
- The operator holds a pinned version and uses it to access Hummock.
- After compaction, the pinned version on the CN is updated through a version delta.
- The newly arrived version delta triggers a meta cache refill.
- The refill triggers eviction because the cache capacity is insufficient (the old version's sst meta is evicted).
  - A meta miss occurs on the next read that needs the evicted meta.
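The scenario above can be illustrated with a toy model. This is only a sketch under stated assumptions: `MetaCache` is a hypothetical LRU cache whose capacity is counted in entries rather than bytes, and the object ids are made up. It is not RisingWave's actual cache implementation or eviction policy.

```python
from collections import OrderedDict


class MetaCache:
    """Hypothetical LRU meta cache keyed by object_id (capacity in entries)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def insert(self, object_id, meta):
        # Insert or refresh the entry, then evict least recently used entries.
        self.entries[object_id] = meta
        self.entries.move_to_end(object_id)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def get(self, object_id):
        if object_id in self.entries:
            self.entries.move_to_end(object_id)
            return self.entries[object_id]
        self.misses += 1
        return None


cache = MetaCache(capacity=3)

# The operator's pinned version references objects 1..3 (made-up ids).
pinned_version = [1, 2, 3]
for oid in pinned_version:
    cache.insert(oid, f"meta-{oid}")

# Compaction produces objects 4..6; the version delta triggers a refill,
# which evicts the old version's metas because the cache is full.
for oid in [4, 5, 6]:
    cache.insert(oid, f"meta-{oid}")

# The operator still reads through the old pinned version: every lookup misses.
for oid in pinned_version:
    cache.get(oid)

print(cache.misses)
```

In this model every refill succeeds, yet all reads through the old pinned version miss, which matches the observation that successful refills and meta misses can coexist.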
Currently, the meta cache is populated on all write paths:
- When a CN finishes uploading an sstable, the writer inserts <object_id, meta> into the meta cache.
- After compaction, the refiller performs a meta cache refill based on the version delta, inserting <object_id, meta> into the meta cache.
Apart from the eviction-induced misses described above, the system should not see any other meta misses. However, I found that meta misses occur before the memory cache is full, and that they increase over time.
The hypothesized reasons are:
- Part of the information is missing when the version delta builds sst_delta_info (bias)
- The wrong object_id / meta is inserted into the meta cache
- Data written to the cache is not visible
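One way to tell these hypotheses apart is to classify each miss by what happened to that object_id beforehand. The sketch below is an assumption about how such a diagnostic could look. `MissClassifier` and its hooks (`on_insert`, `on_evict`, `on_miss`) are hypothetical names, not existing RisingWave code, and they would have to be wired into the writer, the refiller, and the cache lookup path.

```python
from collections import Counter


class MissClassifier:
    """Hypothetical diagnostic that labels each meta miss by its likely cause."""

    def __init__(self):
        self.inserted = set()   # object_ids ever inserted by any write path
        self.evicted = set()    # object_ids evicted and not re-inserted since
        self.reasons = Counter()

    def on_insert(self, object_id):
        self.inserted.add(object_id)
        self.evicted.discard(object_id)

    def on_evict(self, object_id):
        self.evicted.add(object_id)

    def on_miss(self, object_id):
        if object_id not in self.inserted:
            # Points at missing delta info or a wrong object_id on insert.
            reason = "never_inserted"
        elif object_id in self.evicted:
            # Expected miss caused by capacity eviction.
            reason = "evicted"
        else:
            # Inserted and not evicted, yet the lookup missed.
            reason = "inserted_but_not_visible"
        self.reasons[reason] += 1
        return reason


classifier = MissClassifier()
classifier.on_insert(1)
classifier.on_insert(2)
classifier.on_evict(1)

print(classifier.on_miss(1))
print(classifier.on_miss(2))
print(classifier.on_miss(3))
```

A large share of `never_inserted` misses would point toward the first two hypotheses, while `inserted_but_not_visible` would point toward the third. Misses labeled `evicted` are the expected capacity-driven case.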