
Explore the causes of anomalous meta misses

Open Li0k opened this issue 1 year ago • 1 comments

https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=P2453400D1763B4D9&from=1719327583739&to=1719373395832&var-namespace=reglngvty-20240625-150224&var-instance=&var-pod=All&var-component=All&var-table=All

In a recent test (run all nexmark 10k), I found that the system experienced anomalous meta misses. The extra latency caused by these misses drastically reduces system throughput and results in higher barrier latency.

  • Latency: [image]

  • Several characteristics indicate meta misses: [image] [image]

  • Object num and sst_meta_size: [image] [image]

  • Most meta cache refills are successful: [image]

In the test, the meta cache is only 1.2 GB, so we believe a meta cache miss is possible even if the refill does not fail. We assume the following scenario:

  1. The operator holds a pinned version and uses it to access Hummock.
  2. After compaction, the pinned version on the CN is updated through a version delta.
  3. The newly arrived version delta triggers a meta cache refill.
  4. Eviction is triggered because the cache capacity is insufficient (the SST meta of the old version is evicted).
  5. A meta miss occurs.
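
The scenario above can be reproduced with a toy model. This is only a minimal sketch, not RisingWave code: `LruMetaCache`, its entry-count capacity, and the object ids are all hypothetical, and a plain LRU policy is assumed in place of the real cache's eviction policy. A read through the old pinned version misses once the refill for the new SSTs has pushed the old metas out.

```python
from collections import OrderedDict


class LruMetaCache:
    """Toy LRU cache keyed by object_id (hypothetical; capacity counts entries, not bytes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def insert(self, object_id, meta):
        self.entries[object_id] = meta
        self.entries.move_to_end(object_id)
        # Evict least recently used entries until we fit again.
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)

    def get(self, object_id):
        if object_id in self.entries:
            self.entries.move_to_end(object_id)
            return self.entries[object_id]
        self.misses += 1  # a real system would now fetch the meta remotely
        return None


def simulate():
    cache = LruMetaCache(capacity=4)

    # Step 1: the operator pins a version that references SSTs 1..4 (all cached).
    pinned_version = [1, 2, 3, 4]
    for object_id in pinned_version:
        cache.insert(object_id, f"meta-{object_id}")

    # Steps 2-3: compaction emits SSTs 5..7; the version delta triggers a refill.
    # Step 4: each refill insert evicts one old meta (1, 2, 3 are dropped).
    for object_id in [5, 6, 7]:
        cache.insert(object_id, f"meta-{object_id}")

    # Step 5: the operator still reads through the old pinned version.
    for object_id in pinned_version:
        cache.get(object_id)

    return cache.misses


print(simulate())  # 3: SSTs 1, 2, 3 were evicted; SST 4 is still cached
```

In this model every refill succeeds, yet reads through the old version still miss, which matches the observation that refill success and meta misses can coexist.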

Currently, we populate the meta cache on all write paths:

  1. When a CN finishes uploading an SSTable, the writer inserts <object_id, meta> into the meta cache.
  2. After compaction, the refiller performs a meta cache refill based on the version delta, inserting <object_id, meta> into the meta cache.

Apart from the eviction-induced misses described above, the system should not see any other meta misses. However, I found that meta misses occur before the memory cache is full, and that they increase over time.

image

The hypothesized causes are:

  1. Some information is lost when sst_delta_info is built from the version delta (a discrepancy between the two).
  2. The wrong object_id / meta is inserted into the meta cache.
  3. Data written to the cache is not visible.
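
One way to tell these hypotheses apart is to classify each miss by what the cache has previously seen for that object_id. The sketch below is a hypothetical diagnostic, not an existing RisingWave component: `MissClassifier` and its hook names are assumptions, and it presumes the cache can report insert, evict, and miss events. A miss on an id that was never inserted would point toward hypotheses 1 or 2, while a miss on an id that was inserted and not evicted would point toward hypothesis 3.

```python
class MissClassifier:
    """Hypothetical diagnostic: bucket meta misses by the cache's history for that id."""

    def __init__(self):
        self.inserted = set()  # every object_id ever inserted
        self.evicted = set()   # ids evicted and not re-inserted since
        self.counts = {
            "evicted": 0,                 # expected misses under capacity pressure
            "never_inserted": 0,          # refill/writer never populated this id
            "inserted_but_invisible": 0,  # inserted, not evicted, yet still missed
        }

    def on_insert(self, object_id):
        self.inserted.add(object_id)
        self.evicted.discard(object_id)

    def on_evict(self, object_id):
        self.evicted.add(object_id)

    def on_miss(self, object_id):
        if object_id in self.evicted:
            kind = "evicted"
        elif object_id in self.inserted:
            kind = "inserted_but_invisible"
        else:
            kind = "never_inserted"
        self.counts[kind] += 1
        return kind


classifier = MissClassifier()
classifier.on_insert(1)
classifier.on_evict(1)
print(classifier.on_miss(1))  # evicted
print(classifier.on_miss(2))  # never_inserted
```

If misses before the cache fills land mostly in `never_inserted`, the write paths are not covering every object; if they land in `inserted_but_invisible`, the problem is more likely inside the cache itself.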

Li0k, Jun 26 '24 16:06