Add docs on storage engine WAL failover

Open rmloveland opened this issue 1 year ago • 10 comments

Fixes:

  • DOC-9709
  • DOC-9916
  • DOC-9925
  • DOC-10149

Summary of changes:

  • Add a new section to the cockroach start docs describing the WAL failover feature, how to enable and disable it, and the logging config changes required when the feature is enabled

  • Add a new section to 'Monitoring and Alerting' docs describing the store status endpoint at _status/stores

  • Update logging docs to add some anchor links so we can refer to specific config settings from the WAL failover docs

  • Update v24.1 alpha release notes to link to the WAL failover docs

rmloveland avatar May 01 '24 15:05 rmloveland

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
Latest commit 094dd7f2a8aca3c2436a8c1f94038613649d4586
Latest deploy log https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/66464b647ffaa000081bbc06

netlify[bot] avatar May 01 '24 15:05 netlify[bot]

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
Latest commit 094dd7f2a8aca3c2436a8c1f94038613649d4586
Latest deploy log https://app.netlify.com/sites/cockroachdb-api-docs/deploys/66464b643600430008f637a4

netlify[bot] avatar May 01 '24 15:05 netlify[bot]

Netlify Preview

Name Link
Latest commit 094dd7f2a8aca3c2436a8c1f94038613649d4586
Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/66464b648dbce800093953e1
Deploy Preview https://deploy-preview-18511--cockroachdb-docs.netlify.app

netlify[bot] avatar May 01 '24 15:05 netlify[bot]

Hi folks, I added each of you to review for the following reasons/areas, but please feel free to comment on anything you see that is missing/incorrect/etc as well:

  • @jbowens for WAL failover stuff
  • @abarganier for async file logging stuff
  • @mwang1026 for overall vibe

rmloveland avatar May 07 '24 16:05 rmloveland

A few minor edit suggestions -- notably a few more places where I think we should mention that the feature is in PREVIEW

Updated in the places where you mentioned. Do we also want to add this to the list of Features in Preview for v24.1? My assumption is yes but figured I'd ask while you're here

I think we should also open a separate PR for how to monitor for failover -- the metrics to watch, how to inspect them, etc. -- thoughts?

That makes sense, I can make a followup PR - @jbowens I found the following metrics on the custom chart debug page of a 24.1 RC cluster. Which ones do you think make sense to monitor and what values should one alert on? Based on looking I'd guess switch.count could be a starting point? followed by the durations? but I'm just guessing :-)

  • storage.wal.failover.primary.duration
  • storage.wal.failover.secondary.duration
  • storage.wal.failover.switch.count
  • storage.wal.failover.write_and_sync.latency-avg
  • storage.wal.failover.write_and_sync.latency-count
  • storage.wal.failover.write_and_sync.latency-max
  • storage.wal.failover.write_and_sync.latency-sum
  • storage.wal.failover.write_and_sync.latency-p50
  • storage.wal.failover.write_and_sync.latency-p75
  • storage.wal.failover.write_and_sync.latency-p90
  • storage.wal.failover.write_and_sync.latency-p99
  • storage.wal.failover.write_and_sync.latency-p99.9
  • storage.wal.failover.write_and_sync.latency-p99.99
  • storage.wal.failover.write_and_sync.latency-p99.999

rmloveland avatar May 10 '24 19:05 rmloveland

Yeah, I think it makes sense to document those first three metrics:

storage.wal.failover.primary.duration
storage.wal.failover.secondary.duration
storage.wal.failover.switch.count

storage.wal.failover.secondary.duration is probably the most interesting. Customers will generally expect it to be zero unless there's a failover. After that, they might care about how long it remains non-zero, because that indicates the health of the primary.
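
As a rough illustration of watching these three metrics, here is a minimal Python sketch that compares two scrapes in Prometheus text format and reports whether any time was spent on the secondary in between. It rests on several assumptions that are not confirmed in this thread: that the metrics are exported with dots replaced by underscores, that the values are cumulative (so growth between scrapes is the signal), and that summing across label sets such as stores is acceptable. The helper names `prom_name`, `parse_scrape`, and `failover_signal` are hypothetical.

```python
from typing import Dict, Iterable

# Metric names exactly as listed in this thread.
WATCHED = (
    "storage.wal.failover.primary.duration",
    "storage.wal.failover.secondary.duration",
    "storage.wal.failover.switch.count",
)


def prom_name(metric: str) -> str:
    """Assumed export form: dots and dashes become underscores."""
    return metric.replace(".", "_").replace("-", "_")


def parse_scrape(text: str, metrics: Iterable[str] = WATCHED) -> Dict[str, float]:
    """Sum each watched metric across all label sets in one scrape."""
    wanted = {prom_name(m): m for m in metrics}
    totals: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(None, 1)[0]
        if name not in wanted:
            continue
        # The value is the first field after the label set (or after the name).
        rest = line[line.rfind("}") + 1:] if "{" in line else line[len(name):]
        value = float(rest.split()[0])
        key = wanted[name]
        totals[key] = totals.get(key, 0.0) + value
    return totals


def failover_signal(prev: Dict[str, float], curr: Dict[str, float]) -> Dict[str, object]:
    """Compare two scrapes; assumes cumulative values and no restart in between."""
    sec = "storage.wal.failover.secondary.duration"
    sw = "storage.wal.failover.switch.count"
    sec_delta = curr.get(sec, 0.0) - prev.get(sec, 0.0)
    sw_delta = curr.get(sw, 0.0) - prev.get(sw, 0.0)
    return {
        "secondary_duration_delta": sec_delta,
        "switch_count_delta": sw_delta,
        "wrote_to_secondary": sec_delta > 0,
    }
```

Feeding `failover_signal` the parsed output of two consecutive scrapes gives a simple "did we write to the secondary since the last check" flag. Which thresholds to alert on is left open, as it is in the discussion above.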

jbowens avatar May 13 '24 19:05 jbowens

Thanks @jbowens - I've filed https://cockroachlabs.atlassian.net/browse/DOC-10268 and will do that as a followup after we get this PR in

@mwang1026 are you good with this given the recent updates based on your feedback? Remaining open question is if you also want a blurb in https://www.cockroachlabs.com/docs/v24.1/cockroachdb-feature-availability.html#features-in-preview or if you'd rather this feature did not show up there. I believe our practice is to also list it there but maybe you don't want it there, idk

rmloveland avatar May 13 '24 20:05 rmloveland

@mwang1026 I went ahead and added WAL failover to the list of preview features since AFAICT we do that for everything else in Preview

Let me know if you're good with the other changes and I'll send this along for docs team review so I can get it merged ASAP

Thanks!

rmloveland avatar May 14 '24 15:05 rmloveland

@florence-crl this is RFAL from a docs POV now

in terms of sequencing, this should go in first, then #18548

rmloveland avatar May 15 '24 14:05 rmloveland

@florence-crl thanks for the helpful review. I've incorporated everything from your first pass AFAICT - PTAL!

rmloveland avatar May 16 '24 15:05 rmloveland