Add docs on storage engine WAL failover
Fixes:
- DOC-9709
- DOC-9916
- DOC-9925
- DOC-10149
Summary of changes:
- Add a new section to `cockroach start` describing the WAL failover feature, how to enable/disable, and the related logging config changes that are needed if you enable the feature
- Add a new section to 'Monitoring and Alerting' docs describing the store status endpoint at `_status/stores`
- Update logging docs to add some anchor links so we can refer to specific config settings from the WAL failover docs
- Update v24.1 alpha release notes to link to the WAL failover docs
Files changed:
- src/current/_includes/releases/v24.1/v24.1.0-alpha.4.md
- src/current/v24.1/cockroach-start.md
- src/current/v24.1/cockroachdb-feature-availability.md
- src/current/v24.1/monitoring-and-alerting.md
Deploy Preview for cockroachdb-interactivetutorials-docs canceled.
| Name | Link |
|---|---|
| Latest commit | 094dd7f2a8aca3c2436a8c1f94038613649d4586 |
| Latest deploy log | https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/66464b647ffaa000081bbc06 |
Deploy Preview for cockroachdb-api-docs canceled.
| Name | Link |
|---|---|
| Latest commit | 094dd7f2a8aca3c2436a8c1f94038613649d4586 |
| Latest deploy log | https://app.netlify.com/sites/cockroachdb-api-docs/deploys/66464b643600430008f637a4 |
Netlify Preview
| Name | Link |
|---|---|
| Latest commit | 094dd7f2a8aca3c2436a8c1f94038613649d4586 |
| Latest deploy log | https://app.netlify.com/sites/cockroachdb-docs/deploys/66464b648dbce800093953e1 |
| Deploy Preview | https://deploy-preview-18511--cockroachdb-docs.netlify.app |
Hi folks, I added each of you to review for the following reasons/areas, but please feel free to comment on anything you see that is missing/incorrect/etc as well:
- @jbowens for WAL failover stuff
- @abarganier for async file logging stuff
- @mwang1026 for overall vibe
A few minor edit suggestions -- notably a few more places where I think we should mention that the feature is in `PREVIEW`
Updated in the places where you mentioned. Do we also want to add this to the list of Features in Preview for v24.1? My assumption is yes but figured I'd ask while you're here
I think we should also open a separate PR for how to monitor for failover -- the metrics to watch, how to inspect them, etc. -- thoughts?
That makes sense, I can make a followup PR. @jbowens I found the following metrics on the custom chart debug page of a 24.1 RC cluster. Which ones do you think make sense to monitor, and what values should one alert on? From a quick look I'd guess switch.count could be a starting point, followed by the durations, but I'm just guessing :-)
- storage.wal.failover.primary.duration
- storage.wal.failover.secondary.duration
- storage.wal.failover.switch.count
- storage.wal.failover.write_and_sync.latency-avg
- storage.wal.failover.write_and_sync.latency-count
- storage.wal.failover.write_and_sync.latency-max
- storage.wal.failover.write_and_sync.latency-sum
- storage.wal.failover.write_and_sync.latency-p50
- storage.wal.failover.write_and_sync.latency-p75
- storage.wal.failover.write_and_sync.latency-p90
- storage.wal.failover.write_and_sync.latency-p99
- storage.wal.failover.write_and_sync.latency-p99.9
- storage.wal.failover.write_and_sync.latency-p99.99
- storage.wal.failover.write_and_sync.latency-p99.999
Yeah, I think it makes sense to document those first three metrics:
- storage.wal.failover.primary.duration
- storage.wal.failover.secondary.duration
- storage.wal.failover.switch.count
The storage.wal.failover.secondary.duration metric is probably the most interesting. Customers will generally expect this to be zero unless there's a failover. Then they might care about how long it remains non-zero, because that gives an indication of the health of the primary.
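For the followup, here is a rough sketch of how one might check these three metrics from two scrapes of a node's Prometheus-format metrics output. It rests on assumptions that are not confirmed in this thread: that the exported names are the dotted names above with dots replaced by underscores, that values may carry per-store labels, and that the duration metrics are cumulative (so growth between scrapes means time was spent writing to the secondary). The helper names `parse_failover_metrics` and `failover_status` are hypothetical, and no alert thresholds are implied.

```python
import re

# ASSUMPTION: Prometheus-style names are the dotted metric names with "." -> "_".
PRIMARY = "storage_wal_failover_primary_duration"
SECONDARY = "storage_wal_failover_secondary_duration"
SWITCHES = "storage_wal_failover_switch_count"
FAILOVER_METRICS = (PRIMARY, SECONDARY, SWITCHES)

# Matches "name{labels} value" or "name value" in the Prometheus text format.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_failover_metrics(text):
    """Sum each failover metric across all label sets (e.g. stores) in one scrape."""
    totals = {name: 0.0 for name in FAILOVER_METRICS}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1) in totals:
            totals[match.group(1)] += float(match.group(3))
    return totals


def failover_status(prev, curr):
    """Compare two parsed scrapes.

    ASSUMPTION: the metrics are cumulative, so a positive delta in the
    secondary duration means the WAL was written to the secondary in between.
    """
    secondary_delta = curr[SECONDARY] - prev[SECONDARY]
    return {
        "new_switches": curr[SWITCHES] - prev[SWITCHES],
        "secondary_time_delta": secondary_delta,
        "wrote_to_secondary": secondary_delta > 0,
    }


# Hypothetical scrape contents, for illustration only (not real cluster output).
EXAMPLE_PREV = """\
# TYPE storage_wal_failover_switch_count counter
storage_wal_failover_switch_count{store="1"} 0
storage_wal_failover_secondary_duration{store="1"} 0
storage_wal_failover_primary_duration{store="1"} 1000
"""
EXAMPLE_CURR = """\
storage_wal_failover_switch_count{store="1"} 1
storage_wal_failover_secondary_duration{store="1"} 500
storage_wal_failover_primary_duration{store="1"} 1200
"""

if __name__ == "__main__":
    status = failover_status(
        parse_failover_metrics(EXAMPLE_PREV),
        parse_failover_metrics(EXAMPLE_CURR),
    )
    print(status)
```

Summing across label sets keeps the sketch short; a per-store breakdown would probably be more useful in practice, since a failover is a per-store event.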
Thanks @jbowens - I've filed https://cockroachlabs.atlassian.net/browse/DOC-10268 and will do that as a followup after we get this PR in
@mwang1026 are you good with this given the recent updates based on your feedback? Remaining open question is if you also want a blurb in https://www.cockroachlabs.com/docs/v24.1/cockroachdb-feature-availability.html#features-in-preview or if you'd rather this feature did not show up there. I believe our practice is to also list it there but maybe you don't want it there, idk
@mwang1026 I went ahead and added WAL failover to the list of preview features since AFAICT we do that for everything else in Preview
Let me know if you're good with the other changes and I'll send this along for docs team review so I can get it merged ASAP
Thanks!
@florence-crl this is RFAL from a docs POV now
in terms of sequencing, this should go in first, then #18548
@florence-crl thanks for the helpful review. I've incorporated everything from your first pass AFAICT - PTAL!