docs Guidance for automating pre-rolling upgrade checks

Jesse Seldess (jseldess) commented:

Our docs list various checks before starting a rolling upgrade, but these checks are manual. A customer requested guidance on automating these checks, a reasonable request. If possible, we should identify the right approach and document it. I think these are the relevant checks:

Make sure no bulk imports or schema changes are in progress.
Make sure all nodes are live or fully decommissioned.
Make sure no ranges are under-replicated or unavailable.
Make sure all nodes are on the same version of CockroachDB.
Make sure capacity and memory are ok (other issue to clarify this).

cc @dbist, @roncrdb, @BramGruneir

Jira Issue: DOC-395

Dec 11 '19 17:12 jseldess

cc @robert-s-lee, @glennfawcett, @a-entin

Dec 11 '19 17:12 dbist

Note that except for "make sure no bulk imports or schema changes are in progress", these are not specific to pre-upgrade checks. These are all things you should be monitoring continuously, and if any of them stops being true, the admin should be alerted. We should emphasize automating this by setting up continuous monitoring with Prometheus, not a pre-upgrade script.
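
As a rough illustration of the continuous-monitoring idea, the sketch below parses Prometheus text-format output, such as a node serves at `/_status/vars`, and reports which of two range gauges are non-zero. The sample payload and the helper names `sum_metric` and `alerts` are made up for illustration. In a real deployment this logic would more likely live in Prometheus alerting rules than in a script.

```python
# Hypothetical sample of Prometheus text-format output; real payloads are much larger.
SAMPLE = """\
# TYPE ranges_unavailable gauge
ranges_unavailable{store="1"} 0
ranges_unavailable{store="2"} 0
# TYPE ranges_underreplicated gauge
ranges_underreplicated{store="1"} 3
ranges_underreplicated{store="2"} 1
"""


def sum_metric(text, name):
    """Sum every sample of metric `name` across all label sets.

    Assumes each sample line is "<metric>{labels} <value>" with no timestamp.
    """
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        # Drop the label set: ranges_unavailable{store="1"} -> ranges_unavailable
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def alerts(text):
    """Return the names of the watched gauges whose cluster-wide sum is non-zero."""
    watched = ("ranges_unavailable", "ranges_underreplicated")
    return [m for m in watched if sum_metric(text, m) > 0]


print(alerts(SAMPLE))
```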

Regarding bulk imports and schema changes, I might remove this from the checklist. It's either going to be trivial to enforce (because it's the same people who would do the upgrade and apply schema changes), or impractical (because it's different people). If we think it's worthwhile to avoid these operations during an upgrade, we need to provide controls in the product to prevent them, not just a verbal guideline to check that nothing's happening.

Dec 16 '19 18:12 bdarnell

@BramGruneir, did you have any time to look into how to automate more of this?

Jan 07 '20 18:01 jseldess

Maybe next week. This week is too busy.

But yes, we should look into this.

Jan 07 '20 20:01 BramGruneir

OK. Thanks, Bram.

Jan 07 '20 20:01 jseldess

Probably worth thinking about automating the checks between upgrading nodes as well: https://github.com/cockroachdb/docs/issues/6319

Jan 09 '20 20:01 jseldess

I started writing out some of the queries for this. Not sure if we want to automate with SQL or REST API but here's a start.

-- Node Status (should be zero)

select 'nodes draining or decommissioning' as status, count(*) as amt from crdb_internal.gossip_nodes n, crdb_internal.gossip_liveness l where n.node_id = l.node_id and n.is_live = true and ( l.draining = true or l.decommissioning = true);

-- Ranges Unavailable (should be zero)

select 'ranges unavailable' as status, sum((metrics->>'ranges.unavailable')::DECIMAL)::INT AS amt from crdb_internal.kv_store_status;

-- Under replicated (should be zero)

select 'ranges underreplicated' as status, sum((metrics->>'ranges.underreplicated')::DECIMAL)::INT8 AS amt FROM crdb_internal.kv_store_status;

-- Running Jobs / Schema Changes (should be zero)

select 'active jobs' as status, count(*) as amt from crdb_internal.jobs where status in ('running','paused');

-- Same versions (should be 1)

select 'versions detected' as status, count(*) as amt from (select distinct server_version from crdb_internal.gossip_nodes);

-- Storage Capacity (ratio should be < .7)

select 'storage capacity', sum((metrics->>'capacity.used')::DECIMAL)::INT8 / sum((metrics->>'capacity.available')::DECIMAL)::INT8 as ratio from crdb_internal.kv_store_status;

-- Memory Capacity -- *** Need help on this one ***

-- CPU (should be < .5)

select 'normalized cpu', avg(cast( metrics->>'sys.cpu.combined.percent-normalized' as DECIMAL )) from crdb_internal.kv_node_status;
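
If the checks are automated with SQL, a script still has to compare each result with its expected value. The following is only a sketch of that comparison step, assuming the query results have already been fetched as (label, value) pairs by some SQL driver. The labels mirror the `status` strings in the queries above with spelling normalized, and `EXPECTATIONS` and `failed_checks` are hypothetical names.

```python
# Hypothetical expectations, one per check. Each predicate returns True on a pass.
EXPECTATIONS = {
    "nodes draining or decommissioning": lambda v: v == 0,
    "ranges unavailable": lambda v: v == 0,
    "ranges underreplicated": lambda v: v == 0,
    "active jobs": lambda v: v == 0,
    "versions detected": lambda v: v == 1,
    "storage capacity": lambda v: v < 0.7,
    "normalized cpu": lambda v: v < 0.5,
}


def failed_checks(rows):
    """Return the (label, value) rows that miss their expectation.

    Unknown labels and NULL (None) values count as failures so that
    nothing is silently skipped.
    """
    failures = []
    for label, value in rows:
        check = EXPECTATIONS.get(label)
        if check is None or value is None or not check(value):
            failures.append((label, value))
    return failures


# Made-up sample results, not real cluster output.
sample = [
    ("nodes draining or decommissioning", 0),
    ("ranges unavailable", 0),
    ("ranges underreplicated", 2),
    ("active jobs", 0),
    ("versions detected", 1),
    ("storage capacity", 0.41),
    ("normalized cpu", 0.63),
]
print(failed_checks(sample))
```

An empty list means every check passed. A wrapper script could exit non-zero otherwise and block the upgrade.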

Jun 09 '20 03:06 chriscasano

Here's an update to the under-replicated ranges query, taken from https://github.com/cockroachdb/cockroach/issues/51304:

SELECT sum((metrics->>'ranges.underreplicated')::DECIMAL)::INT8 AS ranges_underreplicated FROM crdb_internal.kv_store_status JOIN crdb_internal.gossip_liveness USING (node_id) WHERE NOT decommissioning;

Jul 14 '20 18:07 chriscasano

The range reports are super useful here: https://www.cockroachlabs.com/docs/stable/query-replication-reports.html#system-replication_stats

They should show under-replicated ranges as well.
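
A hedged sketch of how rows from that report might be checked, assuming they have already been fetched. The field names below are an assumed subset of the columns in `system.replication_stats` and should be verified against the linked page. `ReplicationStat` and `unhealthy_zones` are hypothetical names.

```python
from collections import namedtuple

# Assumed subset of the system.replication_stats columns; verify before relying on it.
ReplicationStat = namedtuple(
    "ReplicationStat",
    "zone_id total_ranges unavailable_ranges under_replicated_ranges",
)


def unhealthy_zones(stats):
    """Return sorted zone IDs reporting any unavailable or under-replicated ranges."""
    return sorted(
        {
            s.zone_id
            for s in stats
            if s.unavailable_ranges > 0 or s.under_replicated_ranges > 0
        }
    )


# Made-up rows for illustration only.
rows = [
    ReplicationStat(0, 120, 0, 0),
    ReplicationStat(52, 8, 0, 3),
    ReplicationStat(53, 4, 1, 0),
]
print(unhealthy_zones(rows))
```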

Jul 14 '20 21:07 BramGruneir
