GRANT hanging, login attempt causes instability
Describe the problem
Maybe related to https://github.com/cockroachdb/cockroach/issues/16106.
I ran a GRANT SELECT ON mydb.myschema.* TO myuser to give a user SELECT privileges on all tables in a given database. That GRANT statement never finished, and on the Jobs page I see a job that says "updating privileges for table 224" which has been sitting there for several days. According to SHOW JOBS it has been attempted 12 times. In the logs on the coordinator I see this:
/var/log/cockroach2/cockroach.log:W220808 14:39:40.074802 615536578 sql/schema_changer.go:2679 ⋮ [n3,job=785306286124433455] 7662771 error while running schema change, retrying: ‹batch timestamp 1636385765.078926224,0 must be after replica GC threshold 1652577945.545458359,0›
What's extra unfortunate is what happened when I tried to log in as the user being granted permissions. I got ERROR: internal error while retrieving user account memberships: operation "get-user-timeout" timed out after 10s (given timeout 10s): internal error while retrieving user account: context deadline exceeded -- and then the cluster started behaving really wonky: I couldn't log in as other users, and the DB web console stopped responding. The nodes were spitting this out in their logs:
W220808 14:33:35.519990 589026460 sql/user.go:237 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202195 user lookup for ‹"walker"› failed: context deadline exceeded
W220808 14:33:35.520115 589026460 sql/user.go:150 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202196 user membership lookup for ‹"walker"› failed: ‹operation "get-user-timeout" timed out after 10s (given timeout 10s)›: internal error while retrieving user account: context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 user retrieval failed for user=‹"walker"›: internal error while retrieving user account memberships: ‹operation "get-user-timeout" timed out after 10s (given timeout 10s)›: internal error while retrieving user account: context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 +(1) attached stack trace
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + -- stack trace:
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | github.com/cockroachdb/cockroach/pkg/sql/user.go:151
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).handleAuthentication
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth.go:144
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:689
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | runtime.goexit
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | GOROOT/src/runtime/asm_amd64.s:1581
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 +Wraps: (2) internal error while retrieving user account memberships
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 +Wraps: (3) ‹operation "get-user-timeout" timed out after 10s (given timeout 10s)›: internal error while retrieving user account: context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | -- cause hidden behind barrier
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | ‹operation "get-user-timeout" timed out after 10s (given timeout 10s)›: internal error while retrieving user account: context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | (1) ‹operation "get-user-timeout" timed out after 10s (given timeout 10s)›
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | Wraps: (2) attached stack trace
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | -- stack trace:
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/user.go:238
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func2
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/user.go:121
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/user.go:194
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/user.go:118
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).handleAuthentication
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth.go:144
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:689
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | runtime.goexit
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | GOROOT/src/runtime/asm_amd64.s:1581
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | Wraps: (3) internal error while retrieving user account
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | Wraps: (4) context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | -- cause hidden behind barrier
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | (1) context deadline exceeded
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | | Error types: (1) context.deadlineExceededError
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 + | Error types: (1) *contextutil.TimeoutError (2) *withstack.withStack (3) *errutil.withPrefix (4) *barriers.barrierErr
W220808 14:33:35.520256 589026460 sql/pgwire/auth.go:152 ⋮ [n2,client=10.240.241.22:53936,user=‹walker›] 4202197 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *barriers.barrierErr
Since this made the cluster unstable I had to act immediately; restarting one node seemed to resolve the issue, so I'm rolling the rest of the cluster too. I am able to log in now with the usual users, but I'm hesitant to even try logging in again as the user I was trying to grant permissions to, lest it blow up again. Thankfully I don't need that user to work at this moment.
To Reproduce
Just what I wrote above. The cluster has 39 nodes, ~500k ranges, and does 10-30k transactions/sec.
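For reference, a rough sketch of the statements involved (mydb, myschema, and myuser are placeholders standing in for the real names, and the jobs query is just one way to spot the stuck privilege-update job, not necessarily how I found it):

GRANT SELECT ON mydb.myschema.* TO myuser;
-- from another session, look for the hung privilege-update job:
SELECT job_id, status, description FROM [SHOW JOBS] WHERE description LIKE '%updating privileges%';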
Environment:
- CockroachDB version v22.1.5 -- this upgrade was recent, but it's unknown whether the problem is related; I haven't done much with users/grants in a while.
- Server OS: CentOS 7
Jira issue: CRDB-18415
Hello, I am Blathers. I am here to help you get the issue triaged.
Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.
I have CC'd a few people who may be able to assist you:
- @cockroachdb/sql-schema (found keywords: schema change)
- @bdarnell (author of cockroachdb/cockroach#16106)
- @cockroachdb/cdc (found keywords: show jobs)
If we have not gotten back to your issue within a few business days, you can try the following:
- Join our community slack channel and ask on #cockroachdb.
- Try to find someone from here if you know they worked closely on the area and CC them.
:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
I just realized that there are 2 other jobs failing, so their failure could be related.
One is a REFRESH MATERIALIZED VIEW, the other is "Migration to 21.2-56: populate RangeAppliedState.RaftAppliedIndexTerm for all ranges"; both are in a RETRYING state.
The latter job has this error in the jobs table:
{"running execution from '2022-08-06 05:10:20.186118' to '2022-08-06 05
:10:23.427546' on 55 failed: non-cancelable: running migration for 21.2-56: running iterate callback: could not dial n69: failed to connect to n69 at seu-pid221.pdx.turnitin.io:26258: initial connection heartbeat failed: rpc error: code = U
navailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.240.4.41:26258: connect: connection refused\"","running execution from '2022-08-06 22:14:37.93927' to '2022-08-06 22:14:38.041508' on 52 failed: non-can
celable: running migration for 21.2-56: running iterate callback: rpc error: code = Unavailable desc = node waiting for init; /cockroach.storage.PerReplica/WaitForApplication not available","running execution from '2022-08-07 22:14:50.41053
3' to '2022-08-07 22:14:50.487196' on 3 failed: non-cancelable: running migration for 21.2-56: running iterate callback: rpc error: code = Unavailable desc = node waiting for init; /cockroach.storage.PerReplica/WaitForApplication not availa
ble"}
The node referenced in that message is one I had to decom on Friday because of a separate issue. I was going to file that as its own CRDB bug today but didn't have time. That node crashed and wouldn't come back up because of the following message:
ERROR: startup forbidden by prior critical alert
DETAIL: From /srv/disk5/cockroach-data2/auxiliary/_CRITICAL_ALERT.txt:
ATTENTION:
this node is terminating because a replica inconsistency was detected between [n69,s413,r456341/79:/Table/75/2/"dlib.si"/"www"/"…]
and its other replicas. Please check your cluster-wide log files for more
information and contact the CockroachDB support team. It is not necessarily safe
to replace this node; cluster data may still be at risk of corruption.
A checkpoints directory to aid (expert) debugging should be present in:
/srv/disk5/cockroach-data2/auxiliary
A file preventing this node from restarting was placed at:
/srv/disk5/cockroach-data2/auxiliary/_CRITICAL_ALERT.txt
I decommed that node so maybe the "Migration to 21.2-56" job will succeed next time it tries.
Thanks for this report. Out of curiosity, how many roles/users do you have in your cluster?
- Most created a year ago or more.
I have a debug zip if you want it; the thing is, there are 2 nodes whose logs won't be included. One is the one that died last week and is reporting the "startup forbidden by prior critical alert" error. The other is a node that is still misbehaving as of this morning: it starts but never seems to come up. To provide logs for those I will need to manually zip up the log directories.
Another node just died and won't start back up with this message:
ERROR: startup forbidden by prior critical alert
DETAIL: From /srv/disk4/cockroach-data2/auxiliary/_CRITICAL_ALERT.txt:
ATTENTION:
this node is terminating because a replica inconsistency was detected between [n51,s304,r465117/63:/Table/75/2/"fatih.edu.tr"/"i…]
and its other replicas. Please check your cluster-wide log files for more
information and contact the CockroachDB support team. It is not necessarily safe
to replace this node; cluster data may still be at risk of corruption.
A checkpoints directory to aid (expert) debugging should be present in:
/srv/disk4/cockroach-data2/auxiliary
A file preventing this node from restarting was placed at:
/srv/disk4/cockroach-data2/auxiliary/_CRITICAL_ALERT.txt
Before I lose too many nodes here, I'm going to need to wipe these and add them back into the cluster as fresh nodes.
This may help explain why that migration isn't finishing:
I220809 19:33:30.973834 96742 migration/migrationmanager/manager.go:372 ⋮ [n47,intExec=‹set-version›,migration-mgr] 109 found existing migration job 785222573203750976 for version 21.2-56 in status running, waiting
I220809 19:33:30.973998 104444 jobs/adopt.go:240 ⋮ [n47,intExec=‹set-version›,migration-mgr] 110 job 785222573203750976: resuming execution
E220809 19:33:30.978551 104444 jobs/adopt.go:206 ⋮ [n47,intExec=‹set-version›,migration-mgr] 111 could not run claimed job 785222573203750976: job 785222573203750976: claim with session id 2ffc9912fd874d70aa75ada1bd19be01 does not exist
I think what's going on is that the job is failing due to https://github.com/cockroachdb/cockroach/issues/84073 and then backing off. You can PAUSE JOB 785222573203750976 / RESUME JOB 785222573203750976 to reset the backoff. The next point release in 22.1 has robustness and checkpointing improvements for that migration; for large clusters with many ranges, we've observed that migration to be particularly fragile.
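Concretely, that's just the following two statements (job ID taken from the log lines above):

PAUSE JOB 785222573203750976;
-- once it has actually moved to the paused state:
RESUME JOB 785222573203750976;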
FWIW, I don't think the migration is related to the grant issues or to your inconsistency issues.
I suspect that whatever was causing the cluster inconsistency is the thing that is making GRANT hang. The part of logging in that was timing out is loading the role hierarchy, and I suppose it's also possible that the inconsistency could affect that operation too.
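(For context, the role hierarchy that login walks is ultimately stored in system.role_members. As a rough illustration only, and assuming you have privileges to read the system tables, something like this shows the memberships for the affected user:

SELECT role, member, "isAdmin" FROM system.role_members WHERE member = 'walker';)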
The node referenced in that message is one I had to decom on Friday because of a separate issue. I was going to file it as it's own CRDB bug today but didn't have time. That node crashed and wouldn't come back
I think at this point it could be worth filing that separate issue to try to figure out if there was a bug. We may not be able to investigate it much, but it still could be useful to have that issue for reference.
Created https://github.com/cockroachdb/cockroach/issues/85985
I'll close this based on the conversation in the linked issue.