cockroach
cockroach copied to clipboard
Panic when running UNION on SHOW BACKUP queries as non-root user
Describe the problem
If a query issued by a non-root user UNIONs the result of multiple SHOW BACKUP queries, then CRDB crashes with an exit code of 2.
To Reproduce
- Create a user and grant perms.
root@localhost:26257/defaultdb> create user test;
CREATE ROLE
Time: 25ms total (execution 25ms / network 0ms)
root@localhost:26257/defaultdb> grant admin to test;
GRANT
Time: 29ms total (execution 29ms / network 0ms)
- Take a backup
test@localhost:26257/defaultdb> BACKUP INTO 'userfile:///backups';
job_id | status | fraction_completed | rows | index_entries | bytes
---------------------+-----------+--------------------+------+---------------+--------
798619757264404481 | succeeded | 1 | 24 | 9 | 1636
(1 row)
Time: 466ms total (execution 466ms / network 0ms)
test@localhost:26257/defaultdb> SHOW BACKUPS in 'userfile:///backups';
path
-------------------------
/2022/09/21-195008.34
(1 row)
Time: 23ms total (execution 23ms / network 0ms)
- Union the result of 2 SHOW BACKUP queries, this might take a few tries to get the race condition to occur.
test@localhost:26257/defaultdb> SELECT * FROM [SHOW BACKUP LATEST IN 'userfile:///backups'] UNION SELECT * FROM [SHOW BACKUP LATEST IN 'userfile:///backups'];
ERROR: unexpected EOF
warning: error retrieving the transaction status: connection closed unexpectedly: conn closed
warning: connection lost!
opening new connection: all session settings will be lost
warning: error retrieving the database name: failed to connect to `host=localhost user=test database=`: dial error (dial tcp [::1]:26257: connect: connection refused)
- CRDB should have crashed. The interesting stack traces should be the ones that include the backup call.
I220921 13:50:12.719035 1 util/log/flags.go:211 [-] 1 stderr capture started
fatal error: concurrent map read and map write
goroutine 11141 [running]:
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.(*StoredCatalog).UpdateValidationLevel(0xc005dc0580, {0x7fd447d2a088, 0xc0000d5ef0}, 0x7)
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/stored_catalog.go:374 +0x94
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).finalizeDescriptors(0xc005dc04e0, {0x5f95168, 0xc001b42720}, 0x0?, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, ...)
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:395 +0x45b
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getByName(0xc005dc04e0, {0x5f95168, 0xc001b42720}, 0xc0028de2c0, {0x0?, 0x0?}, {0x0?, 0x0?}, {0x510253f, 0x6}, ...)
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:324 +0x5c5
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getDatabaseByName(0x203000?, {0x5f95168?, 0xc001b42720?}, 0x0?, {0x510253f, 0x6}, {0x0, 0x0, 0x0, 0x0, ...})
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/database.go:64 +0x88
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetImmutableDatabaseByName(...)
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/database.go:45
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getObjectByNameIgnoringRequiredAndType(0xc005dc04e0, {0x5f95168, 0xc001b42720}, 0xc005868f58?, {0x510253f, 0x6}, {0x5101fc9, 0x6}, {0x511b386, 0xc}, ...)
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/object.go:160 +0x13d
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getObjectByName(0x2?, {0x5f95168, 0xc001b42720}, 0x0?, {0x510253f?, 0xc45f80?}, {0x5101fc9?, 0x1945786?}, {0x511b386, 0xc}, ...)
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/object.go:65 +0x1b8
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getTableByName(0x1000000000000?, {0x5f95168, 0xc001b42720}, 0x17303c0?, {0x5fc7720, 0x9809800}, {{0x1, 0x0, 0x0, 0x0, ...}, ...})
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/table.go:53 +0x119
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetImmutableTableByName(0xc002e9ce90?, {0x5f95168?, 0xc001b42720?}, 0xc003207400?, {0x5fc7720?, 0x9809800?}, {{0x1, 0x0, 0x0, 0x0, ...}, ...})
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/table.go:44 +0x4e
github.com/cockroachdb/cockroach/pkg/sql.MemberOfWithAdminOption({0x5f95168?, 0xc001b42720}, 0xc0013c0a00, {0x5fe4430?, 0xc0005a33b0}, 0xc0058690a8?, 0xc0028de2c0, {{0xc00180c040?, 0x0?}})
github.com/cockroachdb/cockroach/pkg/sql/authorization.go:462 +0x14c
github.com/cockroachdb/cockroach/pkg/sql.(*planner).MemberOfWithAdminOption(0x0?, {0x5f95168?, 0xc001b42720?}, {{0xc00180c040?, 0x0?}})
github.com/cockroachdb/cockroach/pkg/sql/authorization.go:433 +0x65
github.com/cockroachdb/cockroach/pkg/sql.(*planner).UserHasAdminRole(0x0?, {0x5f95168?, 0xc001b42720?}, {{0xc00180c040?, 0x0?}})
github.com/cockroachdb/cockroach/pkg/sql/authorization.go:393 +0x88
github.com/cockroachdb/cockroach/pkg/cloud/cloudprivilege.CheckDestinationPrivileges({0x5f95168, 0xc001b42720}, {0x6025240, 0xc001b1d0d0}, {0xc000529820, 0x1, 0x0?})
github.com/cockroachdb/cockroach/pkg/cloud/cloudprivilege/privileges.go:29 +0x94
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.showBackupPlanHook.func1({0x5f95168, 0xc0083a2db0}, {0x0?, 0x0?, 0xf0f00c00a3cbf68?}, 0x5f95168?)
github.com/cockroachdb/cockroach/pkg/ccl/backupccl/show.go:279 +0x30a
github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec.func1()
github.com/cockroachdb/cockroach/pkg/sql/planhook.go:184 +0xa4
created by github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec
github.com/cockroachdb/cockroach/pkg/sql/planhook.go:182 +0x1b2
Expected behavior
There should be no crash!
Additional data / screenshots
Log directory: https://upload.cockroachlabs.com/receive/?packageCode=M1KruQE98L65Cm1oy6GPshIXN29XnAdZeaRNvTreTgw#keycode=GgD909dEchER8yB1YbmTUiksTlrL24aaRknToo7MrXk
Environment:
Build Tag: v22.2.0-alpha.3-265-gedeefa9120
Build Time: 2022/09/20 07:10:37
Distribution: CCL
Platform: linux amd64 (x86_64-pc-linux-gnu)
Go Version: go1.19.1
C Compiler: gcc 6.5.0
Build Commit ID: edeefa9120e86739808ca43fc52d4cbac813fab3
Build Type: release
Additional context
This was discovered while running some performance tests on backup queries.
Jira issue: CRDB-19797
Thanks for reporting this issue, it's a very interesting bug.
I think the fundamental problem here is that:
- these SHOW BACKUP statements run in a
hookFnNode
plan node, because that's how CCL statements are handled, - their
startExec
methods spawn a goroutine to do the work - the work involves accessing the collection and doing stuff that's generally not thread-safe
- this doesn't have an impact most of the time because we usually don't deal with multiple CCL statements concurrently
- except here and what's still a mystery to me is why didn't this fail earlier!
In any case synchronizing the goroutines to make sure they run in a mutually-exclusive manner for each transaction seems to be the correct thing to do.
Okay, this one is good!
The root cause, in effect, is the introduction in 22.2 to check the privilege on the cloud URL here: https://github.com/cockroachdb/cockroach/blob/ef4656fc14155afd31a486b38f688c0dee08ca0b/pkg/ccl/backupccl/show.go#L279-L281
The problem is that this privilege check happens in the returned callback and not in the setup function. The returned callback is called in parallel with other parts of execution, and is not safe to concurrently resolve descriptors. Resolving descriptors needs to happen during the planning part of the plan hook.
Now, there's some complexity here in two dimensions:
- We do some lazy computation I don't fully understand to resolve the urls into strings and we do that in the async callback. We'll need to figure out how to get the URLs we want to authenticate before the callback gets run -- or we'll need to resolve the complete set of privileges earlier
- The privilege checking code is far too lazy and composed. I filed an issue about this in https://github.com/cockroachdb/cockroach/issues/88400. The problem is that we reach back around to the planner from the catalog layer in an unsatisfying way that hides access to a non-thread-safe resource. What we should do instead, as described in that issue, is explicitly resolve the ExternalConnectionPrivileges into a privilege descriptor. We should do that in the plan hook itself and not in the returned closure.
If we did that, then we'd be accessing the collection during planning time and not execution time, as it was intended. I don't have enough context above on how these inCol
and toFn
callbacks work, so we'll need some help from @cockroachdb/disaster-recovery (maybe @adityamaru or @dt) to figure out how to get our hands on URLs to authorize during planning and then @RichardJCai, your ongoing work should make it straightforward to lift this privilege checking.
cc @cockroachdb/disaster-recovery
figure out how to get our hands on URLs to authorize during planning
I believe this is impossible: URIs can appear in placeholders, so we don't have their values until eval?
It seems to me like if we lease the relevant descriptors (which are known and finite) then, if I understand correctly, we won't need to resolve anything new during execution.
How do we know the descriptors though? like, just lease all possible synthetic / privilege descriptors? Until we look at the URI, we don't know if it is a named sink, or a userfile? If it ends up userfile, we need to do priv checks on the table being written to, which can be any table name so it seems like we'd need to have leased every table?
The synthetic privileges aren't themselves descriptors. They are data in a table and there's a cache associated with that data. We bump the table descriptor version for the table when we write to it in order to achieve cache coherence. This is the same pattern followed also for users and user_roles. There's a constant number of descriptors to lease (2?)
But what if the uri is to a userfile location? Until we parse the host we don't even know what table name to resolve let along priv check
But what if the uri is to a userfile location? Until we parse the host we don't even know what table name to resolve let along priv check
We always have to resolve the same two tables system.privileges
and system.external_connections
But what if the uri is to a userfile location? Until we parse the host we don't even know what table name to resolve let along priv check
Okay, that's a problem. I think @dt is noting that the issue here is that we also need to resolve the userfile table. I don't really know how that happens. Fundamentally, we have an issue here. The planning layer is not safe for concurrent access and the catalog is part of the planning layer.
The way that the optimizer works here is that it stores all of the dependencies in its prepared plan and then it checks those dependencies and re-plans in the context of the planner. I think the good, or bad, news is that we already had the userfile bug and it's pretty deep, and people don't seem to use userfile much, let alone in the context of a UNION
in SHOW BACKUPS
. If they did, they'll hit this bug. I think we're better off starting with a partial fix for the common case and then coming back to evaluate this more general, but nevertheless severe problem.