7244/multi db connection improvements
This PR introduces the following changes (a configuration sketch follows after this description):
- Makes citus.max_shared_pool_size cluster-wide. Introduces a new citus.max_databases_per_worker_tracked GUC to adapt the shared_connection_stats to this change.
- Introduces new "maintenance" connections and a dedicated connection pool for them. The new pool is configured by the new GUC citus.max_maintenance_shared_pool_size, which is 5 by default.
~1. Makes transaction recovery and distributed deadlock detection respect citus.max_shared_pool_size~
~2. Adds a citus.shared_pool_size_maintenance_quota GUC to reserve a fraction of citus.max_shared_pool_size exclusively for maintenance operations. So, by default, regular and maintenance operations have isolated fractions of the pool and do not compete with each other~
~3. Adds the possibility to assign one database as dedicated to maintenance via the new citus.maintenance_management_database GUC. I'll integrate it later with #7254 if it is merged~
~4. Disables connection caching for all maintenance backends, except the one for citus.maintenance_management_database~
~5. Makes distributed deadlock detection be invoked only on citus.maintenance_management_database, when it is set.~
~When citus.maintenance_management_database is not set (default), all maintenance daemons perform distributed deadlock detection, but they never cache connections and respect the citus.shared_pool_size_maintenance_quota or citus.max_shared_pool_size.~
Logic related to citus.main_db will be introduced in a separate PR.
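For illustration, a minimal configuration sketch using the GUCs described above. The values are hypothetical examples, not recommendations, and whether a reload is enough or a restart is needed depends on each GUC's context (tracking per-database stats presumably sizes shared memory):

ALTER SYSTEM SET citus.max_shared_pool_size = 100;             -- now enforced cluster-wide, across databases
ALTER SYSTEM SET citus.max_databases_per_worker_tracked = 4;   -- example value; adapts shared_connection_stats to multi-db tracking
ALTER SYSTEM SET citus.max_maintenance_shared_pool_size = 5;   -- dedicated maintenance pool (5 is the default per the description)
SELECT pg_reload_conf();                                       -- a restart may be required for GUCs that size shared memory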
This is a draft implementation to check if I am moving in the right direction, so only the base scenario is covered by a test. I also haven't found an example of a distributed deadlock detection test, so I'd appreciate some advice on that, as well as on how to best address @onderkalaci's concern:
I think even the latter is non-trivial to implement; there are a bunch of areas that need to be thought through well, such as making sure that this is not broken: https://github.com/citusdata/citus/blob/main/src/backend/distributed/connection/locally_reserved_shared_connections.c or https://github.com/citusdata/citus/blob/main/src/backend/distributed/executor/adaptive_executor.c#L4707
cc @marcocitus
closes #7244
Makes transaction recovery and distributed deadlock detection respect citus.max_shared_pool_size
I think this is probably the more controversial / hazardous part of this PR. It is required for stability that max_shared_pool_size >= max_client_connections, otherwise not every client connection can get a slot. We generally recommend max_client_connections == max_shared_pool_size such that you will not make more outbound connections per worker than you receive in inbound client connections, which means there is no slack in max_shared_pool_size in most setups.
If we change the semantics of max_shared_pool_size, we might risk a lot of setups becoming unstable unless they set max_shared_pool_size >= max_client_connections + # reserved maintenance connections
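To make the risk concrete, a small sizing sketch with hypothetical numbers (not taken from this PR), assuming a setup that follows the recommended equality:

ALTER SYSTEM SET citus.max_shared_pool_size = 100;  -- == max_client_connections, so there is no slack
-- Under the new semantics, if e.g. 5 maintenance connections are drawn from this same pool,
-- only 95 slots remain for regular sessions, so some client connections could stall waiting
-- for a slot unless the pool is raised to 105 (max_client_connections + reserved maintenance).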
Adds the possibility to assign one database as dedicated to maintenance via the new citus.maintenance_management_database GUC. I'll integrate it later with https://github.com/citusdata/citus/pull/7254 if it is merged.
Disables connection caching for all maintenance backends, except the one for citus.maintenance_management_database.
Makes distributed deadlock detection be invoked only on citus.maintenance_management_database, when it is set.
Could this part be a separate PR? It probably doesn't require much discussion and #7254 is merged now.
If we change the semantics of max_shared_pool_size, we might risk a lot of setups becoming unstable unless they set max_shared_pool_size >= max_client_connections + # reserved maintenance connections
I see your point, but it is still a more predictable way to manage a cluster, I think. Currently, DBAs just have to keep in mind that there are some operations that require additional connections, so max_connections is typically slightly larger than max_shared_pool_size + max_client_connections. With this PR, the "slightly larger" part just becomes "official": configurable and isolated into its own part of max_shared_pool_size.
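As a rough illustration of that argument, with hypothetical numbers; citus.shared_pool_size_maintenance_quota is the quota GUC from the original description (later replaced by a separate pool), shown here only to visualize the proposal:

ALTER SYSTEM SET citus.max_shared_pool_size = 105;                 -- the former "slightly larger" headroom, now explicit
ALTER SYSTEM SET citus.shared_pool_size_maintenance_quota = 0.05;  -- roughly 5 slots reserved exclusively for maintenance
-- Regular sessions compete for the remaining ~100 slots, matching the previous informal sizing.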
And, of course, it is always possible to implement a backup feature flag :slightly_smiling_face:
Could this part be a separate PR?
I'd have to move those points as well, then:
- Disables connection caching for all maintenance backends, except the one for citus.maintenance_management_database
- Makes distributed deadlock detection be invoked only on citus.maintenance_management_database, when it is set.
If it is okay, then no problem :slightly_smiling_face:
I see your point, but it is still a more predictable way to manage a cluster,
Perhaps, if we were starting from scratch. In practice, due to a large user / customer base, and platforms that need to support a large number of version combinations, we try to avoid breaking backwards compatibility and/or introducing custom upgrade steps whenever possible.
An alternative option could be to have 2 separate pool settings, e.g. max_maintenance_pool_size & max_shared_pool_size. The former could default to some small number (e.g. 5), to keep the total number of outbound connections predictable.
If it is okay, then no problem 🙂
Yup
Sounds reasonable, and shouldn't be too hard to implement...
Ok, I'll be back with that and a new PR for main_db
then :slightly_smiling_face:
@marcoslot, done, now it is a separate pool :slightly_smiling_face:
@onderkalaci, kind reminder about this PR :slightly_smiling_face: I noticed that @marcoslot is at Crunchy Data now, so I wanted to make sure that the PR won't be abandoned due to organizational changes.
Well, that's awkward... @JelteF, maybe you could help :) ?
I'll try to take a look at this soon, thank you for your work on this.
Could you run citus_indent
on your PR? It's hard for me to see what code is actually changed because some tabs have been changed to spaces. Instructions to run it can be found here: https://github.com/citusdata/citus/blob/main/CONTRIBUTING.md#following-our-coding-conventions
FYI I pushed a commit to your branch to reindent it.
@JelteF Great, thanks! I'll address your comments shortly. The test failures don't seem to be caused by my changes, right? But the branch should be synced with main anyway, maybe it will help :man_shrugging:
I'm also concerned about the test coverage: as I mentioned, I covered only the basic scenario. Maybe there are some use cases that should also be covered? Especially the distributed deadlock detection, which I am not sure how to test with the existing tool set.
The build error on PG16 seems to be because you included #include "distributed/connection_management.h" after pg_version_compat.h in maintenanced.c. (This is arguably a bug in how we define has_createdb_privilege in pg_version_compat.h, but it's only being triggered because of your changes.)
@JelteF, fixed pg16 build and addressed the review.
The #define have_createdb_privilege() have_createdb_privilege() in pg_version_compat.h seems to be unnecessary, and it breaks the build if included before commands/dbcommands.h. It seems to work fine without it.
I also experienced some flakiness with multi_extension; is this expected?
Codecov Report
Attention: Patch coverage is 90.58296% with 21 lines in your changes missing coverage. Please review.
Project coverage is 81.44%. Comparing base (2874d7a) to head (a227bcc). Report is 14 commits behind head on main.
:exclamation: Current head a227bcc differs from pull request most recent head bf2572a. Please upload reports for the commit bf2572a to get more accurate results.
:exclamation: There is a different number of reports uploaded between BASE (2874d7a) and HEAD (a227bcc): HEAD has 98 fewer uploads than BASE.
Additional details and impacted files
@@ Coverage Diff @@
## main #7286 +/- ##
==========================================
- Coverage 89.66% 81.44% -8.22%
==========================================
Files 283 282 -1
Lines 60525 59424 -1101
Branches 7540 7360 -180
==========================================
- Hits 54268 48399 -5869
- Misses 4102 8297 +4195
- Partials 2155 2728 +573
Tried to fix the check_gucs_are_alphabetically_sorted check and got a bit confused: it seems to be failing on citus.enable_repartitioned_insert_select even on main, on my machine. But it somehow passes on your CI :thinking:
ivan@ivan-pc:~/IdeaProjects/citus-fork$ ./ci/check_gucs_are_alphabetically_sorted.sh
sort: gucs.out:40: disorder: "citus.enable_repartitioned_insert_select",
I haven't looked at it in more detail yet, but at least the flaky test detection is failing because this test randomly fails:
+++ /__w/citus/citus/src/test/regress/results/multi_maintenance_multiple_databases.out.modified 2024-02-02 13:40:41.496394020 +0000
@@ -277,21 +277,21 @@
(SELECT setting::int FROM pg_settings WHERE name = 'port')),
$statement$
SELECT groupid, gid
FROM pg_dist_transaction
WHERE gid LIKE 'citus_0_1234_4_0_%'
OR gid LIKE 'citus_0_should_be_forgotten_%'
$statement$) AS t(groupid integer, gid text)
WHERE datname LIKE 'db%';
pg_dist_transaction_after_recovery_coordinator_test
-----------------------------------------------------
- t
+ f
(1 row)
SELECT count(*) = 0 AS cached_connections_after_recovery_coordinator_test
FROM pg_stat_activity
WHERE state = 'idle'
AND now() - backend_start > '5 seconds'::interval;
cached_connections_after_recovery_coordinator_test
----------------------------------------------------
@JelteF, is there a way I can download the results of the failed tests?
Tried to reproduce it locally, but to no avail. My tests take twice as long, though, so my theory is that it is only reproducible on weaker machines.
Yes if you go to the overview and scroll to the bottom you will see all the artifacts produced and also the diff of the failures inline: https://github.com/citusdata/citus/actions/runs/7812230917/attempts/2#summary-21309102210
The flaky tests you might be able to reproduce locally by running the test in question multiple times (this runs it 200 times):
src/test/regress/citus_tests/run_test.py multi_maintenance_multiple_databases --repeat 200
Yep, that's what I tried :) It seems it requires something more sophisticated...
I see this test also failed; normally I would attribute that to pre-existing flakiness. But since it's a background worker, I'm wondering if it's related to your changes.
diff -dU10 -w /__w/citus/citus/src/test/regress/expected/background_task_queue_monitor.out /__w/citus/citus/src/test/regress/results/background_task_queue_monitor.out
--- /__w/citus/citus/src/test/regress/expected/background_task_queue_monitor.out.modified 2024-02-07 09:06:34.361655662 +0000
+++ /__w/citus/citus/src/test/regress/results/background_task_queue_monitor.out.modified 2024-02-07 09:06:34.373655599 +0000
@@ -488,25 +488,23 @@
t
(1 row)
CALL citus_wait_task_until_retried(:task_id1); -- shows that we increased retry count for task after failure
CALL citus_wait_task_until_retried(:task_id2); -- shows that we increased retry count for task after failure
SELECT task_id, status, retry_count, message FROM pg_dist_background_task
WHERE task_id IN (:task_id1, :task_id2)
ORDER BY task_id; -- show that all tasks are runnable by retry policy after termination signal
task_id | status | retry_count | message
---------+--------+-------------+----------
- 1450019 | runnable | 1 | FATAL: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx)" due to administrator command+
- | | | CONTEXT: Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx) +
+ 1450019 | done | | SELECT 1+
| | |
- 1450020 | runnable | 1 | FATAL: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx)" due to administrator command+
- | | | CONTEXT: Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx) +
+ 1450020 | done | | SELECT 1+
| | |
(2 rows)
SELECT citus_job_cancel(:job_id1);
citus_job_cancel
------------------
(1 row)
SELECT citus_job_cancel(:job_id2);
@@ -525,22 +523,22 @@
citus_job_wait
----------------
(1 row)
SELECT task_id, status FROM pg_dist_background_task
WHERE task_id IN (:task_id1, :task_id2)
ORDER BY task_id; -- show that all tasks are cancelled
task_id | status
---------+--------
- 1450019 | cancelled
- 1450020 | cancelled
+ 1450019 | done
+ 1450020 | done
(2 rows)
-- TEST9
-- verify that upon cancellation signal, all tasks are cancelled
BEGIN;
INSERT INTO pg_dist_background_job (job_type, description) VALUES ('test_job', 'simple test to verify cancellation on monitor') RETURNING job_id AS job_id1 \gset
INSERT INTO pg_dist_background_task (job_id, command) VALUES (:job_id1, $job$ SELECT pg_sleep(5); $job$) RETURNING task_id AS task_id1 \gset
INSERT INTO pg_dist_background_job (job_type, description) VALUES ('test_job', 'simple test to verify cancellation on monitor') RETURNING job_id AS job_id2 \gset
INSERT INTO pg_dist_background_task (job_id, command) VALUES (:job_id2, $job$ SELECT pg_sleep(5); $job$) RETURNING task_id AS task_id2 \gset
COMMIT;
The logs from the artifact from that session are these:
2024-02-07 09:05:36.785 UTC [1199] LOG: 00000: found scheduled background tasks, starting new background task queue monitor
2024-02-07 09:05:36.785 UTC [1199] CONTEXT: Citus maintenance daemon for database 16384 user 10
2024-02-07 09:05:36.785 UTC [1199] LOCATION: CitusMaintenanceDaemonMain, maintenanced.c:907
2024-02-07 09:05:36.788 UTC [17878] LOG: 00000: task jobid/taskid started: 1450011/1450019
2024-02-07 09:05:36.788 UTC [17878] CONTEXT: Citus Background Task Queue Monitor: regression
2024-02-07 09:05:36.788 UTC [17878] LOCATION: AssignRunnableTaskToNewExecutor, background_jobs.c:603
2024-02-07 09:05:36.789 UTC [17878] LOG: 00000: task jobid/taskid started: 1450012/1450020
2024-02-07 09:05:36.789 UTC [17878] CONTEXT: Citus Background Task Queue Monitor: regression
2024-02-07 09:05:36.789 UTC [17878] LOCATION: AssignRunnableTaskToNewExecutor, background_jobs.c:603
2024-02-07 09:05:37.773 UTC [17878] LOG: 00000: handling termination signal
2024-02-07 09:05:37.773 UTC [17878] CONTEXT: Citus Background Task Queue Monitor: regression
2024-02-07 09:05:37.773 UTC [17878] LOCATION: CitusBackgroundTaskQueueMonitorMain, background_jobs.c:1232
2024-02-07 09:05:37.773 UTC [17879] FATAL: 57P01: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (1450011/1450019)" due to administrator command
2024-02-07 09:05:37.773 UTC [17879] CONTEXT: Citus Background Task Queue Executor: regression/postgres for (1450011/1450019)
2024-02-07 09:05:37.773 UTC [17879] LOCATION: ProcessInterrupts, postgres.c:3199
2024-02-07 09:05:37.773 UTC [17880] FATAL: 57P01: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (1450012/1450020)" due to administrator command
2024-02-07 09:05:37.773 UTC [17880] CONTEXT: Citus Background Task Queue Executor: regression/postgres for (1450012/1450020)
2024-02-07 09:05:37.773 UTC [17880] LOCATION: ProcessInterrupts, postgres.c:3199
2024-02-07 09:05:37.773 UTC [17878] LOG: 00000: task jobid/taskid failed: 1450012/1450020
2024-02-07 09:05:37.773 UTC [17878] CONTEXT: Citus Background Task Queue Monitor: regression
2024-02-07 09:05:37.773 UTC [17878] LOCATION: ConsumeExecutorQueue, background_jobs.c:770
2024-02-07 09:05:37.779 UTC [1114] LOG: 00000: background worker "Citus Background Task Queue Executor: regression/postgres for (1450011/1450019)" (PID 17879) exited with exit code 1
2024-02-07 09:05:37.779 UTC [1114] LOCATION: LogChildExit, postmaster.c:3740
2024-02-07 09:05:37.779 UTC [1114] LOG: 00000: background worker "Citus Background Task Queue Executor: regression/postgres for (1450012/1450020)" (PID 17880) exited with exit code 1
2024-02-07 09:05:37.779 UTC [1114] LOCATION: LogChildExit, postmaster.c:3740
2024-02-07 09:05:37.779 UTC [17878] ERROR: 54000: invalid message size 21463928358196037 in shared memory queue
2024-02-07 09:05:37.779 UTC [17878] CONTEXT: Citus Background Task Queue Monitor: regression
2024-02-07 09:05:37.779 UTC [17878] LOCATION: shm_mq_receive, shm_mq.c:692
The invalid message size 21463928358196037 in shared memory queue seems suspect, but maybe unrelated. Or a side effect of the termination, not the cause.
@JelteF
I see this test also failed
Tried to reproduce it locally, but to no avail. I don't think it is related, since my changes are mostly concentrated in connection_management.c, so it is a pretty long reach to background tasks. And even then, if I interpret it correctly, the issue is that the job is done before it is terminated, so it is more likely that the problem is with citus_wait_task_until_retried.
@JelteF, flakiness seems to be dealt with :slightly_smiling_face:
@JelteF, reminder about this PR, just in case :slightly_smiling_face:
I'll try to make some time soon to do a full review of this; I think it's pretty close to being mergeable.
For now, could you look at the check-multi tests? They are timing out in CI.
I think it's pretty close to being mergeable.
I still would like to add more test coverage, especially for deadlock detection, but I'm not sure how to do this with the existing toolkit. The scenario I have added covers only the basic flow.
but I'm not sure how to do this with the existing toolkit.
I think the python test framework that we have would be a good fit for doing that. Readme and examples can be found here: https://github.com/citusdata/citus/tree/main/src/test/regress/citus_tests/test
For now, could you look at the check-multi tests? They are timing out in CI.
Not sure when I'll be able to get back to this, so I'm just leaving a note here: the tests are timing out because of the low citus.max_maintenance_shared_pool_size: some operations require more than one connection to complete, especially with the new MainDb logic, which requires an extra maintenance connection for a SELECT citus_internal.commit_management_command_2pc() call.
The proper fix would be to implement some API to reserve the necessary number of connections from the maintenance pool and use it for all maintenance operations, but that doesn't seem feasible. Instead, it is better to introduce a simple timeout for acquiring a connection from the maintenance pool, with a meaningful error message: this way a user will be able to see that citus.max_maintenance_shared_pool_size is too small and increase it accordingly. It would also be appropriate to disable the maintenance pool limit by default: issues with maintenance connections are more likely to arise with a considerable number of databases; when there are only a few, the limit may not be necessary.
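A sketch of how an operator could react to such a timeout, assuming only the GUC name from this PR; the timeout itself, its error message, and any value that disables the limit are proposals here, not final behavior, and the example value is arbitrary:

SHOW citus.max_maintenance_shared_pool_size;                   -- inspect the current maintenance pool limit
ALTER SYSTEM SET citus.max_maintenance_shared_pool_size = 25;  -- raise the limit if maintenance operations time out waiting for a slot
SELECT pg_reload_conf();                                       -- depending on the GUC's context, a reload may or may not be enough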
The timeout is basically implemented, along with a fix for the condition variables of the new pool, which I had missed before, but there are still issues related to #7203, and the test coverage should be improved.