
7244/multi db connection improvements

Open ivyazmitinov opened this issue 1 year ago • 32 comments

This PR introduces the following changes:

  1. Makes citus.max_shared_pool_size cluster-wide. Introduces a new citus.max_databases_per_worker_tracked GUC to adapt shared_connection_stats to this change.
  2. Introduces new "maintenance" connections and a dedicated connection pool for them. The new pool is configured by the new GUC citus.max_maintenance_shared_pool_size, which defaults to 5.

  • ~Makes transaction recovery and distributed deadlock detection respect citus.max_shared_pool_size~
  • ~Adds a citus.shared_pool_size_maintenance_quota GUC to reserve a fraction of citus.max_shared_pool_size exclusively for maintenance operations. So, by default, regular and maintenance operations have isolated fractions of the pool and do not compete with each other~
  • ~Adds the possibility to assign one database as dedicated to maintenance via a new citus.maintenance_management_database GUC. I'll integrate it later with #7254 if it is merged~
  • ~Disables connection caching for all maintenance backends, except the one for citus.maintenance_management_database~
  • ~Makes distributed deadlock detection be invoked only on citus.maintenance_management_database, when it is set.~

~When citus.maintenance_management_database is not set (the default), all maintenance daemons perform distributed deadlock detection, but they never cache connections and respect citus.shared_pool_size_maintenance_quota or citus.max_shared_pool_size.~
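
For reference, a minimal postgresql.conf sketch of the GUCs this PR introduces or changes; the values are purely illustrative and the comments only reflect the description above:

    # illustrative values only -- not recommendations
    citus.max_shared_pool_size = 100               # now enforced across all databases on a worker (cluster-wide)
    citus.max_databases_per_worker_tracked = 25    # sizing hint for per-database shared_connection_stats tracking (assumed semantics)
    citus.max_maintenance_shared_pool_size = 5     # dedicated pool for maintenance connections, defaults to 5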

Logic related to citus.main_db will be introduced in a separate PR.

This is a draft implementation to check whether I am moving in the right direction, so only the base scenario is covered by a test. I also haven't found an example of a distributed deadlock detection test, so I'd appreciate some advice on that, as well as on how best to address @onderkalaci's concern:

I think even the latter is non-trivial to implement; there are a bunch of areas that need to be thought through well, such as making sure that this is not broken: https://github.com/citusdata/citus/blob/main/src/backend/distributed/connection/locally_reserved_shared_connections.c or https://github.com/citusdata/citus/blob/main/src/backend/distributed/executor/adaptive_executor.c#L4707

cc @marcocitus

closes #7244

ivyazmitinov avatar Oct 27 '23 15:10 ivyazmitinov

Makes transaction recovery and distributed deadlock detection respect citus.max_shared_pool_size

I think this is probably the more controversial / hazardous part of this PR. It is required for stability that max_shared_pool_size >= max_client_connections, otherwise not every client connection can get a slot. We generally recommend max_client_connections == max_shared_pool_size, such that you will not make more outbound connections per worker than you receive in inbound client connections, which means there is no slack in max_shared_pool_size in most setups.

If we change the semantics of max_shared_pool_size, we might risk a lot of setups becoming unstable unless they set max_shared_pool_size >= max_client_connections + # reserved maintenance connections
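
To make the concern concrete, a hypothetical sizing using the names from this comment (numbers are invented):

    # old semantics: every inbound client connection can get an outbound slot
    max_client_connections = 300
    max_shared_pool_size   = 300     # == max_client_connections, no slack needed

    # proposed semantics (maintenance draws from the same pool): the same sizing
    # can starve clients, because e.g. 5 reserved maintenance connections leave
    # only 295 slots for 300 client-driven connections; the setup would have to
    # become max_shared_pool_size >= 300 + 5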

  • Adds the possibility to assign one database as dedicated to maintenance via a new citus.maintenance_management_database GUC. I'll integrate it later with https://github.com/citusdata/citus/pull/7254 if it is merged
  • Disables connection caching for all maintenance backends, except the one for citus.maintenance_management_database
  • Makes distributed deadlock detection be invoked only on citus.maintenance_management_database, when it is set.

Could this part be a separate PR? It probably doesn't require much discussion and #7254 is merged now.

marcoslot avatar Nov 01 '23 09:11 marcoslot

If we change the semantics of max_shared_pool_size, we might risk a lot of setups becoming unstable unless they set max_shared_pool_size >= max_client_connections + # reserved maintenance connections

I see your point, but it is still a more predictable way to manage a cluster, I think. Currently, DBAs just have to keep in mind that there are some operations that require additional connections, so max_connections is typically set slightly larger than max_shared_pool_size + max_client_connections. With this PR, the "slightly larger" part just becomes "official": configurable and isolated into its own part of max_shared_pool_size. And, of course, it is always possible to implement a feature flag as a fallback :slightly_smiling_face:

Could this part be a separate PR?

I'd have to move those points as well, then:

  • Disables connection caching for all maintenance backends, except the one for citus.maintenance_management_database
  • Makes distributed deadlock detection be invoked only on citus.maintenance_management_database, when it is set.

If it is okay, then no problem :slightly_smiling_face:

ivyazmitinov avatar Nov 01 '23 09:11 ivyazmitinov

I see your point, but it is still a more predictable way to manage a cluster,

Perhaps, if we were starting from scratch. In practice, due to a large user/customer base and platforms that need to support a large number of version combinations, we try to avoid breaking backwards compatibility and/or introducing custom upgrade steps whenever possible.

An alternative option could be to have 2 separate pool settings, e.g. max_maintenance_pool_size & max_shared_pool_size. The former could default to some small number (e.g. 5), to keep the total number of outbound connections predictable.
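
A sketch of that alternative, using the setting names suggested here (the description at the top of the PR ended up calling the second one citus.max_maintenance_shared_pool_size; values are illustrative):

    max_shared_pool_size      = 300   # semantics unchanged for regular, client-driven connections
    max_maintenance_pool_size = 5     # small dedicated pool for maintenance connections
    # outbound connections stay bounded by 300 + 5, so the existing sizing advice
    # (max_shared_pool_size == max_client_connections) keeps working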

If it is okay, then no problem 🙂

Yup

marcoslot avatar Nov 03 '23 22:11 marcoslot

Sounds reasonable, and shouldn't be too hard to implement... Ok, I'll be back with that and a new PR for main_db then :slightly_smiling_face:

ivyazmitinov avatar Nov 08 '23 08:11 ivyazmitinov

@marcoslot, done, now it is a separate pool :slightly_smiling_face:

ivyazmitinov avatar Nov 30 '23 12:11 ivyazmitinov

@onderkalaci, kind reminder about this PR :slightly_smiling_face: I noticed that @marcoslot is at Crunchy Data now, so I wanted to make sure that the PR won't be abandoned due to organizational changes

ivyazmitinov avatar Jan 09 '24 11:01 ivyazmitinov

Well, that's awkward... @JelteF, maybe you could help :) ?

ivyazmitinov avatar Jan 11 '24 09:01 ivyazmitinov

I'll try to take a look at this soon, thank you for your work on this.

JelteF avatar Jan 11 '24 09:01 JelteF

Could you run citus_indent on your PR? It's hard for me to see what code is actually changed because some tabs have been changed to spaces. Instructions to run it can be found here: https://github.com/citusdata/citus/blob/main/CONTRIBUTING.md#following-our-coding-conventions
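
For reference, once the tooling described in the linked CONTRIBUTING.md is installed, the reindent step itself is roughly the following (a sketch only; the linked document is authoritative):

    # from the root of the citus checkout
    citus_indent
    git diff    # review the whitespace-only changes before committing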

JelteF avatar Jan 18 '24 10:01 JelteF

FYI I pushed a commit to your branch to reindent it.

JelteF avatar Jan 18 '24 12:01 JelteF

@JelteF Great, thanks! I'll address your comments shortly. The test failures don't seem to be caused by my changes, right? But the branch should be synced with main anyway, maybe it will help :man_shrugging:

I am also concerned about the test coverage: as I mentioned, I covered only the basic scenario. Maybe there are some use cases that should also be covered? Especially distributed deadlock detection, which I am not sure how to test with the existing tool set.

ivyazmitinov avatar Jan 30 '24 10:01 ivyazmitinov

The build error on PG16 seems to be because you included "distributed/connection_management.h" after pg_version_compat.h in maintenanced.c (this is arguably a bug in how we define has_createdb_privilege in pg_version_compat.h, but it's only being triggered because of your changes).

JelteF avatar Jan 30 '24 13:01 JelteF

@JelteF, fixed the PG16 build and addressed the review.

The #define have_createdb_privilege() have_createdb_privilege() in pg_version_compat.h seems to be unnecessary, and it breaks the build when that header is included before commands/dbcommands.h. Everything seems to work fine without it.

ivyazmitinov avatar Feb 01 '24 14:02 ivyazmitinov

I also experienced some flakiness with multi_extension; is this expected?

ivyazmitinov avatar Feb 01 '24 14:02 ivyazmitinov

Codecov Report

Attention: Patch coverage is 90.58296% with 21 lines in your changes missing coverage. Please review.

Project coverage is 81.44%. Comparing base (2874d7a) to head (a227bcc). Report is 14 commits behind head on main.

:exclamation: Current head a227bcc differs from pull request most recent head bf2572a

Please upload reports for the commit bf2572a to get more accurate results.

:exclamation: There is a different number of reports uploaded between BASE (2874d7a) and HEAD (a227bcc).

HEAD has 98 uploads less than BASE
Flag BASE (2874d7a) HEAD (a227bcc)
15_regress_check-pytest 2 0
15_regress_check-enterprise-isolation-logicalrep-3 2 0
16_regress_check-enterprise-failure 2 0
14_regress_check-vanilla 2 0
15_regress_check-vanilla 2 0
15_regress_check-multi-mx 2 0
15_regress_check-enterprise-isolation-logicalrep-2 2 0
14_regress_check-operations 2 0
16_regress_check-operations 2 0
16_regress_check-columnar 2 0
14_regress_check-split 2 0
15_regress_check-columnar 2 0
16_regress_check-enterprise 4 0
_upgrade 21 0
14_16_upgrade 1 0
15_regress_check-columnar-isolation 1 0
14_regress_check-follower-cluster 2 0
16_regress_check-query-generator 1 0
15_regress_check-follower-cluster 1 0
15_regress_check-enterprise-isolation-logicalrep-1 1 0
14_regress_check-query-generator 1 0
16_regress_check-multi-mx 1 0
14_regress_check-enterprise-isolation-logicalrep-2 1 0
16_regress_check-failure 1 0
16_regress_check-columnar-isolation 1 0
16_regress_check-enterprise-isolation 1 0
15_16_upgrade 1 0
15_regress_check-enterprise-isolation 1 0
14_regress_check-enterprise-isolation 1 0
16_regress_check-isolation 1 0
14_regress_check-isolation 1 0
16_regress_check-pytest 1 0
14_regress_check-pytest 1 0
16_regress_check-multi 1 0
15_regress_check-query-generator 1 0
15_regress_check-enterprise-failure 1 0
14_regress_check-enterprise-failure 1 0
15_regress_check-enterprise 1 0
14_regress_check-multi 1 0
15_regress_check-isolation 1 0
15_regress_check-multi-1 1 0
14_regress_check-multi-1 1 0
14_regress_check-failure 1 0
15_regress_check-operations 1 0
14_regress_check-columnar-isolation 1 0
15_regress_check-split 1 0
16_regress_check-follower-cluster 1 0
16_regress_check-enterprise-isolation-logicalrep-2 1 0
14_regress_check-columnar 1 0
16_regress_check-split 1 0
16_regress_check-enterprise-isolation-logicalrep-3 1 0
14_regress_check-enterprise 1 0
16_regress_check-vanilla 1 0
16_cdc_installcheck 1 0
14_15_upgrade 1 0
15_cdc_installcheck 1 0
16_regress_check-enterprise-isolation-logicalrep-1 1 0
14_regress_check-multi-mx 1 0
14_regress_check-enterprise-isolation-logicalrep-3 1 0
15_regress_check-failure 1 0
14_regress_check-enterprise-isolation-logicalrep-1 1 0
15_regress_check-multi 1 0
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7286      +/-   ##
==========================================
- Coverage   89.66%   81.44%   -8.22%     
==========================================
  Files         283      282       -1     
  Lines       60525    59424    -1101     
  Branches     7540     7360     -180     
==========================================
- Hits        54268    48399    -5869     
- Misses       4102     8297    +4195     
- Partials     2155     2728     +573     

codecov[bot] avatar Feb 01 '24 15:02 codecov[bot]

Tried to fix the check_gucs_are_alphabetically_sorted check and got a bit confused: it seems to be failing on citus.enable_repartitioned_insert_select even on main, on my machine. But it somehow passes on your CI :thinking:

ivan@ivan-pc:~/IdeaProjects/citus-fork$ ./ci/check_gucs_are_alphabetically_sorted.sh 
sort: gucs.out:40: disorder:            "citus.enable_repartitioned_insert_select",

ivyazmitinov avatar Feb 02 '24 09:02 ivyazmitinov

I haven't looked at it in more detail yet, but at least the flaky test detection is failing because this test randomly fails:

+++ /__w/citus/citus/src/test/regress/results/multi_maintenance_multiple_databases.out.modified	2024-02-02 13:40:41.496394020 +0000
@@ -277,21 +277,21 @@
                    (SELECT setting::int FROM pg_settings WHERE name = 'port')),
             $statement$
             SELECT groupid, gid
             FROM pg_dist_transaction
             WHERE gid LIKE 'citus_0_1234_4_0_%'
                 OR gid LIKE 'citus_0_should_be_forgotten_%'
             $statement$) AS t(groupid integer, gid text)
 WHERE datname LIKE 'db%';
  pg_dist_transaction_after_recovery_coordinator_test 
 -----------------------------------------------------
- t
+ f
 (1 row)
 
 SELECT count(*) = 0 AS cached_connections_after_recovery_coordinator_test
 FROM pg_stat_activity
 WHERE state = 'idle'
   AND now() - backend_start > '5 seconds'::interval;
  cached_connections_after_recovery_coordinator_test 
 ----------------------------------------------------

JelteF avatar Feb 06 '24 14:02 JelteF

@JelteF, is there a way I can download the results of failed tests?

Tried to reproduce locally, but no luck. My tests take twice as long, though, so my theory is that it is reproducible on weaker machines.

ivyazmitinov avatar Feb 07 '24 15:02 ivyazmitinov

Yes, if you go to the overview and scroll to the bottom, you will see all the artifacts produced, and also the diff of the failures inline: https://github.com/citusdata/citus/actions/runs/7812230917/attempts/2#summary-21309102210

JelteF avatar Feb 07 '24 15:02 JelteF

You might be able to reproduce the flaky tests locally by running the test in question multiple times (this runs it 200 times):

src/test/regress/citus_tests/run_test.py multi_maintenance_multiple_databases  --repeat 200

JelteF avatar Feb 07 '24 15:02 JelteF

src/test/regress/citus_tests/run_test.py multi_maintenance_multiple_databases --repeat 200

Yep, that's what I tried :) It seems it requires something more sophisticated...

ivyazmitinov avatar Feb 07 '24 15:02 ivyazmitinov

I see this test also failed; normally I would attribute that to pre-existing flakiness, but since it involves a background worker, I'm wondering if it's related to your changes.

diff -dU10 -w /__w/citus/citus/src/test/regress/expected/background_task_queue_monitor.out /__w/citus/citus/src/test/regress/results/background_task_queue_monitor.out
--- /__w/citus/citus/src/test/regress/expected/background_task_queue_monitor.out.modified	2024-02-07 09:06:34.361655662 +0000
+++ /__w/citus/citus/src/test/regress/results/background_task_queue_monitor.out.modified	2024-02-07 09:06:34.373655599 +0000
@@ -488,25 +488,23 @@
  t
 (1 row)
 
 CALL citus_wait_task_until_retried(:task_id1); -- shows that we increased retry count for task after failure
 CALL citus_wait_task_until_retried(:task_id2); -- shows that we increased retry count for task after failure
 SELECT task_id, status, retry_count, message FROM pg_dist_background_task
     WHERE task_id IN (:task_id1, :task_id2)
     ORDER BY task_id; -- show that all tasks are runnable by retry policy after termination signal
  task_id | status | retry_count | message  
 ---------+--------+-------------+----------
- 1450019 | runnable |           1 | FATAL: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx)" due to administrator command+
-         |          |             | CONTEXT: Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx)                                                           +
+ 1450019 | done   |             | SELECT 1+
          |        |             | 
- 1450020 | runnable |           1 | FATAL: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx)" due to administrator command+
-         |          |             | CONTEXT: Citus Background Task Queue Executor: regression/postgres for (xxxxx/xxxxx)                                                           +
+ 1450020 | done   |             | SELECT 1+
          |        |             | 
 (2 rows)
 
 SELECT citus_job_cancel(:job_id1);
  citus_job_cancel 
 ------------------
  
 (1 row)
 
 SELECT citus_job_cancel(:job_id2);
@@ -525,22 +523,22 @@
  citus_job_wait 
 ----------------
  
 (1 row)
 
 SELECT task_id, status FROM pg_dist_background_task
     WHERE task_id IN (:task_id1, :task_id2)
     ORDER BY task_id; -- show that all tasks are cancelled
  task_id | status 
 ---------+--------
- 1450019 | cancelled
- 1450020 | cancelled
+ 1450019 | done
+ 1450020 | done
 (2 rows)
 
 -- TEST9
 -- verify that upon cancellation signal, all tasks are cancelled
 BEGIN;
 INSERT INTO pg_dist_background_job (job_type, description) VALUES ('test_job', 'simple test to verify cancellation on monitor') RETURNING job_id AS job_id1 \gset
 INSERT INTO pg_dist_background_task (job_id, command) VALUES (:job_id1, $job$ SELECT pg_sleep(5); $job$) RETURNING task_id AS task_id1 \gset
 INSERT INTO pg_dist_background_job (job_type, description) VALUES ('test_job', 'simple test to verify cancellation on monitor') RETURNING job_id AS job_id2 \gset
 INSERT INTO pg_dist_background_task (job_id, command) VALUES (:job_id2, $job$ SELECT pg_sleep(5); $job$) RETURNING task_id AS task_id2 \gset
 COMMIT;

The logs from the artifact from that session are these:

2024-02-07 09:05:36.785 UTC [1199] LOG:  00000: found scheduled background tasks, starting new background task queue monitor
2024-02-07 09:05:36.785 UTC [1199] CONTEXT:  Citus maintenance daemon for database 16384 user 10
2024-02-07 09:05:36.785 UTC [1199] LOCATION:  CitusMaintenanceDaemonMain, maintenanced.c:907
2024-02-07 09:05:36.788 UTC [17878] LOG:  00000: task jobid/taskid started: 1450011/1450019
2024-02-07 09:05:36.788 UTC [17878] CONTEXT:  Citus Background Task Queue Monitor: regression
2024-02-07 09:05:36.788 UTC [17878] LOCATION:  AssignRunnableTaskToNewExecutor, background_jobs.c:603
2024-02-07 09:05:36.789 UTC [17878] LOG:  00000: task jobid/taskid started: 1450012/1450020
2024-02-07 09:05:36.789 UTC [17878] CONTEXT:  Citus Background Task Queue Monitor: regression
2024-02-07 09:05:36.789 UTC [17878] LOCATION:  AssignRunnableTaskToNewExecutor, background_jobs.c:603
2024-02-07 09:05:37.773 UTC [17878] LOG:  00000: handling termination signal
2024-02-07 09:05:37.773 UTC [17878] CONTEXT:  Citus Background Task Queue Monitor: regression
2024-02-07 09:05:37.773 UTC [17878] LOCATION:  CitusBackgroundTaskQueueMonitorMain, background_jobs.c:1232
2024-02-07 09:05:37.773 UTC [17879] FATAL:  57P01: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (1450011/1450019)" due to administrator command
2024-02-07 09:05:37.773 UTC [17879] CONTEXT:  Citus Background Task Queue Executor: regression/postgres for (1450011/1450019)
2024-02-07 09:05:37.773 UTC [17879] LOCATION:  ProcessInterrupts, postgres.c:3199
2024-02-07 09:05:37.773 UTC [17880] FATAL:  57P01: terminating background worker "Citus Background Task Queue Executor: regression/postgres for (1450012/1450020)" due to administrator command
2024-02-07 09:05:37.773 UTC [17880] CONTEXT:  Citus Background Task Queue Executor: regression/postgres for (1450012/1450020)
2024-02-07 09:05:37.773 UTC [17880] LOCATION:  ProcessInterrupts, postgres.c:3199
2024-02-07 09:05:37.773 UTC [17878] LOG:  00000: task jobid/taskid failed: 1450012/1450020
2024-02-07 09:05:37.773 UTC [17878] CONTEXT:  Citus Background Task Queue Monitor: regression
2024-02-07 09:05:37.773 UTC [17878] LOCATION:  ConsumeExecutorQueue, background_jobs.c:770
2024-02-07 09:05:37.779 UTC [1114] LOG:  00000: background worker "Citus Background Task Queue Executor: regression/postgres for (1450011/1450019)" (PID 17879) exited with exit code 1
2024-02-07 09:05:37.779 UTC [1114] LOCATION:  LogChildExit, postmaster.c:3740
2024-02-07 09:05:37.779 UTC [1114] LOG:  00000: background worker "Citus Background Task Queue Executor: regression/postgres for (1450012/1450020)" (PID 17880) exited with exit code 1
2024-02-07 09:05:37.779 UTC [1114] LOCATION:  LogChildExit, postmaster.c:3740
2024-02-07 09:05:37.779 UTC [17878] ERROR:  54000: invalid message size 21463928358196037 in shared memory queue
2024-02-07 09:05:37.779 UTC [17878] CONTEXT:  Citus Background Task Queue Monitor: regression
2024-02-07 09:05:37.779 UTC [17878] LOCATION:  shm_mq_receive, shm_mq.c:692

The invalid message size 21463928358196037 in shared memory queue seems suspect, but maybe unrelated. Or a side effect of the termination, not the cause.

JelteF avatar Feb 07 '24 17:02 JelteF

@JelteF

I see this test also failed

Tried to reproduce locally, but no luck. I don't think it is related, since my changes are mostly concentrated in connection_management.c, so it is a pretty long reach to background tasks. And even then, if I interpret it correctly, the issue is that the job is done before it is terminated, so it is more likely that the problem is with citus_wait_task_until_retried.

ivyazmitinov avatar Feb 12 '24 11:02 ivyazmitinov

@JelteF, flakiness seems to be dealt with :slightly_smiling_face:

ivyazmitinov avatar Feb 12 '24 13:02 ivyazmitinov

@JelteF, reminder about this PR, just in case :slightly_smiling_face:

ivyazmitinov avatar Mar 21 '24 08:03 ivyazmitinov

I'll try to make some time soon to do a full review of this, I think it's pretty close to being mergeable.

For now, could you look at the check-multi tests? They are timing out in CI.

JelteF avatar Mar 21 '24 11:03 JelteF

I think it's pretty close to being mergeable.

I would still like to add more test coverage, especially for deadlock detection, but not sure how to do this with the existing toolkit. The scenario I have added covers only the basic flow.

ivyazmitinov avatar Mar 21 '24 11:03 ivyazmitinov

but not sure how to do this with the existing toolkit.

I think the python test framework that we have would be a good fit for doing that. Readme and examples can be found here: https://github.com/citusdata/citus/tree/main/src/test/regress/citus_tests/test

JelteF avatar Mar 21 '24 12:03 JelteF

For now, could you look at the check-multi tests? They are timing out in CI.

Not sure when I'll be able to get back to this, so I'll just leave a note here: the tests are timing out because of a low citus.max_maintenance_shared_pool_size. Some operations require more than one connection to complete, especially with the new MainDb logic, which requires an extra maintenance connection for the SELECT citus_internal.commit_management_command_2pc() call.

The proper fix would be to implement some API to reserve the necessary number of connections from the maintenance pool and use it for all maintenance operations, but that doesn't seem feasible. Instead, it is better to introduce a simple timeout for acquiring a connection from the maintenance pool, with a meaningful error message: that way a user will be able to see that citus.max_maintenance_shared_pool_size is too small and increase it accordingly. It would also be appropriate to disable the maintenance pool limit by default: issues with maintenance connections are more likely to arise with a considerable number of databases; when there are only a few, the limit may not be necessary.
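
A rough sketch of the timeout idea described above; this is not the actual patch, and TryReserveMaintenancePoolSlot(), MaintenancePoolConditionVariable, and MaintenanceConnectionPoolTimeoutMs are hypothetical names used only for illustration:

    /* sketch only -- hypothetical helpers, not the code in this PR */
    #include "postgres.h"

    #include "pgstat.h"                       /* PG_WAIT_EXTENSION */
    #include "storage/condition_variable.h"
    #include "utils/timestamp.h"

    extern ConditionVariable MaintenancePoolConditionVariable;   /* hypothetical */
    extern int MaintenanceConnectionPoolTimeoutMs;                /* hypothetical GUC */
    extern bool TryReserveMaintenancePoolSlot(void);              /* hypothetical */

    static void
    WaitForMaintenancePoolSlot(void)
    {
        TimestampTz startTime = GetCurrentTimestamp();

        while (!TryReserveMaintenancePoolSlot())
        {
            if (TimestampDifferenceExceeds(startTime, GetCurrentTimestamp(),
                                           MaintenanceConnectionPoolTimeoutMs))
            {
                ereport(ERROR,
                        (errmsg("could not acquire a connection from the "
                                "maintenance connection pool"),
                         errhint("consider increasing "
                                 "citus.max_maintenance_shared_pool_size")));
            }

            /* wake up periodically to re-check the pool and the timeout */
            ConditionVariableTimedSleep(&MaintenancePoolConditionVariable,
                                        100 /* ms */, PG_WAIT_EXTENSION);
        }

        ConditionVariableCancelSleep();
    }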

ivyazmitinov avatar Apr 19 '24 10:04 ivyazmitinov

The timeout is basically implemented, along with a fix for the condition variables of the new pool, which I had missed before, but there are still issues related to #7203, and the test coverage should be improved.

ivyazmitinov avatar May 09 '24 14:05 ivyazmitinov