citus_docs icon indicating copy to clipboard operation
citus_docs copied to clipboard

Document connectivity check UDFs in 11.0

Open hanefi opened this issue 2 years ago • 1 comments

Why are we implementing it? (sales eng)

  1. Improved supportability.

Citus MX and (in general Citus) requires the nodes to be able to connect to each other. Citus now provides a simple UDF that provides the connectivity information across the whole cluster.

  1. Possible improvements in data planes

We can check for connectivity issues regularly in our managed database offerings. This can help detect network issues before they are reported by the customer via a ticket.

What are the typical use cases?

  • Routine controls in Control planes
  • Database Operators working on troubleshooting in a cluster

Communication goals (e.g. detailed howto vs orientation)

Good locations for content in docs structure

  • A new section under Useful Diagnostic Queries
  • Cluster Management -> Security -> Connection management We can have a note section here indicating that the users can use the new UDF to check for broken ssl setups, or other connection management configuration problems.

How does this work? (devs)

citus_check_cluster_node_health() returns setof records with names and ports for the source and target nodes, and the connectivity check result.

There are 3 possible result column values:

  • true : connection attempt from source to target succeeded.
  • false : connection attempt from source to target failed.
  • NULL : connection attempt from the current node to source node failed.

Example sql

This UDF can return many rows on large clusters. We suggest operators to run the following query that returns a single boolean first. If this one is true, all the checks were successful. Otherwise, the operator should look into the connectivity report on all the worker pairs.

SELECT bool_and(coalesce(result, false)) FROM citus_check_cluster_node_health();
 bool_and
---------------------------------------------------------------------
 f
(1 row)

For a quick summary, one can also use the following query to get the distinct counts of connectivity results in the cluster.

SELECT result, count(*) FROM citus_check_cluster_node_health() GROUP BY result ORDER BY 1;
 result | count
---------------------------------------------------------------------
 t      |     2
        |     2
(2 rows)

To check all the results for the connectivity checks, run the following:

SELECT * FROM citus_check_cluster_node_health() ORDER BY 1,2,3,4;
 from_nodename | from_nodeport | to_nodename | to_nodeport | result
---------------------------------------------------------------------
 localhost     |         57637 | localhost   |       57637 | t
 localhost     |         57637 | localhost   |       57638 | t
 localhost     |         57638 | localhost   |       57637 | 
 localhost     |         57638 | localhost   |       57638 | 

Corner cases, gotchas

It is risky to use the following query to check for an aggregate summary: SELECT bool_and(result) FROM citus_check_cluster_node_health(); because bool_and aggregate returns true even if some of the rows were NULL. Using coalesce(result, false) is crucial here.

Are there relevant blog posts or outside documentation about the concept/feature?

Link to relevant commits and regression tests if applicable

https://github.com/citusdata/citus/pull/5509

hanefi avatar Dec 16 '21 22:12 hanefi

https://github.com/citusdata/citus_docs/pull/1038 includes docs for citus_check_cluster_node_health() itself, but we're still lacking content in useful diagnostic queries and the cluster management page like you suggsted.

jonels-msft avatar Jun 16 '22 19:06 jonels-msft