timescaledb
timescaledb copied to clipboard
[Flaky test] dist_api_calls
Which test is flaky?
dist_api_calls
Since when has the test been flaky?
yes
Link to the failed test run
https://github.com/timescale/timescaledb/runs/4927543203?check_suite_focus=true
Log output
diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out 2022-01-24 20:08:47.839263390 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out 2022-01-24 20:12:34.891341510 +0000
@@ -380,13 +380,4 @@
(1 row)
ANALYZE disttable_repl;
-SELECT approximate_row_count('disttable_repl');
- approximate_row_count
------------------------
- 8
-(1 row)
-
-DROP TABLE disttable_repl;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
+ERROR: [data_node_1]: relation "disttable_repl" does not exist
Reason for flakiness
No response
@svenklemm often I face this flaky test... seems is something related to the test.remote_exec
but I didn't deep dive on it. Shouldn't be better place it to the IGNORE list?
This looks like a serious multinode failure i dont think ignoring this is the right approach here, also i've seen that same error on several other multinode tests.
Fails here as well: https://github.com/timescale/timescaledb/runs/6591594389?check_suite_focus=true
And here: https://github.com/timescale/timescaledb/runs/6827486034?check_suite_focus=true
diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out 2022-06-10 08:25:58.171757882 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out 2022-06-10 08:29:52.126006906 +0000
@@ -380,13 +380,4 @@
(1 row)
ANALYZE disttable_repl;
-SELECT approximate_row_count('disttable_repl');
- approximate_row_count
------------------------
- 8
-(1 row)
-
-DROP TABLE disttable_repl;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
+ERROR: [data_node_1]: relation "disttable_repl" does not exist
https://github.com/timescale/timescaledb/runs/7210303471?check_suite_focus=true
https://github.com/timescale/timescaledb/runs/7253018386?check_suite_focus=true
https://github.com/timescale/timescaledb/runs/8152608921?check_suite_focus=true
https://github.com/timescale/timescaledb/runs/8185469083?check_suite_focus=true
diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out 2022-09-05 09:20:02.015418326 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out 2022-09-05 09:23:57.626568430 +0000
@@ -380,13 +380,4 @@
(1 row)
ANALYZE disttable_repl;
-SELECT approximate_row_count('disttable_repl');
- approximate_row_count
------------------------
- 8
-(1 row)
-
-DROP TABLE disttable_repl;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
+ERROR: [data_node_1]: relation "disttable_repl" does not exist
The usual mystery disappearance of the disttable_repl
... I added statement logging to investigate it, does the server log show anything?
The usual mystery disappearance of the
disttable_repl
... I added statement logging to investigate it, does the server log show anything?
For some reason, the logs are not available. Maybe I am looking at the wrong place?
https://github.com/timescale/timescaledb/actions/runs/2991955278/attempts/2
The usual mystery disappearance of the
disttable_repl
... I added statement logging to investigate it, does the server log show anything?For some reason, the logs are not available. Maybe I am looking at the wrong place?
https://github.com/timescale/timescaledb/actions/runs/2991955278/attempts/2
Seems artifacts are deleted from older jobs. This looks fine.
https://github.com/timescale/timescaledb/actions/runs/2991955278
So,
ANALYZE disttable_repl;
the above always sets the search_path
before sending the ANALYZE command to the datanodes.
The further steps which send calls to DNs use the deparse_func_call
function which schema qualifies table name references explicitly. Like
SELECT chunk_id,hypertable_id,att_num,..... FROM _timescaledb_internal.get_chunk_colstats(relid => 'public.disttable_repl')
However the error that we see from data_node_1
is:
ERROR: [data_node_1]: relation "disttable_repl" does not exist
I am wondering if somehow in ts_dist_cmd_invoke_on_data_nodes_using_search_path
, the first request to set the search path happens on connection1 to data_node_1
. Then the immediate request to ANALYZE disttable_repl
happens on connection2 to data_node_1
? If this happens then connection2 will not have public
in its search_path
and the call to ANALYZE disttable_repl
will fail as above.
I was trying to look at ways in which we could get two different connections above. It's possible that a call to cache_syscache_invalidate
might invalidate the connection1
entry causing the connection_cache_update_entry
function to determine that connection1
is unusable leading it to create connection2
without the search_path set. That can then lead to this relation "disttable_repl" does not exist
error from one of the datanodes.
So my theory about different connections being used is correct!
The attached PG log has a failed dist_api_calls test.
PostgreSQL log ubuntu-20.04 Debug 12.12.zip
cluster_super_user [12841] db_dist_api_calls_1 LOCATION: exec_simple_query, postgres.c:1041
cluster_super_user [12841] db_dist_api_calls_1 LOG: 00000: execute <unnamed>: SET search_path = "$user", public, pg_catalog
cluster_super_user [12841] db_dist_api_calls_1 LOCATION: exec_execute_message, postgres.c:2045
As can be seen above the ts_dist_cmd_invoke_on_data_nodes_using_search_path
function is setting the search_path
above. The PID for the connection is 12841
Then it expects that the ANALYZE
call will run on this same connection. However from the logs below we see that a new backend with PID 12847
is being used for the ANALYZE
call.
cluster_super_user [12847] db_dist_api_calls_1 LOG: 00000: statement: SET search_path = pg_catalog;SET datestyle = ISO;SET intervalstyle = postgres;SET extra_float_digits = 3;
cluster_super_user [12847] db_dist_api_calls_1 LOCATION: exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG: 00000: statement: SET TIMEZONE = 'PST8PDT'
cluster_super_user [12847] db_dist_api_calls_1 LOCATION: exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG: 00000: statement: SELECT extversion FROM pg_extension WHERE extname = 'timescaledb'
cluster_super_user [12847] db_dist_api_calls_1 LOCATION: exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG: 00000: statement: SELECT * FROM _timescaledb_internal.set_peer_dist_id('0571304f-b931-4a8a-a40a-2b835986d10b')
cluster_super_user [12847] db_dist_api_calls_1 LOCATION: exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG: 00000: execute <unnamed>: ANALYZE disttable_repl;
cluster_super_user [12847] db_dist_api_calls_1 LOCATION: exec_execute_message, postgres.c:2045
cluster_super_user [12847] db_dist_api_calls_1 ERROR: 42P01: relation "disttable_repl" does not exist
cluster_super_user [12847] db_dist_api_calls_1 LOCATION: RangeVarGetRelidExtended, namespace.c:424
cluster_super_user [12847] db_dist_api_calls_1 STATEMENT: ANALYZE disttable_repl;
The 12847
connection does not have the search_path
set appropriately leading to the relation "disttable_repl" does not exist
exists error.
https://github.com/timescale/timescaledb/actions/runs/3058131767/jobs/4934005830
https://github.com/timescale/timescaledb/actions/runs/3080469953/jobs/4977851983
https://github.com/timescale/timescaledb/actions/runs/3136271306/jobs/5092974002
PR #4771 submitted to solve this