timescaledb icon indicating copy to clipboard operation
timescaledb copied to clipboard

[Flaky test] dist_api_calls

Open svenklemm opened this issue 2 years ago • 16 comments

Which test is flaky?

dist_api_calls

Since when has the test been flaky?

yes

Link to the failed test run

https://github.com/timescale/timescaledb/runs/4927543203?check_suite_focus=true

Log output

diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out	2022-01-24 20:08:47.839263390 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out	2022-01-24 20:12:34.891341510 +0000
@@ -380,13 +380,4 @@
 (1 row)
 
 ANALYZE disttable_repl;
-SELECT approximate_row_count('disttable_repl');
- approximate_row_count 
------------------------
-                     8
-(1 row)
-
-DROP TABLE disttable_repl;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
+ERROR:  [data_node_1]: relation "disttable_repl" does not exist

Reason for flakiness

No response

svenklemm avatar Jan 25 '22 07:01 svenklemm

@svenklemm often I face this flaky test... seems is something related to the test.remote_exec but I didn't deep dive on it. Shouldn't be better place it to the IGNORE list?

fabriziomello avatar May 12 '22 21:05 fabriziomello

This looks like a serious multinode failure i dont think ignoring this is the right approach here, also i've seen that same error on several other multinode tests.

svenklemm avatar May 14 '22 06:05 svenklemm

Fails here as well: https://github.com/timescale/timescaledb/runs/6591594389?check_suite_focus=true

mkindahl avatar May 25 '22 12:05 mkindahl

And here: https://github.com/timescale/timescaledb/runs/6827486034?check_suite_focus=true

diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out	2022-06-10 08:25:58.171757882 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out	2022-06-10 08:29:52.126006906 +0000
@@ -380,13 +380,4 @@
 (1 row)
 ANALYZE disttable_repl;
-SELECT approximate_row_count('disttable_repl');
- approximate_row_count 
------------------------
-                     8
-(1 row)
-
-DROP TABLE disttable_repl;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
+ERROR:  [data_node_1]: relation "disttable_repl" does not exist

mkindahl avatar Jun 10 '22 09:06 mkindahl

https://github.com/timescale/timescaledb/runs/7210303471?check_suite_focus=true

mkindahl avatar Jul 06 '22 07:07 mkindahl

https://github.com/timescale/timescaledb/runs/7253018386?check_suite_focus=true

mkindahl avatar Jul 08 '22 14:07 mkindahl

https://github.com/timescale/timescaledb/runs/8152608921?check_suite_focus=true

mkindahl avatar Sep 02 '22 09:09 mkindahl

https://github.com/timescale/timescaledb/runs/8185469083?check_suite_focus=true

diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/dist_api_calls.out	2022-09-05 09:20:02.015418326 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/dist_api_calls.out	2022-09-05 09:23:57.626568430 +0000
@@ -380,13 +380,4 @@
 (1 row)
 
 ANALYZE disttable_repl;
-SELECT approximate_row_count('disttable_repl');
- approximate_row_count 
------------------------
-                     8
-(1 row)
-
-DROP TABLE disttable_repl;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
+ERROR:  [data_node_1]: relation "disttable_repl" does not exist

mkindahl avatar Sep 05 '22 09:09 mkindahl

The usual mystery disappearance of the disttable_repl... I added statement logging to investigate it, does the server log show anything?

akuzm avatar Sep 05 '22 12:09 akuzm

The usual mystery disappearance of the disttable_repl... I added statement logging to investigate it, does the server log show anything?

For some reason, the logs are not available. Maybe I am looking at the wrong place?

https://github.com/timescale/timescaledb/actions/runs/2991955278/attempts/2

mkindahl avatar Sep 06 '22 07:09 mkindahl

The usual mystery disappearance of the disttable_repl... I added statement logging to investigate it, does the server log show anything?

For some reason, the logs are not available. Maybe I am looking at the wrong place?

https://github.com/timescale/timescaledb/actions/runs/2991955278/attempts/2

Seems artifacts are deleted from older jobs. This looks fine.

https://github.com/timescale/timescaledb/actions/runs/2991955278

mkindahl avatar Sep 06 '22 07:09 mkindahl

So,

ANALYZE disttable_repl;

the above always sets the search_path before sending the ANALYZE command to the datanodes.

The further steps which send calls to DNs use the deparse_func_call function which schema qualifies table name references explicitly. Like

SELECT chunk_id,hypertable_id,att_num,..... FROM _timescaledb_internal.get_chunk_colstats(relid => 'public.disttable_repl')

However the error that we see from data_node_1 is:

ERROR:  [data_node_1]: relation "disttable_repl" does not exist

I am wondering if somehow in ts_dist_cmd_invoke_on_data_nodes_using_search_path, the first request to set the search path happens on connection1 to data_node_1. Then the immediate request to ANALYZE disttable_repl happens on connection2 to data_node_1? If this happens then connection2 will not have public in its search_path and the call to ANALYZE disttable_repl will fail as above.

nikkhils avatar Sep 06 '22 10:09 nikkhils

I was trying to look at ways in which we could get two different connections above. It's possible that a call to cache_syscache_invalidate might invalidate the connection1 entry causing the connection_cache_update_entry function to determine that connection1 is unusable leading it to create connection2 without the search_path set. That can then lead to this relation "disttable_repl" does not exist error from one of the datanodes.

nikkhils avatar Sep 08 '22 07:09 nikkhils

So my theory about different connections being used is correct!

The attached PG log has a failed dist_api_calls test.

PostgreSQL log ubuntu-20.04 Debug 12.12.zip

cluster_super_user [12841] db_dist_api_calls_1 LOCATION:  exec_simple_query, postgres.c:1041
cluster_super_user [12841] db_dist_api_calls_1 LOG:  00000: execute <unnamed>: SET search_path = "$user", public, pg_catalog
cluster_super_user [12841] db_dist_api_calls_1 LOCATION:  exec_execute_message, postgres.c:2045

As can be seen above the ts_dist_cmd_invoke_on_data_nodes_using_search_path function is setting the search_path above. The PID for the connection is 12841

Then it expects that the ANALYZE call will run on this same connection. However from the logs below we see that a new backend with PID 12847 is being used for the ANALYZE call.

cluster_super_user [12847] db_dist_api_calls_1 LOG:  00000: statement: SET search_path = pg_catalog;SET datestyle = ISO;SET intervalstyle = postgres;SET extra_float_digits = 3;
cluster_super_user [12847] db_dist_api_calls_1 LOCATION:  exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG:  00000: statement: SET TIMEZONE = 'PST8PDT'
cluster_super_user [12847] db_dist_api_calls_1 LOCATION:  exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG:  00000: statement: SELECT extversion FROM pg_extension WHERE extname = 'timescaledb'
cluster_super_user [12847] db_dist_api_calls_1 LOCATION:  exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG:  00000: statement: SELECT * FROM _timescaledb_internal.set_peer_dist_id('0571304f-b931-4a8a-a40a-2b835986d10b')
cluster_super_user [12847] db_dist_api_calls_1 LOCATION:  exec_simple_query, postgres.c:1041
cluster_super_user [12847] db_dist_api_calls_1 LOG:  00000: execute <unnamed>: ANALYZE disttable_repl;
cluster_super_user [12847] db_dist_api_calls_1 LOCATION:  exec_execute_message, postgres.c:2045
cluster_super_user [12847] db_dist_api_calls_1 ERROR:  42P01: relation "disttable_repl" does not exist
cluster_super_user [12847] db_dist_api_calls_1 LOCATION:  RangeVarGetRelidExtended, namespace.c:424
cluster_super_user [12847] db_dist_api_calls_1 STATEMENT:  ANALYZE disttable_repl;

The 12847 connection does not have the search_path set appropriately leading to the relation "disttable_repl" does not exist exists error.

nikkhils avatar Sep 08 '22 08:09 nikkhils

https://github.com/timescale/timescaledb/actions/runs/3058131767/jobs/4934005830

mkindahl avatar Sep 15 '22 06:09 mkindahl

https://github.com/timescale/timescaledb/actions/runs/3080469953/jobs/4977851983

mkindahl avatar Sep 19 '22 08:09 mkindahl

https://github.com/timescale/timescaledb/actions/runs/3136271306/jobs/5092974002

mkindahl avatar Sep 27 '22 14:09 mkindahl

PR #4771 submitted to solve this

nikkhils avatar Oct 03 '22 13:10 nikkhils