Flaky columnar regression test

Open hanefi opened this issue 2 years ago • 3 comments

We started to see flaky results in the columnar_first_row_number test.

After 74ce210f8b23f592d2b4308fb4a0f69f7dd73e88 was committed, the same regression test failed on both PG13 and PG14 on the master branch with the same diff:

diff -dU10 -w /home/circleci/project/src/test/regress/expected/columnar_first_row_number.out /home/circleci/project/src/test/regress/results/columnar_first_row_number.out
--- /home/circleci/project/src/test/regress/expected/columnar_first_row_number.out.modified	2022-05-25 14:52:42.489302871 +0000
+++ /home/circleci/project/src/test/regress/results/columnar_first_row_number.out.modified	2022-05-25 14:52:42.501303178 +0000
@@ -5,29 +5,21 @@
 BEGIN;
   -- we don't use same first_row_number even if the xact is rollback'ed
   INSERT INTO col_table_1 SELECT i FROM generate_series(1, 11) i;
 ROLLBACK;
 INSERT INTO col_table_1 SELECT i FROM generate_series(1, 12) i;
 ALTER TABLE col_table_1 SET (columnar.stripe_row_limit = 1000);
 INSERT INTO col_table_1 SELECT i FROM generate_series(1, 2350) i;
 SELECT row_count, first_row_number FROM columnar.stripe a
 WHERE a.storage_id = columnar.get_storage_id('col_table_1'::regclass)
 ORDER BY stripe_num;
- row_count | first_row_number
----------------------------------------------------------------------
-        10 |                1
-        12 |           300001
-      1000 |           450001
-      1000 |           451001
-       350 |           452001
-(5 rows)
-
+ERROR:  could not open relation with OID 17180
 VACUUM FULL col_table_1;
 -- show that we properly update first_row_number after VACUUM FULL
 SELECT row_count, first_row_number FROM columnar.stripe a
 WHERE a.storage_id = columnar.get_storage_id('col_table_1'::regclass)
 ORDER BY stripe_num;
  row_count | first_row_number 
 -----------+------------------
       1000 |                1
       1000 |             1001
        372 |             2001

@jeff-davis do you think this may be related to the wraparound bugfix? Any ideas on why this test failed?

hanefi avatar May 31 '22 22:05 hanefi

happened again

onderkalaci avatar Jun 13 '22 09:06 onderkalaci

I couldn't repro the failure, but I added a couple of lines to the SQL file and ran it locally, and I see:

SELECT 'col_table_1'::regclass::oid AS col_table_1;
 col_table_1
---------------------------------------------------------------------
       17180
(1 row)

SELECT oid as col_table_1 FROM pg_class where relname='col_table_1';
 col_table_1
---------------------------------------------------------------------
       17180
(1 row)

The same table seems to get the same OID deterministically, because there are no parallel tests before that. And col_table_1 appears to have the OID 17180, which matches the test diff.

That's strange: it means that it's failing to open a totally valid OID in your test. The only reference in the query to col_table_1 is columnar.get_storage_id('col_table_1'::regclass). The cast to regclass doesn't seem to actually open the relation, so it seems the error is likely thrown within get_storage_id(). The function is volatile, so it shouldn't be called at plan time.

I'm trying to figure out how it's possible that the plain relation_open() in get_storage_id() manages to fail with a totally valid OID. I must be missing something simple, or there must be something strange going on.

jeff-davis avatar Jun 15 '22 00:06 jeff-davis

Not sure if I'm just breaking something, but I was able to repro this issue while running check-columnar locally by:

  1. opening a psql session with the server the test was going to run against
  2. running make clean
  3. running make install -j9 && make -C src/test/regress/ check-columnar
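
For anyone retrying this, steps 2 and 3 above could be wrapped in a small helper like the one below. This is only a sketch: the names run, repro_columnar_flake, and DRY_RUN are hypothetical and not part of the Citus tree. Step 1 (the open psql session) is left manual, because the connection details and timing are not given above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Citus repository.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Steps 2 and 3 from the comment above, run from the repository root.
# Step 1 is manual: keep a psql session open against the server the
# test will run against (connection details are not given in the thread).
repro_columnar_flake() {
    run make clean || return 1
    run make install -j9 || return 1
    run make -C src/test/regress/ check-columnar
}
```

Calling `DRY_RUN=1 repro_columnar_flake` only prints the three make commands, which is a cheap way to check the sequence before running it for real.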

yxu2162 avatar Aug 11 '22 18:08 yxu2162

Probably not applicable anymore, as we no longer see any occurrences of this failure.

onurctirtir avatar Nov 24 '23 09:11 onurctirtir