Flaky columnar regression test
We started to see some flaky results in the columnar_first_row_number test.
After 74ce210f8b23f592d2b4308fb4a0f69f7dd73e88 was committed, the same regression test failed on both PG13 and PG14 on the master branch with the same diff.
diff -dU10 -w /home/circleci/project/src/test/regress/expected/columnar_first_row_number.out /home/circleci/project/src/test/regress/results/columnar_first_row_number.out
--- /home/circleci/project/src/test/regress/expected/columnar_first_row_number.out.modified 2022-05-25 14:52:42.489302871 +0000
+++ /home/circleci/project/src/test/regress/results/columnar_first_row_number.out.modified 2022-05-25 14:52:42.501303178 +0000
@@ -5,29 +5,21 @@
BEGIN;
-- we don't use same first_row_number even if the xact is rollback'ed
INSERT INTO col_table_1 SELECT i FROM generate_series(1, 11) i;
ROLLBACK;
INSERT INTO col_table_1 SELECT i FROM generate_series(1, 12) i;
ALTER TABLE col_table_1 SET (columnar.stripe_row_limit = 1000);
INSERT INTO col_table_1 SELECT i FROM generate_series(1, 2350) i;
SELECT row_count, first_row_number FROM columnar.stripe a
WHERE a.storage_id = columnar.get_storage_id('col_table_1'::regclass)
ORDER BY stripe_num;
- row_count | first_row_number
----------------------------------------------------------------------
- 10 | 1
- 12 | 300001
- 1000 | 450001
- 1000 | 451001
- 350 | 452001
-(5 rows)
-
+ERROR: could not open relation with OID 17180
VACUUM FULL col_table_1;
-- show that we properly update first_row_number after VACUUM FULL
SELECT row_count, first_row_number FROM columnar.stripe a
WHERE a.storage_id = columnar.get_storage_id('col_table_1'::regclass)
ORDER BY stripe_num;
row_count | first_row_number
-----------+------------------
1000 | 1
1000 | 1001
372 | 2001
@jeff-davis do you think that this may be related to the wraparound bugfix? Any ideas on why this test failed?
happened again
I didn't repro the failure, but I added a couple lines to the SQL file and ran it locally, and I see:
SELECT 'col_table_1'::regclass::oid AS col_table_1;
col_table_1
---------------------------------------------------------------------
17180
(1 row)
SELECT oid as col_table_1 FROM pg_class where relname='col_table_1';
col_table_1
---------------------------------------------------------------------
17180
(1 row)
The same table seems to get the same OID deterministically, because there are no parallel tests before that. And col_table_1 appears to have the OID 17180, which matches the test diff.
That's strange: it means that it's failing to open a totally valid OID in your test. The only reference in the query to col_table_1 is columnar.get_storage_id('col_table_1'::regclass). The cast to regclass doesn't seem to actually open the relation, so it seems the error is likely thrown within get_storage_id(). The function is volatile, so it shouldn't be called at plan time.
I'm trying to figure out how it's possible that the plain relation_open() in get_storage_id() manages to fail with a totally valid OID. I must be missing something simple, or there must be something strange going on.
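For anyone who catches the failure live, a few catalog checks might help narrow it down. This is only a sketch, not something that was run against a failing instance: it assumes the session is connected to the same database as the test, that the OID is still 17180 as in the diff, and that get_storage_id lives in the columnar schema as the test query suggests.

```sql
-- Does pg_class still have a row for the OID from the error message?
SELECT oid, relname, relkind, relfilenode
FROM pg_class
WHERE oid = 17180;

-- Which volatility does the catalog record for get_storage_id?
-- provolatile is 'v' for volatile, 's' for stable, 'i' for immutable.
SELECT p.proname, p.provolatile
FROM pg_proc p
JOIN pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'columnar'
  AND p.proname = 'get_storage_id';

-- Call the function on its own, outside the columnar.stripe query,
-- to see whether the error comes from this call alone.
SELECT columnar.get_storage_id('col_table_1'::regclass);
```

If the first query returns a row but the last one still errors, that would point at the function call itself rather than at a missing catalog entry.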
Not sure if I'm just breaking something, but I was able to repro this issue while running check-columnar locally by:
- opening a psql session with the server the test was going to run against
- running make clean
- running make install -j9 && make -C src/test/regress/ check-columnar
Probably not applicable anymore, as we don't see any more occurrences of this failure.