Prevent RepSetTableHash entry fields from being freed during invalidation callback
The repset_relcache_invalidate_callback() was incorrectly freeing entry->att_list and entry->row_filter directly during cache invalidation. This is unsafe behavior, as these fields may still be needed by other functions that use the entry even after cache invalidation.
Modifing the repset_relcache_invalidate_callback() to only set entry as invalid(entry->isvalid = false) and moving the cleanup of fields to get_table_replication_info() where reinitializes the entry safely.
See #496 issue for more details.
I'm not sure a deterministic test case can be created for this, since relcache invalidation can happen at any point during execution, it difficult to reproduce cache invalidation with the exact scenario again and again.
If you have any suggestions or any specific cases you think should be explicitly covered let me know?
Does the current test suite fail with debug_discard_caches=1, which removes the randomness?
- If no: can you add a test that does fail under debug_discard_caches=1?
- If yes, does this pull request remove at least one failure and not add failures? That would indicate existing coverage suffices.
Currently, debug_discard_caches=0 is used in regression tests. Even with debug_discard_caches=1, it is not possible to reproduce the exact invalidation scenario observed in #496 during regression test.
Suggesting Test Case:
STEP 1: Populate a cache entry in RepSetTableHash for sample_table with its row_filter.
SELECT pglogical.create_replication_set('sample_publisher_set', true, true, true, true);
SELECT pglogical.replication_set_add_table(set_name := 'sample_publisher_set', relation := 'public.sample_table', row_filter := $$ department_id >= 1 and department_id <= 20 $$);
SELECT * FROM pglogical.table_data_filtered(NULL::"public"."sample_table", '"public"."sample_table"'::regclass, ARRAY['sample_publisher_set']);
STEP 2: Modify the sample_table's row_filter by removing and re-adding same table with new_row_filter to sample_publisher_set. This will trigger cache invalidation in repset_relcache_invalidate_callback() for the sample_table entry.
SELECT pglogical.replication_set_remove_table('sample_publisher_set', '"public"."sample_table"'::regclass);
SELECT pglogical.replication_set_add_table(set_name := 'sample_publisher_set', relation := 'public.sample_table', row_filter := $$ department_id >= 21 and department_id <= 40 $$);
STEP 3: Repopulate the same cache entry for sample_table with new_row_filter.
SELECT * FROM pglogical.table_data_filtered(NULL::"public"."sample_table", '"public"."sample_table"'::regclass, ARRAY['sample_publisher_set']);
The above test case covers populating the cache, invalidating the cache, and repopulating the cache to invoke all modified functions.
The results from pglogical.table_data_filtered() validate that the row_filter from the cache is applied correctly even after cache invalidation.
Got it. That's elegant. If you add those queries to the test suite in sql/, do they show wrong output without the rest of the change and correct output with the rest of the change? If so, please change sql/ and expected/ to make those queries part of the test suite permanently.
If you add those queries to the test suite in sql/, do they show wrong output without the rest of the change and correct output with the rest of the change?
No, In without the rest of the changes, they don't consistently produce wrong output as result. Because the test can produce wrong results only when cache invalidation happens at the specific point observed in issue, which occurs rarely and is hard to reproduce in regression tests.
The above suggested test case can only improve code coverage in the test suite. should proceed to add test case if needed?
Please do modify sql/ and expected/ with those test queries.
Overall, the goal is for the test suite to contain unambiguous, future-proof instructions sufficient for a developer to reproduce the problem. Does the answer to my previous question change under debug_discard_caches=1? In other words, if you add those queries to the test suite in sql/ and set debug_discard_caches=1, do those queries show wrong output without the non-test changes and correct output with the non-test changes? If yes, please add an SQL comment in the test about debug_discard_caches being the pertinent environment for those queries.
If no, I guess reproducing the failure requires invalidations at some times and also requires absence of invalidations at other times. Is that right? If so, leave SQL comments at the spots needing invalidations. Thanks.
row_filter_sampling does fail trivially due to its expected output not having the test comment you added. With that fixed, things look good. In my testing against v14 (didn't try other versions), none of these got new failures:
make checkw/ debug_discard_caches=1 in regress-postgresql.confmake checkw/o debug_discard_cachesmake check_provew/ debug_discard_caches=1 in t/perl-95-postgresql.confmake check_provew/o debug_discard_caches
debug_discard_caches=1 does exercise the relevant code. The following passes at your pull request, but it crashes if I revert your pglogical_repset.c changes (keeping only the test changes). That's great news:
# raise timeouts, except in primary_key which expects to reach each timeout
sed -i "s/statement_timeout =.*/statement_timeout = '180s';/" \
$(git grep -l 'statement_timeout =' sql expected | grep -v primary_key)
# disable add_table, which hangs
sed -i 's/add_table //' Makefile
make
make install
echo debug_discard_caches=1 >>regress-postgresql.conf
# don't mind failures in primary_key (flaky at HEAD) and multiple_upstreams (fails at HEAD)
make check
Here's the crash (again, with your pglogical_repset.c changes reverted):
2161 ExprState *exprstate = pglogical_prepare_row_filter(row_filter);
(gdb) bt
#0 0x00007f130dc57924 in pglogical_table_data_filtered (fcinfo=0x55f0257cfe78) at pglogical_functions.c:2161
#1 0x000055effe40e42d in ExecMakeTableFunctionResult (setexpr=0x55f025745958, econtext=0x55f025745810, argContext=<optimized out>, expectedDesc=0x55f0257f71a0,
randomAccess=false) at execSRF.c:234
#2 0x000055effe42134b in FunctionNext (node=0x55f0257455f8) at nodeFunctionscan.c:95
#3 0x000055effe40509b in ExecProcNode (node=0x55f0257455f8) at ../../../src/include/executor/executor.h:260
#4 ExecutePlan (queryDesc=0x55f0257bef10, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x55f0257e89a8) at execMain.c:1555
#5 standard_ExecutorRun (queryDesc=0x55f0257bef10, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#6 0x000055effe59e11a in PortalRunSelect (portal=portal@entry=0x55f025787930, forward=forward@entry=true, count=0, count@entry=9223372036854775807,
dest=dest@entry=0x55f0257e89a8) at pquery.c:919
#7 0x000055effe59f8b0 in PortalRun (portal=portal@entry=0x55f025787930, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x55f0257e89a8, altdest=altdest@entry=0x55f0257e89a8, qc=0x7fff4ab3fab0) at pquery.c:763
#8 0x000055effe59b6bc in exec_simple_query (
query_string=0x55f02571b130 "SELECT * FROM pglogical.table_data_filtered(NULL::\"public\".\"sample_rowfilter_table\", '\"public\".\"sample_rowfilter_table\"'::regclass, ARRAY['sample_publisher_set']);") at postgres.c:1182
#9 0x000055effe59d154 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff4ab40000, dbname=<optimized out>, username=<optimized out>) at postgres.c:4571
#10 0x000055effe50afc2 in BackendRun (port=<optimized out>) at postmaster.c:4541
#11 BackendStartup (port=<optimized out>) at postmaster.c:4263
#12 ServerLoop () at postmaster.c:1750
#13 0x000055effe50bfa7 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x55f0256c9fa0) at postmaster.c:1422
#14 0x000055effe244fa1 in main (argc=8, argv=0x55f0256c9fa0) at main.c:211
Moreover, I think the patch is correct. Thanks for writing it. Please update with the following cosmetic changes, or let me know if you'd like me to do them:
- Modify the test file comment to say debug_discard_caches=1 also reveals the bug.
- Update the expected output file, which doesn't have the test comment you added.
- Remove trailing whitespace in new pglogical_repset.c lines
- Change leading spaces to tab in the new MemSet line
Then I'll merge this pull request.
Great, please proceed with your suggested changes and do the merge. Thanks