pglogical icon indicating copy to clipboard operation
pglogical copied to clipboard

Prevent RepSetTableHash entry fields from being freed during invalidation callback

Open VishalPrasanna opened this issue 7 months ago • 6 comments

The repset_relcache_invalidate_callback() was incorrectly freeing entry->att_list and entry->row_filter directly during cache invalidation. This is unsafe behavior, as these fields may still be needed by other functions that use the entry even after cache invalidation.

Modifing the repset_relcache_invalidate_callback() to only set entry as invalid(entry->isvalid = false) and moving the cleanup of fields to get_table_replication_info() where reinitializes the entry safely.

See #496 issue for more details.

VishalPrasanna avatar May 26 '25 03:05 VishalPrasanna

I'm not sure a deterministic test case can be created for this, since relcache invalidation can happen at any point during execution, it difficult to reproduce cache invalidation with the exact scenario again and again.

If you have any suggestions or any specific cases you think should be explicitly covered let me know?

VishalPrasanna avatar Jun 02 '25 02:06 VishalPrasanna

Does the current test suite fail with debug_discard_caches=1, which removes the randomness?

  • If no: can you add a test that does fail under debug_discard_caches=1?
  • If yes, does this pull request remove at least one failure and not add failures? That would indicate existing coverage suffices.

nmisch avatar Jun 03 '25 02:06 nmisch

Currently, debug_discard_caches=0 is used in regression tests. Even with debug_discard_caches=1, it is not possible to reproduce the exact invalidation scenario observed in #496 during regression test.

Suggesting Test Case:


STEP 1: Populate a cache entry in RepSetTableHash for sample_table with its row_filter.

SELECT pglogical.create_replication_set('sample_publisher_set', true, true, true, true);
SELECT pglogical.replication_set_add_table(set_name := 'sample_publisher_set', relation := 'public.sample_table', row_filter := $$ department_id >= 1  and department_id <= 20 $$);
SELECT * FROM pglogical.table_data_filtered(NULL::"public"."sample_table", '"public"."sample_table"'::regclass, ARRAY['sample_publisher_set']);

STEP 2: Modify the sample_table's row_filter by removing and re-adding same table with new_row_filter to sample_publisher_set. This will trigger cache invalidation in repset_relcache_invalidate_callback() for the sample_table entry.

SELECT pglogical.replication_set_remove_table('sample_publisher_set', '"public"."sample_table"'::regclass);
SELECT pglogical.replication_set_add_table(set_name := 'sample_publisher_set', relation := 'public.sample_table', row_filter := $$ department_id >= 21  and department_id <= 40 $$);

STEP 3: Repopulate the same cache entry for sample_table with new_row_filter.

SELECT * FROM pglogical.table_data_filtered(NULL::"public"."sample_table", '"public"."sample_table"'::regclass, ARRAY['sample_publisher_set']);

The above test case covers populating the cache, invalidating the cache, and repopulating the cache to invoke all modified functions. The results from pglogical.table_data_filtered() validate that the row_filter from the cache is applied correctly even after cache invalidation.

VishalPrasanna avatar Jun 17 '25 09:06 VishalPrasanna

Got it. That's elegant. If you add those queries to the test suite in sql/, do they show wrong output without the rest of the change and correct output with the rest of the change? If so, please change sql/ and expected/ to make those queries part of the test suite permanently.

nmisch avatar Jun 17 '25 16:06 nmisch

If you add those queries to the test suite in sql/, do they show wrong output without the rest of the change and correct output with the rest of the change?

No, In without the rest of the changes, they don't consistently produce wrong output as result. Because the test can produce wrong results only when cache invalidation happens at the specific point observed in issue, which occurs rarely and is hard to reproduce in regression tests.

The above suggested test case can only improve code coverage in the test suite. should proceed to add test case if needed?

VishalPrasanna avatar Jun 18 '25 06:06 VishalPrasanna

Please do modify sql/ and expected/ with those test queries.

Overall, the goal is for the test suite to contain unambiguous, future-proof instructions sufficient for a developer to reproduce the problem. Does the answer to my previous question change under debug_discard_caches=1? In other words, if you add those queries to the test suite in sql/ and set debug_discard_caches=1, do those queries show wrong output without the non-test changes and correct output with the non-test changes? If yes, please add an SQL comment in the test about debug_discard_caches being the pertinent environment for those queries.

If no, I guess reproducing the failure requires invalidations at some times and also requires absence of invalidations at other times. Is that right? If so, leave SQL comments at the spots needing invalidations. Thanks.

nmisch avatar Jun 20 '25 03:06 nmisch

row_filter_sampling does fail trivially due to its expected output not having the test comment you added. With that fixed, things look good. In my testing against v14 (didn't try other versions), none of these got new failures:

  • make check w/ debug_discard_caches=1 in regress-postgresql.conf
  • make check w/o debug_discard_caches
  • make check_prove w/ debug_discard_caches=1 in t/perl-95-postgresql.conf
  • make check_prove w/o debug_discard_caches

debug_discard_caches=1 does exercise the relevant code. The following passes at your pull request, but it crashes if I revert your pglogical_repset.c changes (keeping only the test changes). That's great news:

# raise timeouts, except in primary_key which expects to reach each timeout
sed -i "s/statement_timeout =.*/statement_timeout = '180s';/" \
  $(git grep -l 'statement_timeout =' sql expected | grep -v primary_key)
# disable add_table, which hangs
sed -i 's/add_table //' Makefile
make
make install
echo debug_discard_caches=1 >>regress-postgresql.conf
# don't mind failures in primary_key (flaky at HEAD) and multiple_upstreams (fails at HEAD)
make check

Here's the crash (again, with your pglogical_repset.c changes reverted):

2161                    ExprState  *exprstate = pglogical_prepare_row_filter(row_filter);
(gdb) bt
#0  0x00007f130dc57924 in pglogical_table_data_filtered (fcinfo=0x55f0257cfe78) at pglogical_functions.c:2161
#1  0x000055effe40e42d in ExecMakeTableFunctionResult (setexpr=0x55f025745958, econtext=0x55f025745810, argContext=<optimized out>, expectedDesc=0x55f0257f71a0, 
    randomAccess=false) at execSRF.c:234
#2  0x000055effe42134b in FunctionNext (node=0x55f0257455f8) at nodeFunctionscan.c:95
#3  0x000055effe40509b in ExecProcNode (node=0x55f0257455f8) at ../../../src/include/executor/executor.h:260
#4  ExecutePlan (queryDesc=0x55f0257bef10, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x55f0257e89a8) at execMain.c:1555
#5  standard_ExecutorRun (queryDesc=0x55f0257bef10, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:360
#6  0x000055effe59e11a in PortalRunSelect (portal=portal@entry=0x55f025787930, forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
    dest=dest@entry=0x55f0257e89a8) at pquery.c:919
#7  0x000055effe59f8b0 in PortalRun (portal=portal@entry=0x55f025787930, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
    run_once=run_once@entry=true, dest=dest@entry=0x55f0257e89a8, altdest=altdest@entry=0x55f0257e89a8, qc=0x7fff4ab3fab0) at pquery.c:763
#8  0x000055effe59b6bc in exec_simple_query (
    query_string=0x55f02571b130 "SELECT * FROM pglogical.table_data_filtered(NULL::\"public\".\"sample_rowfilter_table\", '\"public\".\"sample_rowfilter_table\"'::regclass, ARRAY['sample_publisher_set']);") at postgres.c:1182
#9  0x000055effe59d154 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7fff4ab40000, dbname=<optimized out>, username=<optimized out>) at postgres.c:4571
#10 0x000055effe50afc2 in BackendRun (port=<optimized out>) at postmaster.c:4541
#11 BackendStartup (port=<optimized out>) at postmaster.c:4263
#12 ServerLoop () at postmaster.c:1750
#13 0x000055effe50bfa7 in PostmasterMain (argc=argc@entry=8, argv=argv@entry=0x55f0256c9fa0) at postmaster.c:1422
#14 0x000055effe244fa1 in main (argc=8, argv=0x55f0256c9fa0) at main.c:211

Moreover, I think the patch is correct. Thanks for writing it. Please update with the following cosmetic changes, or let me know if you'd like me to do them:

  • Modify the test file comment to say debug_discard_caches=1 also reveals the bug.
  • Update the expected output file, which doesn't have the test comment you added.
  • Remove trailing whitespace in new pglogical_repset.c lines
  • Change leading spaces to tab in the new MemSet line

Then I'll merge this pull request.

nmisch avatar Jul 19 '25 23:07 nmisch

Great, please proceed with your suggested changes and do the merge. Thanks

VishalPrasanna avatar Jul 24 '25 16:07 VishalPrasanna