Filter replication publication columns
Following https://github.com/electric-sql/electric/issues/1804 and corresponding PR https://github.com/electric-sql/electric/pull/1829
We want to alter the publication so that only the selected/required columns are sent over the replication stream: the columns selected via the column parameter, plus any columns referenced in the where clause, if one is specified.
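Concretely, on PG 15+ a publication entry can carry both a column list and a row filter, so the altered publication could look something like the following sketch (publication, table, and column names here are illustrative, not Electric's actual ones):

```sql
-- Replicate only the shape's selected columns, plus any column the
-- shape's where clause references (illustrative names; PG 15+ syntax).
ALTER PUBLICATION electric_publication
  SET TABLE issues (id, title, project_id) WHERE (project_id = 5);
```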
While PG does support partial replication of a table (at least for PG15 and above I think), there are some warnings in the docs:
> it is [...] possible to get into [a problematic] situation by adding or altering column lists on the publication side after a subscription has been created. This means changing the column lists of tables on publications that are already subscribed could lead to errors being thrown on the subscriber side.
This implies that even with a single publication and subscription, altering the column lists of tables might lead to errors - this needs to be investigated.
After discussing, we think the warning does not affect us, since we're impersonating the protocol and we only have one subscriber for one publication anyway.
Short update on issue that I ran into with this:
It seems that we cannot set REPLICA IDENTITY to FULL, which always sends the old column values of updated rows, if we have column filtering on the publication (see docs).
The error that I get if I do is:
```
** (Postgrex.Error) ERROR 42P10 (invalid_column_reference) cannot update table "wide_table"
Column list used by the publication does not cover the replica identity.
```
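A minimal reproduction sketch of the conflict, assuming a table shaped like the one in the error above:

```sql
-- FULL makes the replica identity cover every column, so any column
-- list that omits a column no longer "covers" it and updates fail.
CREATE TABLE wide_table (id bigint PRIMARY KEY, value1 text, value2 text);
ALTER TABLE wide_table REPLICA IDENTITY FULL;
CREATE PUBLICATION pub FOR TABLE wide_table (id, value1);
INSERT INTO wide_table VALUES (1, 'a', 'b');
UPDATE wide_table SET value1 = 'x' WHERE id = 1;
-- ERROR 42P10 (invalid_column_reference): cannot update table "wide_table"
```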
We don't need REPLICA IDENTITY FULL all the time, so where possible we should avoid setting it, so that we can keep filtering columns in the publication.
We need FULL when:
- the table doesn't have a primary key defined
- the developer requested a shape with the param `replica=full`
In case a developer requests a shape with `replica=full` and that shape is later deleted, we should ideally reset the REPLICA IDENTITY back to the default. Note that this is a sensitive operation because it might cause deadlocks.
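The two transitions would then be a pair of DDL statements; each takes an exclusive lock on the table, which is where the deadlock risk comes from (table name illustrative):

```sql
-- While at least one replica=full shape (or a table without a primary
-- key) needs it:
ALTER TABLE issues REPLICA IDENTITY FULL;
-- Once the last such shape is deleted, reset to the default
-- (the primary key):
ALTER TABLE issues REPLICA IDENTITY DEFAULT;
```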
One more issue discovered: a where clause on the publication can only reference columns covered by the replica identity, which means that if the where clause references anything other than a primary key column, we still need REPLICA IDENTITY FULL:
> A row filter expression (i.e., the WHERE clause) must contain only columns that are covered by the REPLICA IDENTITY, in order for UPDATE and DELETE operations to be published. For publication of INSERT operations, any column may be used in the WHERE expression.
See docs for more info
We can still configure it on a need-only-basis, but wanted to flag up that there's more cases where it is required.
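In SQL terms, with the default replica identity only key-only row filters work for tables that receive updates or deletes; Postgres rejects the offending UPDATE/DELETE at execution time rather than at publication creation (illustrative names):

```sql
-- Fine with REPLICA IDENTITY DEFAULT: the filter only touches the key.
CREATE PUBLICATION pub_ok FOR TABLE issues WHERE (id > 100);
-- Needs REPLICA IDENTITY FULL for updates/deletes to be publishable,
-- since project_id is not part of the default replica identity:
CREATE PUBLICATION pub_needs_full FOR TABLE issues WHERE (project_id = 5);
```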
We have discussed removing publication where clause filters, as we've seen that filtering there is actually slower than doing the filtering in Elixir (...maybe not in all cases, we'd have to check), but if we do that then we'd be fine keeping the default replica identity.
Can you describe the other issue you mentioned in our call?
The other issue with filtering columns, which requires setting REPLICA IDENTITY DEFAULT, is the following:
Suppose you have a table `foo (id, value1, value2, value3)`:
- Create a Shape A that selects columns `id,value1`
- Create a Shape B that selects columns `id,value2`
- At this point the publication would only replicate `id,value1,value2`, so `REPLICA IDENTITY` is set to `DEFAULT` and thus update statements only carry the old values of the primary key columns (i.e. old `id` values)
- An update is made to a row with `id=1` such that only `value2` is affected
- The update comes through, and we only get the full new value of the row with `id=1`, but we can't determine that only `value2` was touched by this update, since we do not get old values other than the primary key
- We have to send this update operation to both Shape A and Shape B, even though it should only have gone to Shape B, since Shape A does not care about `value2`.
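To make the scenario concrete, the publication ends up with the union of both shapes' column lists, while the default replica identity limits what the update record carries (names from the example above):

```sql
-- The publication carries the union of Shape A's and Shape B's columns:
ALTER PUBLICATION electric_publication SET TABLE foo (id, value1, value2);
-- With REPLICA IDENTITY DEFAULT, this update's WAL record contains the
-- new row plus only the old primary key (old id), not old value1/value2,
-- so we cannot tell that value1 was untouched:
UPDATE foo SET value2 = 'new' WHERE id = 1;
```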
Therefore, by using REPLICA IDENTITY DEFAULT we lose the ability to filter out changes for shapes with different selected column subsets, because we cannot determine whether update operations are relevant to a given shape.
The cost would scale with the number of shapes defined on a table with different column filters. Do we think there could be many? On the other hand, how much would we save in the general case by filtering out columns in the publication?
We're entering the territory of fine-grained optimization. It's not clear what is best for different workloads. I'm keen on having the possibility of filtering columns on the publication, but I wonder how and when best to expose this to developers, if it isn't one-size-fits-all.
@balegas we have code in place to handle this as well as fallbacks but decided against it as it is very limiting w.r.t. the replica identity. LMK if we want to keep this issue open or not.