Filter replication publication columns
Following https://github.com/electric-sql/electric/issues/1804 and corresponding PR https://github.com/electric-sql/electric/pull/1829
We want to alter the publication so that only the selected/required columns are sent over the replication stream: the columns selected via the column parameter, plus any columns referenced in the where clause, if one is specified.
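Concretely, on PG 15+ a publication entry can carry both a column list and a row filter, so the altered publication could look something like the following sketch (publication, table, and column names here are illustrative, not Electric's actual ones):

```sql
-- Replicate only the shape's selected columns, plus any column the
-- shape's where clause references (illustrative names; PG 15+ syntax).
ALTER PUBLICATION electric_publication
  SET TABLE issues (id, title, project_id) WHERE (project_id = 5);
```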
While PG does support partial replication of a table (at least for PG15 and above I think), there are some warnings in the docs:
> it is [...] possible to get into [a problematic] situation by adding or altering column lists on the publication side after a subscription has been created. This means changing the column lists of tables on publications that are already subscribed could lead to errors being thrown on the subscriber side.
This implies that even with a single publication and subscription, altering the column lists of tables might lead to errors - this needs to be investigated.
After discussing, we think the warning does not affect us, since we're impersonating the protocol and we only have one subscriber for one publication anyway.
Short update on issue that I ran into with this:
It seems that we cannot set REPLICA IDENTITY to FULL, which always sends the old column values of updated rows, if we have column filtering on the publication (see docs).
The error that I get if I do is:
```
** (Postgrex.Error) ERROR 42P10 (invalid_column_reference) cannot update table "wide_table"
Column list used by the publication does not cover the replica identity.
```
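A minimal reproduction sketch of the conflict, assuming a table shaped like the one in the error above:

```sql
-- FULL makes the replica identity cover every column, so any column
-- list that omits a column no longer "covers" it and updates fail.
CREATE TABLE wide_table (id bigint PRIMARY KEY, value1 text, value2 text);
ALTER TABLE wide_table REPLICA IDENTITY FULL;
CREATE PUBLICATION pub FOR TABLE wide_table (id, value1);
INSERT INTO wide_table VALUES (1, 'a', 'b');
UPDATE wide_table SET value1 = 'x' WHERE id = 1;
-- ERROR 42P10 (invalid_column_reference): cannot update table "wide_table"
```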
We don't need REPLICA IDENTITY FULL all the time, so where possible we should avoid setting it, so that we can keep filtering columns in the publication.
We need FULL when:
- the table doesn't have a primary key defined
- the developer requested a shape with the param `replica=full`
In case a developer requests a shape with `replica=full` and that shape is later deleted, we should ideally reset the REPLICA IDENTITY back to the default. Note that this is a sensitive operation because it might cause deadlocks.
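The two transitions would then be a pair of DDL statements; each takes an exclusive lock on the table, which is where the deadlock risk comes from (table name illustrative):

```sql
-- While at least one replica=full shape (or a table without a primary
-- key) needs it:
ALTER TABLE issues REPLICA IDENTITY FULL;
-- Once the last such shape is deleted, reset to the default
-- (the primary key):
ALTER TABLE issues REPLICA IDENTITY DEFAULT;
```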
One more issue discovered: a where clause on the publication can only reference columns covered by the replica identity, which means that if the where clause references anything other than a primary key column, we still need REPLICA IDENTITY FULL:
> A row filter expression (i.e., the WHERE clause) must contain only columns that are covered by the REPLICA IDENTITY, in order for UPDATE and DELETE operations to be published. For publication of INSERT operations, any column may be used in the WHERE expression.
See docs for more info
We can still configure it on a need-only-basis, but wanted to flag up that there's more cases where it is required.
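In SQL terms, with the default replica identity only key-only row filters work for tables that receive updates or deletes; Postgres rejects the offending UPDATE/DELETE at execution time rather than at publication creation (illustrative names):

```sql
-- Fine with REPLICA IDENTITY DEFAULT: the filter only touches the key.
CREATE PUBLICATION pub_ok FOR TABLE issues WHERE (id > 100);
-- Needs REPLICA IDENTITY FULL for updates/deletes to be publishable,
-- since project_id is not part of the default replica identity:
CREATE PUBLICATION pub_needs_full FOR TABLE issues WHERE (project_id = 5);
```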
We have discussed removing publication where clause filters, as we've seen that filtering there is actually slower than doing the filtering in Elixir (...maybe not in all cases, we'd have to check), but if we do that then we'd be fine keeping the default replica identity.
Can you describe the other issue you mentioned in our call?
The other issue with filtering columns, which requires setting REPLICA IDENTITY DEFAULT, is the following:
Suppose you have a table `foo (id, value1, value2, value3)`:
- Create a Shape A that selects columns `id,value1`
- Create a Shape B that selects columns `id,value2`
- At this point the publication would only replicate `id,value1,value2`, so `REPLICA IDENTITY` is set to `DEFAULT` and thus update statements only carry the old values of the primary key columns (i.e. old `id` values)
- An update is made to a row with `id=1` such that only `value2` is affected
- The update comes through, and we only get the full new value of the row with `id=1`, but we can't determine that only `value2` was touched by this update, since we do not get old values other than the primary key
- We have to send this update operation to both Shape A and Shape B, even though it should only have gone to Shape B, since Shape A does not care about `value2`.
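To make the scenario concrete, the publication ends up with the union of both shapes' column lists, while the default replica identity limits what the update record carries (names from the example above):

```sql
-- The publication carries the union of Shape A's and Shape B's columns:
ALTER PUBLICATION electric_publication SET TABLE foo (id, value1, value2);
-- With REPLICA IDENTITY DEFAULT, this update's WAL record contains the
-- new row plus only the old primary key (old id), not old value1/value2,
-- so we cannot tell that value1 was untouched:
UPDATE foo SET value2 = 'new' WHERE id = 1;
```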
Therefore, by using REPLICA IDENTITY DEFAULT we lose the ability to filter out changes for shapes with different selected column subsets, because we cannot determine whether update operations are relevant to a given shape.
The cost would scale with the number of shapes defined on a table with different column filters. Do we think there could be many? On the other hand, how much would we save in the general case by filtering out columns in the publication?
We're entering the territory of fine-grained optimization. It's not clear what is best for different workloads. I'm keen on having the possibility of filtering columns on the publication, but I wonder how and when best to expose this to developers, if it isn't one-size-fits-all.
@balegas we have code in place to handle this as well as fallbacks but decided against it as it is very limiting w.r.t. the replica identity. LMK if we want to keep this issue open or not.