osmdbt and pgoutput
I'm moving my comment to a new issue, as requested by @joto
Since compiling and deploying a custom plugin seemed a bit cumbersome, I've been exploring the option to use pgoutput, a fast binary format-based plugin that's built into Postgresql.
pgoutput is the standard logical decoding output plug-in in PostgreSQL 10+. It is maintained by the PostgreSQL community, and used by PostgreSQL itself for logical replication. This plug-in is always present so no additional libraries need to be installed.[^1]
osmdbt-pgoutput interprets the raw replication event stream directly and translates it into the same text representation like the osm-logical plugin today. Most of what osm-logical plugin has been doing before has moved to osmdbt-get-log.cpp and pgoutput.[ch]pp. All command line tools should work like before. Configuration wise, a new database parameter publication was added to the osmdbt.yaml file.
Maybe in a long term, this approach could simplify our setup, or make it easier to use osmdbt in cloud environments with limited options for deploying custom plugins.
Link: https://github.com/mmd-osm/osmdbt-pgoutput
[^1]: Quoting https://debezium.io/documentation/reference/stable/connectors/postgresql.html
I couldn't find any real documentation on the pgoutput plugin. I am all for using something that's already there instead our own implementation, but is it intended as something that "the public" can use or as something internal to PostgrSQL? We don't want to switch and then they change their internal representation or something and our code breaks?
Debezium seems to be one of the more prominent external consumers interfacing directly with pgoutput. This is matching our use case, with Apache Kafka as a destination, rather than some text files.
The binary format itself is documented on the postgresql.org page: Logical Replication Message Formats
Since pgoutput typically supports multiple versions of its binary protocol, clients can explicitly request one particular version when connecting to the database. In the case of osmdbt-pgoutput that's version 1. As long as future Postgresql versions still support this version, we're good.
I've noticed some minor differences across different Postgresql versions, such as omitting an empty BEGIN / COMMIT pair. From a functional point of view, this has no impact. However, some unit tests that are relying on number of rows might see different results here. I've already considered this point in the test cases.
Some links on how different projects are using pgoutput:
-
Debezium - replicate to Apache Kafka
- Logical Decoding Output Plug-in Installation for PostgreSQL: the connector supports PostgreSQL 10+ logical replication streaming using pgoutput. This means that a logical decoding output plug-in is no longer necessary
- Deprecate wal2json support : All users should move to the pgoutput or decoderbufs plug-ins.
- Postgresql pgsql-bugs Mailing list discussion on pgoutput issue : Then consume the events using any client, for instance Debezium, or a bespoke consumer application.
- Npsql .NET client: Logical and Physical Replication : The modern, recommended way to perform logical replication was introduced in PostgreSQL 10 - see the PostgreSQL documentation. This method, using the built-in pgoutput replication plugin, streams efficient, binary messages to represent database updates such as INSERT, UPDATE and DELETE
- Node JS client to consume pgoutput binary format: https://www.npmjs.com/package/pg-logical-replication
There's probably much more out there. If I find more interesting links, I will add them to the list.