kafka-connect-jdbc icon indicating copy to clipboard operation
kafka-connect-jdbc copied to clipboard

Replacing \u0000 with nothing for PostgreSqlDatabaseDialect

Open oleksandrkovalenko opened this issue 3 years ago • 7 comments

Problem

\u0000 UFT character is not supported by Postgres text field. An attempt to insert fails with error:

org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00`

Solution

We are replacing \u0000 character with nothing

Does this solution apply anywhere else?
  • [ ] yes
  • [x] no
If yes, where?

Test Strategy

Testing done:
  • [x] Unit tests
  • [ ] Integration tests
  • [ ] System tests
  • [ ] Manual tests

Release Plan

oleksandrkovalenko avatar Nov 04 '22 13:11 oleksandrkovalenko

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Nov 04 '22 13:11 CLAassistant

How are your producers adding that byte sequence in the first place? Wouldn't it make more sense to fix your strings at the producer?

OneCricketeer avatar Dec 22 '22 02:12 OneCricketeer

@OneCricketeer In our case, our producer is a Kafka Connector which reads data from MySQL database. Null character \u0000 is supported by MySQL. In any case, we need to change one of the connectors to make this thing work.

oleksandr-blacklane avatar Jan 23 '23 09:01 oleksandr-blacklane

How are your producers adding that byte sequence in the first place? Wouldn't it make more sense to fix your strings at the producer?

Hi @OneCricketeer

According to: https://stackoverflow.com/a/1348551, 0x00 is never a supported character and MySQL produces it because it is a supported char.

PostgreSQL doesn't support storing NULL (\0x00) characters in text fields (this is obviously different from the database NULL value, which is fully supported).

Source: http://www.postgresql.org/docs/9.1/static/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

Often time we don't control the source, so it needs to be done at the consumer. At the same way, we don't even control the tables (100s of tables), and creating a list of all of the text "fields" would be hardly doable.

@oleksandrkovalenko, maybe we could put this behind a flag so we can ensure no behavior change? @OneCricketeer would you agree with the PR if the change was behind a flag?

FreCap avatar Mar 26 '23 14:03 FreCap

If replacing with nothing the correct thing to do? We've elected to replace with the "Replace Character" (\ufffd) via an SMT for the moment. Or is there a good reason to not do this?

roadSurfer avatar Apr 27 '23 14:04 roadSurfer

Stumbled across the same issue. As @FreCap mentioned NULL (\0x00) is not a valid character for postgres. If I'm trying to insert this into as text column it's failing today: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00

As the suggested change will replace this \0x00 with an empty string is technically not the same. The suggested PostgreSQL solution would be to change the column definition from text to bytea to store hex values. For my personal usecase this is not an option. I'd benefit from opt-in replacement from \0x00 to empty string.

As the change is limited to the PostgreSqlDatabaseDialect the expected behavior change would be that if the opt-int option is taken then there is no more error when processing a message containing a \0x00 value for a text field. This would keep the connector up and running. That would be great :)

christianfeurer avatar May 22 '23 15:05 christianfeurer

@FreCap Would you be able to share an example how can I put this change behind a flag? Is it make sense to push this change forward. In our case we fixed those characters in the MySQL database. I believe that could not be possible for all use cases.

oleksandrkovalenko avatar Oct 04 '24 18:10 oleksandrkovalenko