babelfish_extensions icon indicating copy to clipboard operation
babelfish_extensions copied to clipboard

[Bug]: T-SQL null byte (0x00) support in strings not compatible with Babelfish (PostgreSQL limitation)

Open bilalkah opened this issue 4 months ago • 1 comments

What happened?

In T-SQL it is valid to store and query strings containing the null byte (0x00). However, in Babelfish for PostgreSQL, attempts to insert or handle such values either fail or produce inconsistent results because PostgreSQL does not support embedded NULs in text types.

This creates a compatibility gap: applications or replication pipelines that move existing T-SQL data into Babelfish will encounter problems if the data contains null bytes. This is particularly relevant because many client databases are likely to have such data.

Discussion / Questions

This seems to come from a PostgreSQL restriction, since it doesn’t allow embedded null bytes in text types. On the other hand, T-SQL does allow them, so applications that rely on this behavior or databases that already contain null bytes will run into problems when moved to Babelfish.

A few things I’d like to ask:

  • Do you see this being addressed in Babelfish in the future (for example with some kind of escape handling or different storage approach)?
  • Or should this be treated as a limitation that users need to work around on their own?
  • If it’s a limitation, would it make sense to call it out clearly in the compatibility documentation?

Repro Steps (JDBC example)

try (Statement st = conn.createStatement()) {
    st.execute("IF OBJECT_ID('dbo.test1','U') IS NOT NULL DROP TABLE dbo.test1");
    st.execute("""
        CREATE TABLE dbo.test1(
            id  INT PRIMARY KEY,
            val CHAR(5) NULL
        )""");
}

// A) Literal hex 0x00 cast to CHAR(1)
try (PreparedStatement ps =
         conn.prepareStatement("INSERT INTO dbo.test1(id,val) VALUES (?, CAST(0x00 AS CHAR(1)))")) {
    ps.setInt(1, 1);
    ps.executeUpdate();
    System.out.println("A) succeeded");
} catch (SQLException e) {
    System.out.println("A) expected failure: " + e.getMessage());
}

// B) Param as VARBINARY then cast to CHAR(1) on server
try (PreparedStatement ps =
         conn.prepareStatement("INSERT INTO dbo.test1(id,val) VALUES (?, CAST(? AS CHAR(1)))")) {
    ps.setInt(1, 2);
    ps.setBytes(2, new byte[]{0x00});
    ps.executeUpdate();
    System.out.println("B) succeeded");
} catch (SQLException e) {
    System.out.println("B) expected failure: " + e.getMessage());
}

// C) Param as NVARCHAR containing Unicode NUL
try (PreparedStatement ps =
         conn.prepareStatement("INSERT INTO dbo.test1(id,val) VALUES (?, ?)")) {
    ps.setInt(1, 3);
    ps.setString(2, "\u0000");
    ps.executeUpdate();
    System.out.println("C) succeeded");
} catch (SQLException e) {
    System.out.println("C) expected failure: " + e.getMessage());
}

// D) Bytes directly into CHAR(5)
try (PreparedStatement ps =
         conn.prepareStatement("INSERT INTO dbo.test1(id,val) VALUES (?, ?)")) {
    ps.setInt(1, 4);
    ps.setBytes(2, new byte[]{0x00});
    ps.executeUpdate();
    System.out.println("D) succeeded");
} catch (SQLException e) {
    System.out.println("D) expected failure: " + e.getMessage());
}

// E) Try to push UTF-8 bytes (including 0x00) into CHAR(5)
try (PreparedStatement ps =
         conn.prepareStatement("INSERT INTO dbo.test1(id,val) VALUES (?, ?)")) {
    String s = "\u0000";
    byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    ps.setInt(1, 5);
    ps.setBytes(2, utf8);
    ps.executeUpdate();
    System.out.println("E) succeeded");
} catch (SQLException e) {
    System.out.println("E) expected failure: " + e.getMessage());
}

// Show whatever got in (likely nothing)
try (Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery("SELECT id, val FROM dbo.test1 ORDER BY id")) {
    while (rs.next()) {
        int id = rs.getInt(1);
        String v = rs.getString(2); // may be null
        System.out.println("row: id=" + id + ", val='" + v + "', len=" + (v == null ? 0 : v.length()));
    }
}

Babelfish output

A) succeeded
B) succeeded
C) expected failure: invalid UTF16 byte sequence - code point 0 not supported
D) succeeded
E) succeeded

row: id=1, val='0    ', len=5
row: id=2, val='0    ', len=5
row: id=4, val='0x00 ', len=5
row: id=5, val='0x00 ', len=5

T-SQL output

A) succeeded
B) succeeded
C) succeeded
D) succeeded
E) succeeded

row: id=1, val='    ', len=5
row: id=2, val='    ', len=5
row: id=3, val='    ', len=5
row: id=4, val='    ', len=5
row: id=5, val='    ', len=5

Version

BABEL_5_X_DEV (Default)

Extension

babelfishpg_tds

Which flavor of Linux are you using when you see the bug?

No response

Relevant log output


Code of Conduct

  • [x] I agree to follow this project's Code of Conduct.

bilalkah avatar Aug 28 '25 20:08 bilalkah

Let me add the piece of code which makes that sanity check. file: babelfish_extensions\contrib\babelfishpg_tds\src\backend\tds\tdsutils.c function name: static inline int32_t GetUTF16CodePoint(const unsigned char *in, int len, int *consumed)

	code1 = in[1] << 8 | in[0];
	if (code1 < 0xD800 || code1 >= 0xE000)
	{
		/*
		 * This is a single 16 bit code point, which is equal to code1.
		 * PostgreSQL does not support NUL bytes in character data as it
		 * internally needs the ability to convert any datum to a NUL
		 * terminated C-string without explicit length information.
		 */
		if (code1 == 0)
			ereport(ERROR,
					(errcode(ERRCODE_DATA_EXCEPTION),
					 errmsg("invalid UTF16 byte sequence - "
							"code point 0 not supported")));
		if (consumed)
			*consumed = 2;
		return (int32_t) code1;
	}

bilalkah avatar Aug 29 '25 11:08 bilalkah