soci PostgreSQL bytea support

Hello,

First off, thank you for this project. So far I have found it to be extremely useful.

I'm using:

SOCI 3.2.3
PostgreSQL 9.6
Microsoft Visual Studio 2010
Windows 10

I am attempting to store a binary file in a PostgreSQL database using the bytea type. I'm stuck because the following query is fine via pgAdmin:

INSERT INTO bytea_test ( contents ) values ( decode( '‡ÓÀ¤Jb#á‡ ', 'escape' ) );

This is fine: Query returned successfully: one row affected

Here is what I have using SOCI:

sql << "INSERT INTO bytea_test ( contents ) values ( decode( '‡ÓÀ¤Jb#á‡ ', 'escape' ) )";

However, this causes an error: Cannot execute query. ERROR: invalid byte sequence for encoding "UTF8": 0x87

This error comes from PostgreSQL but the "same" query works from pgAdmin. I have tried a number of different things, but I keep coming back to this error. Could it be caused by a difference in encoding for pgAdmin vs Visual Studio? Does the encoding even matter for the bytea type?

Sep 01 '16 14:09 dboone

Quick response: perhaps it will help if look at the bytea test as an example: https://github.com/SOCI/soci/blob/master/tests/postgresql/test-postgresql.cpp#L544-L589

Sep 02 '16 09:09 mloskot

Thanks for your reply.

The example using 0x0A0B0C0D works. However, try something like 0xFAFBFCFD and the same error is created: Cannot execute query. ERROR: invalid byte sequence for encoding "UTF8".

This query is fine in pgAdmin:

INSERT INTO bytea_test ( contents ) VALUES ( 'ýüûú' )

Sep 02 '16 15:09 dboone

Any additional ideas regarding this issue? I dug through the Postgres logs and found this:

SOCI

...
2016-09-07 12:51:30 EDT STATEMENT:  BEGIN
2016-09-07 12:51:30 EDT DEBUG:  00000: CommitTransactionCommand
2016-09-07 12:51:30 EDT LOCATION:  finish_xact_command, postgres.c:2457
2016-09-07 12:51:30 EDT STATEMENT:  BEGIN
2016-09-07 12:51:32 EDT ERROR:  22021: invalid byte sequence for encoding "UTF8": 0x87
2016-09-07 12:51:32 EDT LOCATION:  report_invalid_encoding, wchar.c:2017
...

pgAdmin

...
2016-09-07 12:58:46 EDT STATEMENT:  INSERT INTO bytea_test ( contents ) values ( decode('‡ÓÀ¤Jb#á‡ ', 'escape') );
2016-09-07 12:58:46 EDT DEBUG:  00000: CommitTransactionCommand
2016-09-07 12:58:46 EDT LOCATION:  finish_xact_command, postgres.c:2457
2016-09-07 12:58:46 EDT STATEMENT:  INSERT INTO bytea_test ( contents ) values ( decode('‡ÓÀ¤Jb#á‡ ', 'escape') );
2016-09-07 12:58:46 EDT DEBUG:  00000: CommitTransaction
2016-09-07 12:58:46 EDT LOCATION:  ShowTransactionState, xact.c:4953
...

Sep 07 '16 17:09 dboone

This might not be a fix but would using a parameter instead of a literal query be a workaround?

Sep 07 '16 23:09 vadz

Thanks for the suggestion.

The parameter and literal produce the same error.

I have tried all sorts of things. I stepped through the SOCI source several times. When I used a parameter, I noticed that some exchange functions were used to insert the parameter into the query. Those exchange functions had type information about what was being exchanged. In this case, an x_stdstring. I thought maybe Postgres was incorrectly handling that string as a regular string (UTF8 in my case). However, using a literal has the same effect. Casting the value to a bytea doesn't help either.

Is there some escape function I'm missing? I tried all sorts of combinations of convert_to and decode. Or is there a fundamental difference between how pgAdmin queries are handled vs. SOCI. The logs seem to indicate some difference, but I don't know that it is significant.

Sep 08 '16 00:09 dboone

Additionally, I had trouble getting the Postgres logs to show the erroneous query. I feel like that might shed some light on the issue.

Sep 08 '16 00:09 dboone

I guess (but it's just a guess) that if PostgreSQL server expects all its input to be UTF-8 encoded, this query might not even reach the layer doing the logging...

It's really weird that it should apply to the parameters however. Surely there must be a way to pass arbitrary binary data?

Sep 08 '16 01:09 vadz

Surely there must be a way to pass arbitrary binary data?

My thoughts exactly! I still feel like I'm missing something. It's strange that my example query above works using pgAdmin and not using SOCI.

Sep 08 '16 01:09 dboone

Here is my new SOCI query. In UTF8, E280A1 is hex for the double dagger symbol ‡.

sql << "INSERT INTO bytea_test ( contents ) values ( '\\xE280A1' )";

And from pgAdmin:

SELECT convert_from( contents, 'UTF8' ) FROM bytea_test

Returns ‡.

Progress?

Do I really have to convert all of the characters in a file (or other binary string) to UTF8 octets? By hand? Surely there is a better way.

Sep 08 '16 15:09 dboone

I was looking at automatic bytea support a while ago, but gave up. The problem was the current way prepared statements are handled as switching to bytea encoding for std::string if strlen(s.data) != s.size() can't work. For now, I have settled to encoding strings to hex manually and decoding them on the server side via SQL. The inverse problem is somewhat simpler, I am using code like the following for the decoding as needed:

static const unsigned char hex_convert[256] = {
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 0,   1,   2,   3,   4,   5,   6,   7,   8,   9,   255, 255,
    255, 255, 255, 255, 255, 10,  11,  12,  13,  14,  15,  255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 10,  11,  12,  13,  14,  15,  255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255,

    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255,
};

void unescape_bytea(std::string &str) {
  if (str.empty())
    return;
  const unsigned char *buf =
      reinterpret_cast<const unsigned char *>(str.c_str());
  auto outpos(str.begin());
  if (buf[0] == '\\' && buf[1] == 'x') {
    buf += 2;
    while (*buf) {
      if (*buf == ' ') {
        ++buf;
        continue;
      }
      unsigned char d1 = hex_convert[buf[0]];
      unsigned char d2 = hex_convert[buf[1]];
      if (d1 == 255 || d2 == 255)
        throw soci::soci_error("Invalid bytea encoding found.");
      *outpos++ = char(d1 * 16 + d2);
      buf += 2;
    }
    str.resize(outpos - str.begin());
    return;
  }
  while (*buf) {
    if (*buf != '\\') {
      *outpos++ = char(*buf++);
      continue;
    }
    ++buf;
    if (*buf == '\\') {
      *outpos++ = char(*buf++);
      continue;
    }
    if (buf[0] < '0' || buf[0] > '3' || buf[1] < '0' || buf[1] > '7' ||
        buf[2] < '0' || buf[2] > '7')
      throw soci::soci_error("Invalid bytea encoding found.");
    *outpos++ = char(buf[0] * 64 + buf[1] * 8 + buf[2]);
    buf += 3;
  }
}

Sep 08 '16 15:09 jsonn