PostgreSQL bytea support
Hello,
First off, thank you for this project. So far I have found it to be extremely useful.
I'm using:
- SOCI 3.2.3
- PostgreSQL 9.6
- Microsoft Visual Studio 2010
- Windows 10
I am attempting to store a binary file in a PostgreSQL database using the bytea type. I'm stuck because the following query is fine via pgAdmin:
INSERT INTO bytea_test ( contents ) values ( decode( '‡ÓÀ¤Jb#ᇠ', 'escape' ) );
This is fine: Query returned successfully: one row affected
Here is what I have using SOCI:
sql << "INSERT INTO bytea_test ( contents ) values ( decode( '‡ÓÀ¤Jb#ᇠ', 'escape' ) )";
However, this causes an error: Cannot execute query. ERROR: invalid byte sequence for encoding "UTF8": 0x87
This error comes from PostgreSQL but the "same" query works from pgAdmin. I have tried a number of different things, but I keep coming back to this error. Could it be caused by a difference in encoding for pgAdmin vs Visual Studio? Does the encoding even matter for the bytea type?
Quick response: perhaps it will help if look at the bytea test as an example:
https://github.com/SOCI/soci/blob/master/tests/postgresql/test-postgresql.cpp#L544-L589
Thanks for your reply.
The example using 0x0A0B0C0D works. However, try something like 0xFAFBFCFD and the same error is created: Cannot execute query. ERROR: invalid byte sequence for encoding "UTF8".
This query is fine in pgAdmin:
INSERT INTO bytea_test ( contents ) VALUES ( 'ýüûú' )
Any additional ideas regarding this issue? I dug through the Postgres logs and found this:
SOCI
...
2016-09-07 12:51:30 EDT STATEMENT: BEGIN
2016-09-07 12:51:30 EDT DEBUG: 00000: CommitTransactionCommand
2016-09-07 12:51:30 EDT LOCATION: finish_xact_command, postgres.c:2457
2016-09-07 12:51:30 EDT STATEMENT: BEGIN
2016-09-07 12:51:32 EDT ERROR: 22021: invalid byte sequence for encoding "UTF8": 0x87
2016-09-07 12:51:32 EDT LOCATION: report_invalid_encoding, wchar.c:2017
...
pgAdmin
...
2016-09-07 12:58:46 EDT STATEMENT: INSERT INTO bytea_test ( contents ) values ( decode('‡ÓÀ¤Jb#ᇠ', 'escape') );
2016-09-07 12:58:46 EDT DEBUG: 00000: CommitTransactionCommand
2016-09-07 12:58:46 EDT LOCATION: finish_xact_command, postgres.c:2457
2016-09-07 12:58:46 EDT STATEMENT: INSERT INTO bytea_test ( contents ) values ( decode('‡ÓÀ¤Jb#ᇠ', 'escape') );
2016-09-07 12:58:46 EDT DEBUG: 00000: CommitTransaction
2016-09-07 12:58:46 EDT LOCATION: ShowTransactionState, xact.c:4953
...
This might not be a fix but would using a parameter instead of a literal query be a workaround?
Thanks for the suggestion.
The parameter and literal produce the same error.
I have tried all sorts of things. I stepped through the SOCI source several times. When I used a parameter, I noticed that some exchange functions were used to insert the parameter into the query. Those exchange functions had type information about what was being exchanged. In this case, an x_stdstring. I thought maybe Postgres was incorrectly handling that string as a regular string (UTF8 in my case). However, using a literal has the same effect. Casting the value to a bytea doesn't help either.
Is there some escape function I'm missing? I tried all sorts of combinations of convert_to and decode. Or is there a fundamental difference between how pgAdmin queries are handled vs. SOCI. The logs seem to indicate some difference, but I don't know that it is significant.
Additionally, I had trouble getting the Postgres logs to show the erroneous query. I feel like that might shed some light on the issue.
I guess (but it's just a guess) that if PostgreSQL server expects all its input to be UTF-8 encoded, this query might not even reach the layer doing the logging...
It's really weird that it should apply to the parameters however. Surely there must be a way to pass arbitrary binary data?
Surely there must be a way to pass arbitrary binary data?
My thoughts exactly! I still feel like I'm missing something. It's strange that my example query above works using pgAdmin and not using SOCI.
Here is my new SOCI query. In UTF8, E280A1 is hex for the double dagger symbol ‡.
sql << "INSERT INTO bytea_test ( contents ) values ( '\\xE280A1' )";
And from pgAdmin:
SELECT convert_from( contents, 'UTF8' ) FROM bytea_test
Returns ‡.
Progress?
Do I really have to convert all of the characters in a file (or other binary string) to UTF8 octets? By hand? Surely there is a better way.
I was looking at automatic bytea support a while ago, but gave up. The problem was the current way prepared statements are handled as switching to bytea encoding for std::string if strlen(s.data) != s.size() can't work. For now, I have settled to encoding strings to hex manually and decoding them on the server side via SQL. The inverse problem is somewhat simpler, I am using code like the following for the decoding as needed:
static const unsigned char hex_convert[256] = {
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 255, 255,
255, 255, 255, 255, 255, 10, 11, 12, 13, 14, 15, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 10, 11, 12, 13, 14, 15, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255,
};
void unescape_bytea(std::string &str) {
if (str.empty())
return;
const unsigned char *buf =
reinterpret_cast<const unsigned char *>(str.c_str());
auto outpos(str.begin());
if (buf[0] == '\\' && buf[1] == 'x') {
buf += 2;
while (*buf) {
if (*buf == ' ') {
++buf;
continue;
}
unsigned char d1 = hex_convert[buf[0]];
unsigned char d2 = hex_convert[buf[1]];
if (d1 == 255 || d2 == 255)
throw soci::soci_error("Invalid bytea encoding found.");
*outpos++ = char(d1 * 16 + d2);
buf += 2;
}
str.resize(outpos - str.begin());
return;
}
while (*buf) {
if (*buf != '\\') {
*outpos++ = char(*buf++);
continue;
}
++buf;
if (*buf == '\\') {
*outpos++ = char(*buf++);
continue;
}
if (buf[0] < '0' || buf[0] > '3' || buf[1] < '0' || buf[1] > '7' ||
buf[2] < '0' || buf[2] > '7')
throw soci::soci_error("Invalid bytea encoding found.");
*outpos++ = char(buf[0] * 64 + buf[1] * 8 + buf[2]);
buf += 3;
}
}