libpqxx icon indicating copy to clipboard operation
libpqxx copied to clipboard

Encoding problems with bytea

Open Broollie opened this issue 9 months ago • 23 comments

Hello, I encountered an interesting issue. I have a table with column of type bytea. Data in this column is of fixed size so in order to ommit manual runtime size checks I implemented string_traits for std::array<std::byte, some size> to use this type directly in prepared statements. And it does work.. until it doesn't. Depending on contents, query may result in this error: "ERROR: invalid byte sequence for encoding "UTF8": 0xd6 0x5f CONTEXT: unnamed portal parameter $1" (0xd6, 0x5f just for an instance). And i m not sure what causes it, but apparently this error comes from libpq call, but I cannot be sure about this. Here is code that represents my case:

#include <pqxx/pqxx>
#include <array>
#include <iostream>
#include <format>

inline constexpr std::size_t DataSize = 8;

using Data = std::array<std::byte, DataSize>;

namespace pqxx
{
    template <>
    struct string_traits<Data>
    {
        // static constexpr bool converts_to_string{true};
        static constexpr bool converts_from_string{true};

        static constexpr auto DataSize = internal::size_esc_bin(std::tuple_size_v<Data>);

        static Data from_string(std::string_view text)
        {
            if (text.size() != DataSize)
            {
                throw pqxx::conversion_error(std::format("invalid data size: {}. Expected: {}", text.size(), DataSize));
            }

            Data buf;
            pqxx::internal::unesc_bin(text, reinterpret_cast<std::byte *>(buf.data()));
            return buf;
        }

        static zview to_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("to_buf: unable to fit data into buffer with size {}", end - begin));
            }

            return generic_to_buf(begin, end, value);
        }

        static char *into_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("into_buf: unable to fit data into buffer with size {}", end - begin));
            }

            for (auto i = 0; i < value.size(); i++)
            {
                begin[i] = static_cast<char>(value.at(i));
            }
            begin[value.size()] = '\0';

            return begin + DataSize;
        }

        static size_t size_buffer(Data const &value) noexcept
        {
            return DataSize;
        }
    };
}

namespace std
{
    template <>
    struct formatter<Data> : formatter<string>
    {
        auto format(const Data &data, format_context &ctx) const
        {
            int tmpData[DataSize];
            std::transform(data.begin(), data.end(), tmpData, [](auto elem)
                           { return static_cast<int>(elem); });
            return formatter<string>::format(
                std::format("\\x{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}",
                            tmpData[0], tmpData[1], tmpData[2], tmpData[3],
                            tmpData[4], tmpData[5], tmpData[6], tmpData[7]),
                ctx);
        }
    };
}

inline constexpr std::byte operator""_bt(unsigned long long c)
{
    return static_cast<std::byte>(c);
}

auto main(int argc, const char *argv[]) -> int
{
    static constexpr std::string_view ConnectionQuerySchema = "host={} port={} dbname={} user={} password={}";
    static constexpr std::string_view InsertionSchema = "INSERT INTO test_table VALUES ('{}')";

    static constexpr Data okData{0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt};
    static constexpr Data badData{0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt};

    try
    {
        pqxx::connection con(std::format(ConnectionQuerySchema, argv[1], argv[2], argv[3], argv[4], argv[5]));
        con.prepare("insert", "INSERT INTO test_table VALUES ($1)");

        pqxx::nontransaction nt(con);
        nt.exec0("CREATE TABLE IF NOT EXISTS test_table(data bytea NOT NULL)");

        std::cout << "Ok: " << std::boolalpha << nt.exec_prepared0("insert", okData).empty() << std::endl;
        const auto query = std::format(InsertionSchema, badData);
        std::cout << "Query: " << query << " Also ok: " << nt.exec0(query).empty() << std::endl;
        std::cout << "Bad: " << nt.exec_prepared0("insert", badData).empty() << std::endl;
    }
    catch (const std::exception &e)
    {
        std::cerr << e.what() << '\n';
    }
} 

Compiled as follows: g++-13 -std=c++20 test.cpp -o test -lpqxx -lpq

Broollie avatar May 06 '24 14:05 Broollie

When you send prepared statement parameters with libpq, all parameters are converted once into strings. If you only write $1 in the SQL, PostgreSQL will treat it as a string and will try to interpret it as client encoding UTF-8, which will fail.

Try replacing $1 with $1::bytea.

tt4g avatar May 06 '24 21:05 tt4g

When you send prepared statement parameters with libpq, all parameters are converted once into strings. If you only write $1 in the SQL, PostgreSQL will treat it as a string and will try to interpret it as client encoding UTF-8, which will fail.

Try replacing $1 with $1::bytea.

Thank you for your suggestion, but I tried adding ::bytea earlier and it didn't work. But thanks to your explanation I ve got a soluction that does work. By passing input into pqxx::internal::esc_bin it will produce escaped string and will work correctly afterwards.

I would guess, that it also should not be too expensive by calling it this way: pqxx::internal::esc_bin({badData.begin(), badData.end()}) as it will create a view for an input argument. But I wonder, if there is a better solution.

Anyways, i will also include my solution there in case somebody will find it useful:

#include <pqxx/pqxx>
#include <iostream>
#include <format>

inline constexpr std::size_t DataSize = 8;

using Data = std::array<std::byte, DataSize>;

namespace pqxx
{
    template <>
    struct string_traits<Data>
    {
        // static constexpr bool converts_to_string{true};
        static constexpr bool converts_from_string{true};

        static constexpr auto DataSize = internal::size_esc_bin(std::tuple_size_v<Data>);

        static Data from_string(std::string_view text)
        {
            if (text.size() + 1 != DataSize) //size_esc_bin adds 1 byte for \0 but std::string_view::size doesn't include it
            {
                throw pqxx::conversion_error(std::format("invalid data size: {}. Expected: {}", text.size(), DataSize));
            }

            Data buf;
            pqxx::internal::unesc_bin(text, reinterpret_cast<std::byte *>(buf.data()));
            return buf;
        }

        static zview to_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("to_buf: unable to fit data into buffer with size {}", end - begin));
            }

            return generic_to_buf(begin, end, value);
        }

        static char *into_buf(char *begin, char *end, Data const &value)
        {
            if (end - begin < DataSize)
            {
                throw pqxx::conversion_overrun(std::format("into_buf: unable to fit data into buffer with size {}", end - begin));
            }

            for (auto i = 0; i < value.size(); i++)
            {
                begin[i] = static_cast<char>(value.at(i));
            }
            begin[value.size()] = '\0';

            return begin + DataSize;
        }

        static size_t size_buffer(Data const &value) noexcept
        {
            return DataSize;
        }
    };
}

namespace std
{
    template <>
    struct formatter<Data> : formatter<string>
    {
        auto format(const Data &data, format_context &ctx) const
        {
            int tmpData[DataSize];
            std::transform(data.begin(), data.end(), tmpData, [](auto elem)
                           { return static_cast<int>(elem); });
            return formatter<string>::format(
                std::format("{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}{:02x}",
                            tmpData[0], tmpData[1], tmpData[2], tmpData[3],
                            tmpData[4], tmpData[5], tmpData[6], tmpData[7]),
                ctx);
        }
    };
}

inline constexpr std::byte operator""_bt(unsigned long long c)
{
    return static_cast<std::byte>(c);
}

auto main(int argc, const char *argv[]) -> int
{
    static constexpr std::string_view ConnectionQuerySchema = "host={} port={} dbname={} user={} password={}";
    static constexpr std::string_view InsertionSchema = "INSERT INTO test_table VALUES ('\\x{}')";

    static constexpr Data okData{0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt, 0x01_bt};
    static constexpr Data badData{0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt, 0xd6_bt, 0x5f_bt};

    try
    {
        pqxx::connection con(std::format(ConnectionQuerySchema, argv[1], argv[2], argv[3], argv[4], argv[5]));
        con.prepare("insert", "INSERT INTO test_table VALUES ($1)");
        con.prepare("view", "SELECT * FROM test_table");

        pqxx::nontransaction nt(con);
        nt.exec0("CREATE TABLE IF NOT EXISTS test_table(data bytea NOT NULL)");

        std::cout << "Ok: " << std::boolalpha << nt.exec_prepared0("insert", okData).empty() << std::endl;
        const auto query = std::format(InsertionSchema, badData);
        std::cout << "Query: " << query << " Also ok: " << nt.exec0(query).empty() << std::endl;
        std::cout << "Bad: " << nt.exec_prepared0("insert", pqxx::internal::esc_bin({badData.begin(), badData.end()})).empty() << std::endl;

        const auto result = nt.exec_prepared("view");

        for(auto row : result)
        {
            auto [data] = std::move(row.as<Data>());
            std::cout << std::format("{}", data) << std::endl; 
        }
    }
    catch (const std::exception &e)
    {
        std::cerr << e.what() << '\n';
    }
}

Broollie avatar May 07 '24 10:05 Broollie

Glad It's solved. I should have noticed earlier that unesc_bin() was being called and not esc_bin().

tt4g avatar May 07 '24 13:05 tt4g

Oops, and I'm only just trying to catch up! Thanks once again @tt4g for jumping in.

@Broollie I did notice this one little thing:

        static Data from_string(std::string_view text)
        {
            if (text.size() != DataSize)
            {
                throw pqxx::conversion_error(std::format("invalid data size: {}. Expected: {}", text.size(), DataSize));
            }
            // ...
        }

It looks to me here as if you're using DataSize for two very different quantities: on the one hand, it's the number of binary bytes in a Data value, but on the other, it's also the number of text bytes you expect in the SQL representation of a Data value. Those are two very different things!

jtv avatar May 07 '24 18:05 jtv

(I took the liberty to edit the code snippets you posted above: when you add "cxx" after the three backticks at the beginning, you get C++ syntax highlighting. :-)

jtv avatar May 07 '24 18:05 jtv

And I must admit that the mechanism by which libpq figures out the types of the parameters is mysterious to me as well.

By the way if you're working in C++20 @Broollie, do you actually need your own string_traits implementation? I think Data satisfies the binary concept defined in include/pqxx/strconv.hxx so you should have a conversion out of the box. ISTM you should be able to just use that, or if you want the size check in every conversion, then you could at least delegate to it.

Generally it's not a good idea to rely on items from the pqxx::internal:: namespace, because I reserve the right to change those on a whim. In particular, libpqxx 8 will require C++20 or better, and I'm hoping to replace a lot of these functions with ones that use std::span and what have you.

jtv avatar May 07 '24 18:05 jtv

Oops, and I'm only just trying to catch up! Thanks once again @tt4g for jumping in.

It looks to me here as if you're using DataSize for two very different quantities: on the one hand, it's the number of binary bytes in a Data value, but on the other, it's also the number of text bytes you expect in the SQL representation of a Data value. Those are two very different things!

Hey, it s ok, i am myself always late :D Regarding DataSize, ye, that was my mistake on naming constants, didn't notice that I already have DataSize. It should be more like this

// ...
inline constexpr std::size_t DataSize = 8;
struct string_traits<Data>
{
    static constexpr auto EscDataSize = internal::size_esc_bin(std::tuple_size_v<Data>);
    // ...

By the way if you're working in C++20 @Broollie, do you actually need your own string_traits implementation? I think Data satisfies the binary concept defined in include/pqxx/strconv.hxx so you should have a conversion out of the box. ISTM you should be able to just use that, or if you want the size check in every conversion, then you could at least delegate to it.

That s actually what made me attempt to do it in the first place, because std::array satisfies std::ranges::contiguous_range and so std::array<std::byte, N> should be usable for this. But it does require to define string_traits, specifically because it cannot resolve string_traits<TYPE>::from_string(text):

In file included from /home/linuxbrew/.linuxbrew/include/pqxx/internal/concat.hxx:7,
                 from /home/linuxbrew/.linuxbrew/include/pqxx/connection.hxx:36,
                 from /home/linuxbrew/.linuxbrew/include/pqxx/array.hxx:26,
                 from /home/linuxbrew/.linuxbrew/include/pqxx/pqxx:4,
                 from test.cpp:1:
/home/linuxbrew/.linuxbrew/include/pqxx/strconv.hxx: In instantiation of 'TYPE pqxx::from_string(std::string_view) [with TYPE = std::array<std::byte, 8>; std::string_view = std::basic_string_view<char>]':
/home/linuxbrew/.linuxbrew/include/pqxx/field.hxx:230:28:   required from 'T pqxx::field::as() const [with T = std::array<std::byte, 8>]'
/home/linuxbrew/.linuxbrew/include/pqxx/row.hxx:284:65:   required from 'auto pqxx::row::get_field() const [with TUPLE = std::tuple<std::array<std::byte, 8> >; long unsigned int index = 0]'
/home/linuxbrew/.linuxbrew/include/pqxx/row.hxx:278:53:   required from 'auto pqxx::row::get_tuple(std::index_sequence<__indices ...>) const [with TUPLE = std::tuple<std::array<std::byte, 8> >; long unsigned int ...indexes = {0}; std::index_sequence<__indices ...> = std::integer_sequence<long unsigned int, 0>]'
/home/linuxbrew/.linuxbrew/include/pqxx/row.hxx:192:42:   required from 'std::tuple<_UTypes ...> pqxx::row::as() const [with TYPE = {std::array<std::byte, 8>}]'
test.cpp:114:43:   required from here
/home/linuxbrew/.linuxbrew/include/pqxx/strconv.hxx:441:42: error: 'from_string' is not a member of 'pqxx::string_traits<std::array<std::byte, 8> >'
  441 |   return string_traits<TYPE>::from_string(text);

And it s not just pqxx::row::as(), but pqxx::row::reference::get<>() as well.

Generally it's not a good idea to rely on items from the pqxx::internal:: namespace, because I reserve the right to change those on a whim. In particular, libpqxx 8 will require C++20 or better, and I'm hoping to replace a lot of these functions with ones that use std::span and what have you.

Ye, looking through docs once again, i noticed convenient pqxx::binary_cast that not only works in my case, but also produces a view, so no copies (unlike pqxx::internal::esc_bin, which produces std::string)

Broollie avatar May 08 '24 10:05 Broollie

Upon further investigation, i found out that PQXX_HAVE_CONCEPTS was not defined, so now I wonder if there are other defines, that may be missing and, if defined, should enable direct use of std::array<std::byte, N>

Broollie avatar May 08 '24 11:05 Broollie

The configure or cmake step to the libpqxx build should detect whether your compiler supports concepts.

I'm writing this on my phone where it's hard to look up past conversation... Did you select C++20 (or newer) when configuring the build? Because if so, I would definitely have expected it to detect concepts support.

By the way beware that esc_bin and binary_cast do entirely different things! The former converts binary data to SQL text format. The latter merely creates a view from a pointer, without converting.

jtv avatar May 08 '24 12:05 jtv

@Broollie when building libpqxx, did you use configure or cmake? If you used configure then the file config.log should show whether the configure script found __cpp_concepts defined or not. (And just in case: what matters is the C++ version you selected when configuring the build, i.e. when running either configure or cmake.)

jtv avatar May 08 '24 12:05 jtv

@jtv Ok, i ve figured out what s going on. Mostly. So, basically in my main project i build libpqxx with cmake and all defines are present there. But for this test I use libpqxx installed with brew package manager. And defines are missing there because by default libpqxx v7.9 uses CMAKE_CXX_STANDARD 17. But it is not as important. So the reason why I can't use std::array<std::byte, N> directly is that string_traits already has implementation for std::array and it doesn't have from_string() defined. I checked if you can derive from internal::array_string_traits<> and simply implement from_string() and it does work in my case

Broollie avatar May 08 '24 14:05 Broollie

But it also seems like I m missing something, because direct call to pqxx::string_traits<Data>::size_buffer(OkData) results in linkage error for pqxx::oops_forbidden_conversion<std::byte>(), which is not a great sign

Broollie avatar May 08 '24 15:05 Broollie

By the way beware that esc_bin and binary_cast do entirely different things! The former converts binary data to SQL text format. The latter merely creates a view from a pointer, without converting.

Oh, and to clarify, I saw, that binary_cast simply creates a std::basic_string_view<std::byte> without escaping data and since its behavior is defined, it should work just fine. Seems to be correct

Broollie avatar May 08 '24 15:05 Broollie

Ah yes, the forbidden conversion - I guess that's two concepts clashing.

A span of contiguous std::byte should act as a bytea. And I was expecting a std::array of std::byte to act as one of those.

But more generically, a std::array of any other type should convert as a bunch of individual elements, wrapped in SQL array syntax. But that won't work for std::byte because by design there is no conversion for std::byte.

So we'll need a way to tell the compiler that we want the former in this case, not the latter.

jtv avatar May 08 '24 19:05 jtv

Or, given that the most specific specialisation should normally win out, perhaps it's a matter of removing a string_traits specialisation that has now become too specific.

jtv avatar May 08 '24 20:05 jtv

That is indeed a clash of 2 while not completely different, but still contradicting concepts. There are enough ways of resolving this conflict while keeping backward compatibility (or you can ignore it in favor of implementing changes for 8.x version and some changes in traits API, if there are any). I would guess it may be solved either by creating more specialisations (like having different implementations for std::array<T, N> and std::array<std::byte, N>), that would make string_traits even more specific, or by going other way and creating more generic, yet more flexible traits API. But those are extremes of both sides, so a solution may lie between them

Broollie avatar May 08 '24 20:05 Broollie

@Broollie for now I'm just utterly stuck on getting readthedocs working again. I'm rebuilding that from scratch, because I see no other way at this point. Hopefully after that I can focus on actual progress again.

jtv avatar May 21 '24 09:05 jtv

@jtv sure, we can just keep the issue open so that you can return to it once you r done

Broollie avatar May 21 '24 14:05 Broollie

The good news is that I got the documentation build working again. (With lots of scary warning messages, so I'll be chipping away at that pile still.) I've got one other small feature to build, and then hopefully I can swap this ticket back into my personal memory and see what I can do about it.

jtv avatar Jun 06 '24 09:06 jtv

Reading this back again, I think what I need to do about this ticket is write a test that does conversions of a std::array<std::byte, ...>... and then make that test work.

Of course that's a whole different ballgame in C++20 than it is in C++17, because I'll be counting on C++20 soon. It's not as easy as "make sure it works in C++20, we don't care about C++17."

So I'm planning to keep this ticket on the shelf until then. Right now I've got a really nice idea of where to take the exec functions, and I think that would be a good cutoff point to start 8.0 development.

jtv avatar Aug 07 '24 20:08 jtv

Seems to be a reasonable test case and perhaps there should be a test for a std::span<std::byte, ...> as well. Godspeed on the v8.0

Broollie avatar Aug 15 '24 09:08 Broollie

Thanks!

jtv avatar Aug 15 '24 11:08 jtv

There has been no activity on this ticket. Consider closing it.

github-actions[bot] avatar Oct 15 '24 02:10 github-actions[bot]