
Evaluate viability of reducing input buffer requirements

Open justend29 opened this issue 1 year ago • 11 comments

Existing input buffers

The input parameter to every Glaze function that parses input from a range, as opposed to iterators, requires the input buffer to be a contiguous range (example). These functions dispatch other glaze functions that parse input from iterators, which therefore expect these iterators to lie within a single contiguous buffer, too.

The majority of operations on the input iterator would be satisfied by a multi-pass iterator, and sometimes even a forward iterator. Nevertheless, for iterators meeting only these weaker concepts, various parsing operations would be much slower; for example, the use of std::distance(), including to construct glz::parse_error's location index.

Limitations

With this premise, there seem to be two limitations:

  1. The entire input buffer must be available in memory before parsing. This is a typical limitation of DOM-based parsers, as opposed to streaming parsers like SAX parsers.
  2. The entire input must be in a contiguous buffer for parsing.

Now, from the outset, it may seem like a niche case to not have the entire input in a contiguous buffer. However, it's extraordinarily common in networking to receive data in smaller chunks. In certain cases, clients aren't aware of the total response size and read data one chunk at a time (streaming). Even when the response body is of a known size, reading the body from IO buffers is invariably accomplished through a series of smaller reads using buffers of known size. The results of these reads can either be handled as they're received, to reuse a single buffer, or buffered until all the smaller reads are done. The latter case is commonly buffered into a non-contiguous buffer, since reallocating the entire buffer and copying the data is slow. asio coins buffers of this style "buffer sequences", but this isn't exclusive to asio.

Reading huge or unsized blobs one buffer/page at a time extends to any IO operation, including reading files. With files, the operating system provides more utilities to abstract the paging behind system calls, but that abstraction disappears with asynchronous IO.

Consideration

So, this ticket is simply to evaluate the viability of reducing Glaze's requirements on input buffers. Certain pieces of the library may need to change, possibly including changing glz::parse_error::location to an iterator, but maybe not.

Given C++'s iterator and range concepts, if the input buffer requirements were weakened, both of the above limitations could be lifted, as smart input iterators could satisfy Glaze's input requirements while performing complex retrieval under the hood. Here is an example of what I'm trying to describe, without any actual networking:

#include <forward_list>
#include <iostream>
#include <ranges>
#include <string>

using Linked_buffer = std::forward_list<std::string>;

template <std::ranges::input_range Input_buffer>
void parse_like_glaze(Input_buffer&& buffer) {
    // universal reference to buffer necessary because:
    // - r-value input ranges are acceptable
    // - input ranges mutate lazily -> no const& buffer
    for (const auto parse_char : buffer) {
        std::cout << "parsing JSON char '" << parse_char << "'\n";
    }
    std::cout << std::endl;
}

int main() {
    // 1. current situation, where all data is stored contiguously
    const std::string_view contiguous_data{
        R"({
            "name": "bob",
            "values": [123,456, 789]
        })"};

    // 2. limitation 1, where all data is stored, but in a linked buffer
    const Linked_buffer buffered_data{{R"( { "name": "bob )"},
                                      {R"( ", "values": [123,4 )"},
                                      {R"( 56, 789 ] )"},
                                      {R"( } )"}};

    // 3. limitation 2, where data is handled 1 chunk at a time, not stored
    Linked_buffer pretend_this_is_network_data{{R"( { "name": "bob )"},
                                               {R"( ", "values": [123,4 )"},
                                               {R"( 56, 789 ] )"},
                                               {R"( } )"}};

    auto buffer_chunk = [current_buffer = std::string{},
                         next_buffer =
                             std::string{}](auto&& chunk_from_wire) mutable {
        // await
        std::swap(current_buffer, next_buffer);
        next_buffer = std::move(chunk_from_wire);  // spawn
        return current_buffer;
    };

    auto chunked_data = pretend_this_is_network_data |
                        std::views::transform(buffer_chunk) | std::views::join;

    // ranges satisfy Glaze's input requirements equally
    // smart input iterators easily formed with C++20 ranges
    parse_like_glaze(contiguous_data);
    parse_like_glaze(std::views::join(buffered_data));
    parse_like_glaze(chunked_data);
}

justend29 avatar Oct 28 '23 17:10 justend29

Case 3, where the input isn't stored in memory in its entirety, would cause code like this to fail. Swapping buffers invalidates any iterators, and the start iterator in that snippet may be in a separate buffer from the closing quote.

This extends to any code that doesn't handle characters as they're parsed, since any previous character could be removed at any time. This includes parsing to string views.

justend29 avatar Oct 29 '23 23:10 justend29

And, an example of current code that requires contiguous input buffers: https://github.com/stephenberry/glaze/blob/main/include/glaze/json/read.hpp#L374

In fact, parsing to a string view from above is another example.

justend29 avatar Oct 30 '23 00:10 justend29

@justend29 do you know whether it is possible to write a concept to detect if iterator is pointing to contiguous buffer or non contiguous buffer?

jbbjarnason avatar Oct 30 '23 08:10 jbbjarnason

@jbbjarnason there are iterator traits, including a contiguous tag that can be detected from the iterator_category trait. The C++ standard provides a concept for this type trait, too.

justend29 avatar Oct 30 '23 10:10 justend29

In terms of implementation, I think the following would solve this at compile time:

template <>
struct glz::meta<std::string_view> {
  static constexpr auto requires_contiguous_buffer{ true };
  // ...
};

And comparing the requires_contiguous_buffer with the contiguous tag.

I think this tag is unnecessary, but it is explicit.

jbbjarnason avatar Oct 30 '23 11:10 jbbjarnason

To restrict parsing string_views, the concept std::contiguous_iterator could simply be applied to the input iterators of the from_json<string_view_t> specialization. string_views don't go through the meta system to generate from_json; there is instead a specialization of from_json directly for string views.

Nevertheless, there is commonly reused code like this that returns a string_view peering into the input buffer. Simply omitting it when the input isn't contiguous would leave essentially no parsing ability.

As you say, parsing directly to a string_view would have to be omitted without contiguous input, but for functions like the above, maybe there's a way to abstract the result away from a contiguous view. This could be reused throughout. The result could then be dependent on the iterator categories provided to the function: a string_view when contiguous, an allocated buffer otherwise, and maybe a different type-specific view when possible.

Realistically, though, allocating may be slower than just copying the non-contiguous buffer into a contiguous one before parsing. In that case, would it be reasonable for Glaze to offer a specific buffer type targeted at a common case, like #2? Then, views could be efficiently constructed without allocating. Maybe there's a way to do that generically, where the view type can be derived and used instead of limiting input to a special buffer type.

justend29 avatar Oct 30 '23 12:10 justend29

Thanks for this helpful discussion on streaming buffers. It's definitely something that should be added, but has a number of complexities, which you have noted.

I will note that in a lot of contexts, reading the entire message into a contiguous buffer and then parsing is faster than the streaming solution. The issue is for enormous files, especially when they get too big to fit in RAM. But, as you point out, there are many networking situations where only a chunk is received at a time. This chunk is often contiguous, but only part of the full message. I think if we focus on a performant solution, it should be reading these chunks of contiguous data, like you've suggested above.

I think using concepts like std::contiguous_iterator is a good way to go. But, like you pointed out, it's probably best to just use contiguous chunks of data and not support streaming directly. The problem is that we don't want to have to check for chunk boundaries everywhere. We don't want to have to be checking if we need to switch buffers within an algorithm to parse a number.

Can we stop our parsing at reasonable delimiters, like commas? If we require buffer chunks to be delimited at commas (,), then we can know that we are contiguous within parsing algorithms like numbers. This will significantly simplify the code and improve performance.

I think we will want a set of supported chunk-terminating characters, e.g. `,` `"` `]` and `}`. We can quickly verify that our buffer chunks end with one of these characters and error otherwise.

For massive strings I think we could just require that specific chunk to support the entire string. So, this would only be a limitation if a single string in the JSON were gigabytes in size, but that is not a common use case. It's much more common for gigabyte JSON files to have many smaller values.

With this approach we can provide some chunk separation code for the general streaming case, where we will have to do some copying to get the chunks to terminate at the supported delimiters (, " ] and }), but for users who are writing their own network code and using libraries like asio, they can write logic that will use these terminating characters and get very high performance.

Those are my initial thoughts. I'll continue to mull over this and consider how to best start tackling streaming.

stephenberry avatar Oct 30 '23 13:10 stephenberry

Adding my comments from a pull request to this discussion:

I often look at assembly with Compiler Explorer and benchmark using Quick Bench. Typically std::memcpy is the fastest approach (especially with known sizes), because it tells the compiler to consider just the raw bytes, handles the distances precisely, and doesn't require a check on each iteration. std::memcmp, std::memset, and std::memmove are also very hard to beat. In this case, I moved to std::memmove because we have no need to increment the buffer iterator, which would be necessary in the old for loop. This lets the compiler drop some instructions.

For low-level string handling I've found these significantly faster than standard algorithms, because standard algorithms work on characters at a time and we can often use chunks of memory. For larger types, like integers or floats, or for structs, the standard algorithms are fantastic and hard to beat.

As this applies to non-contiguous buffers, I don't see a problem, because I intend to require local contiguous chunks to work on. It is terrible for performance if you have to check each byte individually for things like key parsing, number parsing, etc.

I had posted on the streaming issue that I intend to use JSON separators to delineate contiguous chunks. We can just stream into a small buffer until we hit one of these separators (e.g. `,`, `:`, ...). Our only restriction is then to limit the max length of strings that we support. But most JSON documents don't contain a single string that is gigabytes in size. Typically the combined document is gigabytes, with various objects and values.

So, to sum things up, we want to dump the stream into a small local buffer and parse that, and we just parse the next chunk when we reach a separator that would delineate values like numbers, strings, etc.

stephenberry avatar Dec 01 '23 13:12 stephenberry

I believe the null termination requirements of glaze would require every chunk to be null-terminated.

justend29 avatar Dec 06 '23 19:12 justend29

I think we want to parse from a contiguous buffer anyway, so I don't think the null-termination requirement will be an issue. But, good to note.

stephenberry avatar Dec 06 '23 19:12 stephenberry

Seems relevant to this discussion: C++23: The rise of new streams

stephenberry avatar Dec 08 '23 15:12 stephenberry