glaze icon indicating copy to clipboard operation
glaze copied to clipboard

data_must_be_null_terminated issue

Open shiretu opened this issue 1 year ago • 22 comments

Hi everyone,

This issue is significantly impacting performance since we are required to create copies of the buffer we want to parse to ensure it is null-terminated.

For instance, I have an ultra-fast 0-copy HTTP(S) client. When data reaches the application layer, I end up with several std::string_view instances corresponding to different parts: HTTP header field names, HTTP header field values, HTTP payload, etc. These are all mapped onto the buffer managed by the previous protocol in the stack, which is TLS.

Due to this restriction, and given that TLS buffers are not null-terminated (as is common in networking), I am now compelled to create large data copies instead of directly parsing the std::string_view, which already has the correct size.

Is there a way to remove this restriction?

shiretu avatar Jun 05 '24 19:06 shiretu

Thanks for bringing up this issue and explaining your use case.

By requiring null termination we can greatly reduce the number of checks and branches, which makes parsing around 5% faster. When Glaze can use padding on your buffer (by using a non-const std::string) this performance is increased even more. The reduction in parsing time due to null termination and padding can be more than the time it takes to memcpy the buffer into an intermediate. I just say this to explain the design motivation and express how copying the data in your case could be faster than implementing a version that supports non-null termination.

But, I do agree that it is best to avoid copies and allocations, so this is a very valid concern. Plus, I'm sure your code would be simplified by not requiring these copies.

I'll have to think about this more. Could you provide an example structure that you want to parse and how it is being handled for 0-copy? I'm curious as to how TLS is related to JSON.

I'm very interested because I'm currently working on HTTP/websocket code and working Glaze into it.

stephenberry avatar Jun 05 '24 22:06 stephenberry

The underlying buffer simply needs to contain a null character within its allocated memory somewhere after the message. So, intermediate string_views do not need null termination. Right now Glaze checks input buffer std::string_views for null termination. I’ll add an option to turn off this checking by the developer.

stephenberry avatar Jun 09 '24 12:06 stephenberry

Some gore details in my env: In my case, the TLS is just another filter (in my local terminology) in the chain. This filter is also 0-copy up to it (from the wire) and from it (towards the app layer). Itself, it is not 0 copy, as it has to decrypt/encrypt.

But the rest of the chain on top of it is 0-copy. (except when it gets ziped :) )

Okay, you got the idea, there are filters in the chain that requires copying and massive alteration/mutation of the data. It is their nature. However, JSON parsing should be kind of a read-only thing.

I can not add anything to the buffers, all I can do is read from them or copy them. Reason is that the filters are not aware of each other. They just know they have neighbours and they stimulate them with data (inbound and outbound). It makes things super fast and also encourages metaprogramming (no polymorphism in my case).

My humble and unsolicited recommendation for glaze library is to just accept any kind of input as long as it has std::data() and std::size() overloaded for that type. This way there is no need for custom types to describe the input.

std::string, std::string_view, std::vector<>, std::array<>, etc they all have that overloaded.

My custom buffer used inside the chain also has that. It is the most generic and friction-less way of accepting a chunk of memory for processing.

shiretu avatar Jun 12 '24 06:06 shiretu

At the heart of walking sits 2 logical operations:

  1. advance
  2. compare something with something to see if we reached the end

Currently glaze probably does a pointer increment and then compares the dereferencing of that pointer with \0.

The suggestion I'm making is to keep the pointer advance as is, but compare it un-dereferenced with another end pointer which is computed one-time by doing it like this: (pseudo code)

const char *pCursor=std::data(input);
const char *pEnd=pCursor+std::size(input);
while((pCursor++)!=pEnd){
 //do stuff
}

shiretu avatar Jun 12 '24 06:06 shiretu

If data is null terminated then when we need to check for the existence of a character, let's say a {, then we can simply do if (*pCursor == '{'), but if we need to check against an end, then we need to do if (pCursor != pEnd && *pCursor == '{')

Notice that we now have two comparisons and another boolean operation for every single character that we need to check. I'm just explaining this so you can see the performance reason for requiring null termination.

stephenberry avatar Jun 19 '24 22:06 stephenberry

I plan to solve this problem in a few different ways for different use cases. One quick question I have for you is would you be able to assume the input JSON is valid? If we can assume the input JSON is valid we can make a more efficient solution.

I do plan on adding support for unknown JSON that could have errors without null termination, but this will take a bit more time. But, I'll keep this issue open until I've added the support.

stephenberry avatar Jun 19 '24 22:06 stephenberry

I have the same problem with binary transactions. @stephenberry

#include <cstddef>
#include <glaze/glaze.hpp>
#include <iostream>

int main()
{
   // Example Users
   std::string filePath = "/Users/meftunca/Desktop/big_projects/son/users.beve";
   glz::json_t users = glz::json_t::array_t{glz::json_t{{"id", "58018063-ce5a-4fa7-adfd-327eb2e2d9a5"},
                                                        {"email", "[email protected]"},
                                                        {"password", "@cWLwgM#Knalxeb"},
                                                        {"city", "Sayre ville"},
                                                        {"streetAddress", "1716 Harriet Alley"}}};

   std::vector<std::byte> bytes;
   bytes.shrink_to_fit();
   auto ec = glz::write_file_binary(users, filePath, bytes); // Write to file
   if (ec) {
      std::cerr << "Error: " << glz::format_error(ec) << std::endl;
      return 1;
   }

   glz::json_t user2;
   std::vector<std::byte> bytes2;
   ec = glz::read_file_binary(user2, filePath, bytes2);
   if (ec) {
      std::cerr << "Error: " << glz::format_error(ec) << std::endl;
      return 1;
   }

   std::cout << user2[0]["id"].get<std::string>() << std::endl;
   std::cout << user2[0]["email"].get<std::string>() << std::endl;
   std::cout << user2[0]["password"].get<std::string>() << std::endl;
   std::cout << user2[0]["city"].get<std::string>() << std::endl;
   std::cout << user2[0]["streetAddress"].get<std::string>() << std::endl;

   return 0;
}

Sorry, reading worked fine with std::string.

meftunca avatar Jun 20 '24 13:06 meftunca

If data is null terminated then when we need to check for the existence of a character, let's say a {, then we can simply do if (*pCursor == '{'), but if we need to check against an end, then we need to do if (pCursor != pEnd && *pCursor == '{')

This is quite nice example of dilemma what is better .., making some minor or major optimizations in library and requiring from user to null terminate or don't force user and just use generic 'sentinel' approach. I have spent quite a lot of time to write null termination independent stralgo to get rid of that limitation and in most cases 'false' optimization. It is often false as it is requiring to as mentioned here allocate from heap memory and create deep copy of sub strings and null terminate them to get small improvement in some condition.

Designers of std::ranges/std::views already solved this issue by using generic approach of sentinel_for<> .. It is Up to user how does sentinel looks like in given use case. And source range often it is enough to be just forward range and not required to be contiguous range like string_view .. In my experience of using high load services most important optimization is to not allocate from heap at a cost of additional condition check, as checking condition is often done on data in cache line of cpu and allocation causes global sync event. With sentinel approach user can pass sentinel pointing at end of view or assumin always true() And glaze can then use sentinel composed from user sentinel searching for something .. So the decision of optimizing null termination can be postponed to user, depending what is he passing forward range, string_view or null terminated std::string

arturbac avatar Jul 22 '24 12:07 arturbac

This is a high priority, I've just been distracted by other issues and development needs for my primary job.

The null termination requirement is currently used for a few reasons:

  • Performance, as noted in avoiding end checks
  • Additional buffer space to optimize reading based on this extra padded byte
  • Performance due to error safety knowing that a null character must exist. This affects core parsing algorithms which take this into account for handling invalid inputs. In the sentinel case these optimizations need to be refactored, which is why this change will take a bit more effort.

Designers of std::ranges/std::views already solved this issue by using generic approach of sentinel_for<>...

Yes, the solution is to allow sentinel selection in Glaze. This will allow the user to indicate that the sentinel will indeed exist and therefore Glaze can assume safety.

stephenberry avatar Jul 22 '24 13:07 stephenberry

Designers of std::ranges/std::views already solved this issue by using generic approach of sentinel_for<>...

Yes, the solution is to allow sentinel selection in Glaze. This will allow the user to indicate that the sentinel will indeed exist and therefore Glaze can assume safety.

IMHO Glaze shouldn't assume anything, give responsibility to user by the fact of type provided as source range to parse...

a) Glaze should use user provided sentinel to not read overflow. PPl will benefit in such cases where allocating string costs much more than checking end view condition. b) Or glaze can determine sentinel based on source range, for input range or string_view use sentinel of view.end() for null terminated std::string assume null termination. Thus You can make composed search sentinel from that in first case it will search for 'X' and use end() for range end check, with string it will just search for 'X' as std::string is null_termianted.

I have a lot of cases in production code/services where not allocating string and using string_view with additional end() condition check is much faster than null termination assumption and buffers allocation.

arturbac avatar Jul 22 '24 13:07 arturbac

I agree in some cases to using type deduction. However, we don't want std::string_view parsing to be slower just because it is a string_view, especially when it is common for string views to have null characters or additional characters at the end. The thing is that we can optimize for dereferencing the end() if it is null terminated (especially true when using SWAR). However, if the string_view is not null terminated then we cannot dereference this pointer.

In many cases Glaze algorithms just care about the additional character and Glaze doesn't care whether it is a null character or not. But, I've just phrased the rule as "requires null termination" in order to make things simpler for users.

I have a lot of cases in production code/services where not allocating string and using string_view with additional end() condition check is much faster than null termination assumption and buffers allocation.

Yes, I understand there are massive benefits to this feature. It is coming.

stephenberry avatar Jul 22 '24 14:07 stephenberry

especially when it is common for string views to have null characters or additional characters at the end.

Just for clarity: null termination has nothing to do with string_view which is by default non null terminated sequence of characters ending at data+size. std::basic_string_view is not C string char const * and it is not convertible to it by design.

It is just coincidence that some sub group of string_views constructed from std::string or string literals has null termination when they cover whole C string representation.

arturbac avatar Jul 22 '24 17:07 arturbac

Definitely, but what I'm expressing is that a lot of Glaze optimizations can apply to std::string_view without a requirement of a null termination, but just by having a single additional character allocated at end, which is very common because string_views are often either an entire string or a subset of the string.

stephenberry avatar Jul 22 '24 17:07 stephenberry

IMHO this is very dangerous to relay on some '\0' in some longer string to which string_view is pointing while searching for ex matching } of some opening {. With invalid syntax You will likely start working in buffer overflow area. simple example:

auto a "{..) { ...} "
string_view b{a, 4};

so in normal circumstances You would parse only "{..)" string with syntax error result and return pointer pass by ) as end of parsing. with string view You will violate allowed buffer range and You would find matching } in out of buffer area and return pointer pass } , far away after string view end sentinel.

arturbac avatar Jul 22 '24 21:07 arturbac

No, because Glaze maps to C++ objects you would get an error in parsing well before you hit the null termination character.

I need to explain in more detail, but there are a variety of way we can determine if an error has occurred versus just looking at a sentinel. But, in cases where we do need to look for the end iterator, Glaze does that. There are just some optimizations that need to be tweaked for when a null character cannot be assumed.

stephenberry avatar Jul 22 '24 21:07 stephenberry

Glaze currently does end iterator checks all over in parsing, so it will use the end iterator of your string_view in most cases. There are just some cases that need to be handled.

stephenberry avatar Jul 22 '24 21:07 stephenberry

Glaze will never make logical decisions based on the extended buffer in a string_view, it is just that we can do SIMD optimizations if a buffer exists beyond the length of the string view, in some cases with just a single character.

stephenberry avatar Jul 22 '24 21:07 stephenberry

I've opened a pull request here #1203 that adds an is_null_terminated option that can be set to false for when the input buffer is not null terminated. There's a good bit of work to still do, but you can see the kinds of changes that are necessary.

stephenberry avatar Jul 22 '24 22:07 stephenberry

In thinking about this more I don't want to require users to set an option to handle null terminated buffers. I think it should be the default behavior to handle non-null terminated buffers. And, I've realized this is achievable without a significant performance loss by using an approach that I will call contextual sentinels. Because we know the types of what we are parsing, we can use these to verify that terminating characters match the decoded type, we can then shift the end iterator one place sooner so that the end iterator is always allocated. We then mark the sentinel along the error path so that we can short circuit returns. We check this context error for the sentinel. We handle improper closing due to sub-object termination at end by not allowing the sentinel context error to be set twice, which would produce an actual error.

This approach will give us nearly the same performance as before without requiring end sentinel checks everywhere, because we will contextually know the terminating character.

We'll then have an opt-in option for even faster parsing when we know data is null terminated, and we can turn this on by default for std::string.

The only negative side effect I see is that this approach will not support trailing whitespace when decoding. The buffer will need to be null terminated to support trailing whitespace. But, I think this is an okay limitation and in the future a trailing whitespace version with additional logic can be added.

stephenberry avatar Jul 23 '24 14:07 stephenberry

To add more thoughts for the sake of posterity:

The challenge without null termination is error handling. This is particularly true because Glaze does not use exceptions. If Glaze used exceptions we could just jump when an invalid end was reached, but without exceptions we need to walk our way back up the stack while ensuring no parent function tries to dereference the iterator. This means that for non-null terminated buffers we would need an additional iterator check after every single function call in Glaze. On top of this, every time we access a raw iterator to check something like *it == '{' we would need an additional end check. And, for every single whitespace character read we would need to check end iterators. This can massively degrade performance and makes the code much uglier.

Using contextual sentinels and returning through the error path solves these issues. Because Glaze doesn't use exceptions, the error path is extremely efficient, and we can use it both for errors on termination and also for valid termination. This means that we can utilize the current error checking mechanisms to get back up the stack without adding end iterator checks everywhere. It becomes multi-purposed and results in much faster and cleaner code. Another benefit of contextual sentinels is that we are checking the terminating character of the JSON value up front. This means that we can catch termination issues before parsing anything and handle these error cases extremely fast.

The downsides of contextual sentinels are two fold:

  1. They will not allow trailing whitespace, because we need to verify a valid terminating character. This is really a minor loss, because in most cases of using views we only want to parse the content of interest are are concerned with performance that doesn't care about trailing whitespace.
  2. Unnoticed partial reading for invalid JSON with sub-object (or sub-array) termination.

This second downside has a bit of complexity to it. If the input JSON is valid we have no issue. If the input JSON is invalid then we can easily handle pretty much every error as before. The one error that we no longer handle is partial reading due to sub-object termination. This means that we have JSON like: "{{"a":0}". Notice that the final trailing } is missing. This would not be seen as an error without depth-counting. The reason is because we've taken the error route to exit back up the stack, so the checks for the terminating } are skipped.

The simple solution is to add depth-counting. Meaning that we keep track of the opening and closing braces and brackets to ensure that everything was properly closed. We already need to do this to avoid stack overflows with nested variants, so I think this is a very reasonable solution.

stephenberry avatar Jul 24 '24 14:07 stephenberry

Work on contextual sentinels is now active here: #1213

stephenberry avatar Jul 24 '24 16:07 stephenberry

After getting an implementation working for most of Glaze with contextual sentinels in #1213 I'm tempted to take the more flexible, general solution, even though it will be slower.

I don't like the added complexity of needing to terminate with ] or } (support for , and " could be added). I'm sure users would just like to pass any valid JSON string_view and parse this.

A major decision needs to be made on whether the end iterator checks are made pre-increment or post-increment. It is valid to increment beyond the end so long as no dereference is made. A post-increment check has the downside of needing checks on the current iterator against the end iterator, which is what contextual sentinels was trying to solve. We can solve this problem by using a pre-increment check and error if we would increment to or beyond the end iterator. The downside to a pre-increment check is that we may reach the end iterator in a valid parse, which would incorrectly set an error state. But, just like contextual sentinels used the error path for valid end conditions, we can do the same with pre-increment checks that hit valid end conditions that need to walk back up the stack.

stephenberry avatar Aug 08 '24 11:08 stephenberry

I'm about to merge in initial non-null terminated buffer support for JSON reading (#1262). This uses a general approach rather than contextual sentinels, because I decided that it was best to just let users throw in any std::string_view and not have to worry about the terminating sentinel.

When doing partial reads, a non-null terminated buffer will not give an error if a full object or array isn't parsed, this makes sense because it's a partial read. But, for normal reading of objects and arrays, closing characters are validated.

The default behavior is still to require null-terminated buffers as input when reading JSON, however the option null_terminated can be set to false, which will then check against the end iterator to provide safety.

A lot of tests are being performed on the non-null terminated path. Fuzz testing was added to generic JSON reading and an entire copy of the json_test with over 9,000 lines of test code is being run with null_terminated = false. Invalid input tests were also added for non-null terminated buffers. That said, this is a new feature and prone to bugs.

For performance reasons if you can mutate your view buffer and add a null terminating character and then revert this change post-parse, this will provide the best performance. But, if you don't want to be bothered or can't mutate your input view, then this null-terminated support will only have around a 20% performance loss.

stephenberry avatar Aug 12 '24 21:08 stephenberry