mongo-c-driver CDRIVER-4740 Improve performance of bson_utf8_escape_for_json

While implementing the BSON performance benchmarks in the PHP driver, we noticed a performance problem when converting BSON to JSON. In our tests, we found that encoding a bson_t* to JSON using bson_as_json was significantly slower than converting the BSON to PHP objects using a bson_visitor_t, then running PHP's own JSON encoder on it. Using bson_as_json was slower by about a factor of 5. A bit of trial and error brought me to bson_utf8_escape_for_json: since the performance payload didn't need it, I removed the calls for testing purposes and the performance of bson_as_json improved dramatically to the point where it was not significantly slower than JSON encoding using PHP's logic.

This pull request is the result of the analysis @jmikola and I did, comparing libbson's escape logic with PHP's.

The first commit introduces a performance test for bson_utf8_escape_for_json. This test escapes the string "my\0key" 1 million times. The baseline time for this was 0.359726 seconds according to the test output.
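For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timing loop. It is not the test added in the commit: `time_escape`, `escape_fn` and `stand_in_escape` are hypothetical names, and the stand-in only copies its input so the harness can run without libbson. To time the real function you would wrap bson_utf8_escape_for_json (which takes an ssize_t length and whose result is released with bson_free) in a function matching `escape_fn`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the function under test: returns a heap-allocated string. */
typedef char *(*escape_fn) (const char *utf8, size_t len);

/* Counts how often the stand-in below was called. */
static unsigned long calls;

/* Stand-in for an escape function: it only copies the input (no escaping). */
static char *
stand_in_escape (const char *utf8, size_t len)
{
   char *copy = malloc (len + 1);
   assert (copy);
   memcpy (copy, utf8, len);
   copy[len] = '\0';
   calls++;
   return copy;
}

/* Call `fn` on the same input `iterations` times and return the CPU time used, in seconds. */
static double
time_escape (escape_fn fn, const char *utf8, size_t len, unsigned long iterations)
{
   clock_t begin = clock ();
   for (unsigned long i = 0; i < iterations; i++) {
      char *escaped = fn (utf8, len);
      free (escaped);
   }
   return (double) (clock () - begin) / CLOCKS_PER_SEC;
}
```

A call such as `time_escape (stand_in_escape, "my\0key", 6, 1000000)` mirrors the shape of the benchmark; the length is passed explicitly because the input contains an embedded NUL byte.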

Comparing libbson's implementation with PHP's php_json_escape_string, I found several fixes which I applied commit by commit:

The first fix concerns the allocation of a new bson_string_t for the escaped value. Previously, an empty bson_string_t was created, which then grew by a factor of 2 each time it ran out of space as escaped characters were appended. Since we know that escaping a string of length n produces a string with a length greater than or equal to n, I added a bson_string_alloc similar to bson_string_new, but allocating a given number of bytes while setting the length to 0. This reduces the number of reallocations while encoding and resulted in a benchmark time of 0.327364 seconds, a ~9% improvement.
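To illustrate why preallocating helps, here is a minimal sketch using a hypothetical `buf_t` type instead of bson_string_t. The names `buf_alloc` and `buf_append` and the `reallocs` counter are assumptions made for illustration and are not taken from the PR.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for bson_string_t: a growable, NUL-terminated buffer. */
typedef struct {
   char *str;
   size_t len;        /* bytes used, excluding the terminating NUL */
   size_t alloc;      /* bytes allocated, including room for the NUL */
   unsigned reallocs; /* number of growth steps, kept for illustration only */
} buf_t;

/* Reserve room for `capacity` bytes up front while keeping the length at 0. */
static buf_t
buf_alloc (size_t capacity)
{
   buf_t b;
   b.alloc = capacity + 1;
   b.str = malloc (b.alloc);
   assert (b.str);
   b.str[0] = '\0';
   b.len = 0;
   b.reallocs = 0;
   return b;
}

/* Append n bytes, doubling the allocation only when the buffer is too small. */
static void
buf_append (buf_t *b, const char *src, size_t n)
{
   if (b->len + n + 1 > b->alloc) {
      while (b->len + n + 1 > b->alloc) {
         b->alloc *= 2;
      }
      b->str = realloc (b->str, b->alloc);
      assert (b->str);
      b->reallocs++;
   }
   memcpy (b->str + b->len, src, n);
   b->len += n;
   b->str[b->len] = '\0';
}
```

With `buf_alloc (0)` every few appends trigger a realloc, whereas `buf_alloc (n)` for an input of length n needs none as long as nothing has to be escaped, and fewer otherwise.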

The second commit was heavily inspired by PHP's logic in php_json_escape_string and changes the way many "straightforward" characters are handled. To determine whether characters need escaping, it leverages a bit mask. Note that our bit mask looks slightly different from PHP's, as PHP offers to escape significantly more characters depending on options. Characters that don't need escaping are no longer copied to the escaped string one by one, but rather in a single chunk once the first special character is encountered. This reduces the number of calls to memcpy. To help with this, I extracted bson_string_append_ex which appends a given number of characters from the source string to the destination. This is necessary as we don't want to append the entire unescaped string, but only a certain number of characters. Any non-ASCII characters (c >= 0x80) are appended using the previous logic of using bson_string_append_unichar. At this point in the loop, the only characters left to handle are those that need escaping (", \, \b, \f, \n, \r, \t) and all non-printable ASCII characters, which are added either escaped, or as code points (\uxxxx).

This change resulted in a significant performance improvement, bringing the benchmark time down to 0.207105 seconds, a 42% improvement compared to the baseline.
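The bit mask lookup and chunked copy described for the second commit can be sketched as follows. This is a simplified illustration, not the code from the PR: `must_escape` and `escape_for_json_sketch` are hypothetical names, the output buffer is sized for the worst case instead of using bson_string_t, and non-ASCII bytes are assumed to be valid UTF-8 and are copied through unchanged rather than going through bson_string_append_unichar.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One bit per ASCII value; a set bit means "needs escaping".
 * Set bits: 0x00-0x1F (control characters), 0x22 ('"') and 0x5C ('\\'). */
static const uint32_t needs_escape[4] = {0xFFFFFFFFu, 0x00000004u, 0x10000000u, 0x00000000u};

static int
must_escape (unsigned char c)
{
   /* Bytes >= 0x80 never index the table thanks to the short-circuit. */
   return c < 0x80 && ((needs_escape[c >> 5] >> (c & 31)) & 1u);
}

/* Escape `len` bytes of `in` for use inside a JSON string. The caller frees the result. */
static char *
escape_for_json_sketch (const char *in, size_t len)
{
   /* Worst case: every input byte becomes a 6-byte \uXXXX sequence. */
   char *out = malloc (len * 6 + 1);
   size_t o = 0, start = 0;
   assert (out);

   for (size_t i = 0; i < len; i++) {
      unsigned char c = (unsigned char) in[i];
      const char *two = NULL;

      if (!must_escape (c)) {
         continue; /* extend the pending run of bytes that need no escaping */
      }

      /* Copy the run that precedes this special character with one memcpy. */
      memcpy (out + o, in + start, i - start);
      o += i - start;
      start = i + 1;

      switch (c) {
      case '"':  two = "\\\""; break;
      case '\\': two = "\\\\"; break;
      case '\b': two = "\\b"; break;
      case '\f': two = "\\f"; break;
      case '\n': two = "\\n"; break;
      case '\r': two = "\\r"; break;
      case '\t': two = "\\t"; break;
      default:   break;
      }
      if (two) {
         memcpy (out + o, two, 2);
         o += 2;
      } else {
         /* Remaining control characters are written as \uXXXX. */
         o += (size_t) snprintf (out + o, 7, "\\u%04x", (unsigned) c);
      }
   }

   /* Copy whatever run is left after the last special character. */
   memcpy (out + o, in + start, len - start);
   o += len - start;
   out[o] = '\0';
   return out;
}
```

For the benchmark input, `escape_for_json_sketch ("my\0key", 6)` copies "my" in one memcpy, writes the escape for the NUL byte, and copies "key" in a second memcpy, instead of appending each byte individually.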

The last optimisation concerns adding code points: I noticed that bson_string_append_printf was rather slow, which is wasteful when the format is as predictable as it is here. A new bson_string_append_codepoint function optimises for this path, bypassing printf for small enough values.

This last optimisation further reduces the benchmark time to 0.122769 seconds, about 66% less than the baseline.
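As an illustration of how a \uxxxx escape can be produced without printf, here is a minimal sketch based on a hex digit lookup table. `append_codepoint_sketch` is a hypothetical name and writes into a caller-provided buffer; the actual bson_string_append_codepoint in the PR operates on a bson_string_t and may differ in its details. Lowercase hex digits are an assumption.

```c
#include <assert.h>
#include <string.h>

/* Write the 6-character escape \uXXXX for `cp` into `out`, followed by a NUL.
 * `out` must have room for 7 bytes; `cp` must fit into four hex digits. */
static void
append_codepoint_sketch (char *out, unsigned cp)
{
   static const char hex[] = "0123456789abcdef";

   assert (cp <= 0xFFFFu);
   out[0] = '\\';
   out[1] = 'u';
   out[2] = hex[(cp >> 12) & 0xF]; /* most significant nibble first */
   out[3] = hex[(cp >> 8) & 0xF];
   out[4] = hex[(cp >> 4) & 0xF];
   out[5] = hex[cp & 0xF];
   out[6] = '\0';
}
```

Each digit is a shift, a mask and a table lookup, so no format string has to be parsed at runtime.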

Here's a summary of the benchmark results:

| Test | Time (s) | Improvement over baseline |
|---|---|---|
| Baseline | 0.359726 | |
| `bson_string_alloc` | 0.327364 | 8.996% |
| Bit mask optimisation | 0.207105 | 42.427% |
| Code point optimisation | 0.122769 | 65.872% |

For now, I created this PR as a draft as I'm sure there are a number of potential improvements. For one, all of the new methods are part of the public API, which might not be desired. I'm also sure there are a number of tests that can be added to cover all potential "unhappy paths".

Sep 29 '23 14:09 alcaeus

I recommend the use of a fuzz tester on bson_utf8_escape_for_json. Here is a branch with a fuzz test in this commit: https://github.com/mongodb/mongo-c-driver/commit/50f36d703a90cd2a05a4a66782570cd7832da923

Testing shows a string that returns true for bson_utf8_validate but results in a NULL return from bson_utf8_escape_for_json:

// Testing \x0a\x0a\xc0\x80 results in NULL return.
BSON_ASSERT (bson_utf8_validate ("\x0a\x0a\xc0\x80", 4, true /* allow_null */));
char *str = bson_utf8_escape_for_json ("\x0a\x0a\xc0\x80", 4);
BSON_ASSERT (str); // Fails.
bson_free (str);

Oct 18 '23 17:10 kevinAlbs

Continued in #1732.

Sep 20 '24 06:09 alcaeus
