mongo-c-driver
CDRIVER-4740 bson_utf8_escape_for_json performance improvements
Summary
This PR optimizes bson_utf8_escape_for_json to reduce performance problems in bson_as_json.
Motivation
Credit to @alcaeus for the initial find and draft PR. Much of the work in this PR is adapted from his changes. To summarize, bson_as_json was running about 5x slower than the PHP driver's equivalent function, and bson_utf8_escape_for_json was identified as the main reason for this.
Background
In order to convert a UTF-8 string to JSON, some characters require special treatment. For example, double quotes " must be converted into escaped double quotes \". See this diagram for all of the necessary conversions. As such, when converting to JSON, strings must be parsed character by character to ensure that the resulting string is valid for JSON.
_is_special_char()
Previously, bson_utf8_escape_for_json was checking, converting, and copying over each character one by one. Because most characters do not require any conversion, a bit mask is now used to check whether a character is "special". If it is not, we simply continue to iterate. When a special character (or the end of the string) is reached, we append ALL of the normal characters seen up to that point at once. This reduces the number of appends to the string, improving performance. bson_string_append_ex was added as a helper for this piece-by-piece strategy.
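The bit-mask check and the run scan described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the names special_map, is_special_byte, and plain_run_length are hypothetical, and the set of special bytes (control characters, the double quote, the backslash, and all non-ASCII bytes) is an assumption based on the escaping rules discussed in the Background section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 256-bit map, one bit per byte value, packed into eight 32-bit words.
 * Which bytes count as "special" is an assumption made for this sketch. */
static const uint32_t special_map[8] = {
   0xffffffffu, /* 0x00-0x1f: control characters (including the null byte) */
   0x00000004u, /* 0x20-0x3f: only '"' (0x22) */
   0x10000000u, /* 0x40-0x5f: only '\\' (0x5c) */
   0x00000000u, /* 0x60-0x7f: nothing */
   0xffffffffu, /* 0x80-0x9f: non-ASCII */
   0xffffffffu, /* 0xa0-0xbf: non-ASCII */
   0xffffffffu, /* 0xc0-0xdf: non-ASCII */
   0xffffffffu  /* 0xe0-0xff: non-ASCII */
};

/* Select the word with the top 3 bits of c, then the bit with the low 5. */
static bool
is_special_byte (unsigned char c)
{
   return ((special_map[c >> 5] >> (c & 31u)) & 1u) != 0u;
}

/* Length of the leading run of bytes that can be copied unchanged.
 * The caller can append that whole run with one call instead of one
 * call per character. */
static size_t
plain_run_length (const char *s, size_t len)
{
   size_t i = 0;
   assert (s != NULL);
   while (i < len && !is_special_byte ((unsigned char) s[i])) {
      i++;
   }
   return i;
}
```

For the input abc"def, plain_run_length returns 3, so the first three bytes can be appended in one operation before the quote is handled.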
bson_string_alloc()
Previously, an empty string was created in bson_utf8_escape_for_json and then appended to in order to produce the JSON-safe string. Because we can guarantee that the string produced by this function will be at least the size of the string passed in, this is wasteful: the string has to be grown repeatedly, starting from size 0. To fix this, a helper function bson_string_alloc was created. This reduces the number of reallocations needed, as we now create an initial string with the minimum known size.
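To illustrate why pre-sizing helps, here is a minimal growable buffer that counts how often it has to reallocate. It is a sketch under assumptions, not libbson's implementation: strbuf_t, strbuf_with_capacity, and strbuf_append are hypothetical names, and the doubling growth policy is chosen only for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
   char *data;
   size_t len;     /* bytes used, excluding the terminating NUL */
   size_t cap;     /* bytes allocated, including room for the NUL */
   unsigned grows; /* how many times the buffer was reallocated */
} strbuf_t;

/* Create an empty buffer that can hold `size` bytes without growing. */
static strbuf_t
strbuf_with_capacity (size_t size)
{
   strbuf_t b;
   b.cap = size + 1;
   b.data = (char *) malloc (b.cap);
   assert (b.data != NULL);
   b.data[0] = '\0';
   b.len = 0;
   b.grows = 0;
   return b;
}

/* Append n bytes, doubling the capacity until the new content fits. */
static void
strbuf_append (strbuf_t *b, const char *src, size_t n)
{
   assert (b != NULL && src != NULL);
   if (b->len + n + 1 > b->cap) {
      size_t new_cap = b->cap;
      while (new_cap < b->len + n + 1) {
         new_cap *= 2;
      }
      b->data = (char *) realloc (b->data, new_cap);
      assert (b->data != NULL);
      b->cap = new_cap;
      b->grows++;
   }
   memcpy (b->data + b->len, src, n);
   b->len += n;
   b->data[b->len] = '\0';
}
```

Appending 1000 single bytes to a buffer created with strbuf_with_capacity (0) triggers several reallocations, while a buffer created with strbuf_with_capacity (1000) never grows for the same workload.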
Refactoring
Some basic refactoring was made to bson_utf8_escape_for_json in an attempt to make it more readable. The function structure is as follows:
1. Iterate until we find a special character or end of string. If end of string, append all remaining characters and return.
2. Append all the non-special characters up to that point.
3. Check if current character is a NULL terminator. Handle and append if so. Go to 1.
4. Check if current character is a non-ASCII Unicode character. Handle and append if so. Go to 1.
5. We have reached a special ASCII character. Handle and append. Go to 1.
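The steps above can be sketched as a single loop. This is a simplified illustration, not the refactored function itself: escape_for_json_sketch and needs_handling are hypothetical names, the output is a plain malloc'ed buffer sized for the worst case rather than a growable string, and non-ASCII bytes are copied through on the assumption that the input is already valid UTF-8 (no validation is performed here).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
needs_handling (unsigned char c)
{
   return c < 0x20 || c == '"' || c == '\\' || c >= 0x80;
}

/* Returns a malloc'ed, NUL-terminated, escaped copy of in[0..len). */
static char *
escape_for_json_sketch (const char *in, size_t len)
{
   /* Worst case: every input byte becomes a 6-byte \uXXXX escape. */
   char *out = (char *) malloc (len * 6 + 1);
   size_t o = 0, i = 0;
   assert (in != NULL && out != NULL);

   while (i < len) {
      /* Steps 1-2: scan the plain run, then copy it with one memcpy. */
      size_t start = i;
      while (i < len && !needs_handling ((unsigned char) in[i])) {
         i++;
      }
      memcpy (out + o, in + start, i - start);
      o += i - start;
      if (i == len) {
         break;
      }

      unsigned char c = (unsigned char) in[i++];
      if (c == 0) {
         /* Step 3: embedded null byte. */
         memcpy (out + o, "\\u0000", 6);
         o += 6;
      } else if (c >= 0x80) {
         /* Step 4: non-ASCII byte, copied through unvalidated here. */
         out[o++] = (char) c;
      } else {
         /* Step 5: special ASCII character. */
         const char *esc = NULL;
         switch (c) {
         case '"': esc = "\\\""; break;
         case '\\': esc = "\\\\"; break;
         case '\b': esc = "\\b"; break;
         case '\f': esc = "\\f"; break;
         case '\n': esc = "\\n"; break;
         case '\r': esc = "\\r"; break;
         case '\t': esc = "\\t"; break;
         default: break;
         }
         if (esc != NULL) {
            memcpy (out + o, esc, 2);
            o += 2;
         } else {
            /* Remaining control characters use the \u00XX form. */
            o += (size_t) snprintf (out + o, 7, "\\u%04x", (unsigned) c);
         }
      }
   }
   out[o] = '\0';
   return out;
}
```

Called on the 9-byte input say "hi" followed by a newline, the sketch returns say \"hi\"\n, with the two plain runs each copied in one step. The caller frees the result with free().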
Testing
Unit tests and docs were added for the new public API bson_string_alloc and bson_string_append_ex. In addition to that and running the utf8 tests, the fuzz tester for bson_utf8_escape_for_json ran without issues. Fuzz tester failures were the initial reason for backlogging the draft PR; this difference in behavior is likely due to the fix from this PR.
Performance Improvements
To test performance, these profiling steps were followed using both a bson_utf8_escape_for_json benchmark and a bson_as_json benchmark.
Benchmark | Time Before (s) | Time After (s) | Improvement |
---|---|---|---|
bson_utf8_escape_for_json | 35.94 | 13.16 | 63.38% |
bson_as_json | 56.04 | 35.54 | 36.58% |
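
The Improvement column is not defined explicitly, but the figures are consistent with the relative reduction in run time. Checking both rows under that assumption:

```latex
\text{Improvement} = \frac{t_{\text{before}} - t_{\text{after}}}{t_{\text{before}}}
\qquad
\frac{35.94 - 13.16}{35.94} \approx 63.38\%
\qquad
\frac{56.04 - 35.54}{56.04} \approx 36.58\%
```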
What's New
- _bson_string_alloc() implementation and tests
- bson_string_append_ex() implementation, documentation, and tests
- bson_utf8_escape_for_json() optimizations (see above) and refactor
- Helper functions _is_special_char() and _bson_utf8_handle_special_char()