CDRIVER-4740 bson_utf8_escape_for_json performance improvements

Open joshbsiegel opened this issue 5 months ago • 0 comments

Summary

This PR optimizes bson_utf8_escape_for_json to reduce performance problems in bson_as_json.

Motivation

Credit to @alcaeus for the initial find and draft PR. Much of the work in this PR is adapted from his changes. To summarize, bson_as_json was about 5x slower than the PHP driver's equivalent function, and bson_utf8_escape_for_json was identified as the main cause.

Background

In order to convert a UTF-8 string to JSON, some characters require special treatment. For example, double quotes " must be converted into escaped double quotes \". See this diagram for all of the necessary conversions. As such, when converting to JSON, strings must be parsed character by character to ensure that the resulting string is valid for JSON.

_is_special_char()

Previously, bson_utf8_escape_for_json checked, converted, and copied each character one by one. Because most characters do not require any conversion, a bit mask is now used to test whether a character is "special". If it is not, we simply continue iterating. When a special character (or the end of the string) is reached, we append ALL of the normal characters seen up to that point in a single call. This reduces the number of appends to the string, improving performance. bson_string_append_ex was added as a helper for this piece-by-piece strategy.
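As a rough illustration of the bit-mask check and the run scan described above, here is a minimal sketch. It is not the code from this PR: the names special_mask, is_special_char, and normal_run_length are hypothetical, and the set of special bytes (C0 controls, double quote, backslash, and all non-ASCII bytes) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-bit mask: bit c is set when byte c needs special handling.
 * Assumed special set: C0 controls (0x00-0x1F), '"' (0x22), '\\' (0x5C),
 * and every non-ASCII byte (0x80-0xFF). */
static const uint64_t special_mask[4] = {
   0x00000004FFFFFFFFULL, /* 0x00-0x3F: controls (bits 0-31) and '"' (bit 34) */
   0x0000000010000000ULL, /* 0x40-0x7F: '\\' (0x5C - 0x40 = bit 28) */
   0xFFFFFFFFFFFFFFFFULL, /* 0x80-0xBF: all non-ASCII */
   0xFFFFFFFFFFFFFFFFULL, /* 0xC0-0xFF: all non-ASCII */
};

/* One shift and one AND per byte instead of a chain of comparisons. */
static bool
is_special_char (unsigned char c)
{
   return ((special_mask[c >> 6] >> (c & 63)) & 1u) != 0;
}

/* Length of the leading run of ordinary bytes in s[0..len). The caller can
 * then append the whole run with a single copy. */
static size_t
normal_run_length (const char *s, size_t len)
{
   size_t i = 0;
   assert (s != NULL);
   while (i < len && !is_special_char ((unsigned char) s[i])) {
      i++;
   }
   return i;
}
```

For the input abc"def, normal_run_length returns 3, so the three bytes before the quote can be appended in one call rather than three.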

bson_string_alloc()

Previously, bson_utf8_escape_for_json created an empty string and appended to it to produce the JSON-safe string. Because the string produced by this function is guaranteed to be at least as long as the string passed in, starting empty is wasteful: the string must repeatedly grow from size 0. To fix this, a helper function bson_string_alloc was created. This reduces the number of reallocations needed, as we now create an initial string with the minimum known size.
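To make the reallocation cost concrete, here is a small self-contained sketch of a growable buffer that counts how often it has to grow. The sbuf_t type and its functions are hypothetical stand-ins, not libbson's bson_string_t API, and the doubling growth policy is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string, used only to illustrate preallocation. */
typedef struct {
   char *data;
   size_t len;     /* bytes used, excluding the terminating NUL */
   size_t cap;     /* bytes allocated */
   unsigned grows; /* number of realloc calls, tracked for illustration */
} sbuf_t;

/* Create a buffer with room for initial_cap bytes (at least 1 for the NUL). */
static void
sbuf_init (sbuf_t *b, size_t initial_cap)
{
   b->cap = initial_cap ? initial_cap : 1;
   b->data = malloc (b->cap);
   assert (b->data != NULL);
   b->data[0] = '\0';
   b->len = 0;
   b->grows = 0;
}

/* Append n bytes, doubling the capacity until the new content fits. */
static void
sbuf_append (sbuf_t *b, const char *src, size_t n)
{
   size_t needed = b->len + n + 1;
   if (needed > b->cap) {
      size_t cap = b->cap;
      while (cap < needed) {
         cap *= 2;
      }
      b->data = realloc (b->data, cap);
      assert (b->data != NULL);
      b->cap = cap;
      b->grows++;
   }
   memcpy (b->data + b->len, src, n);
   b->len += n;
   b->data[b->len] = '\0';
}
```

In this sketch, appending 64 bytes in 4-byte pieces to a buffer created with sbuf_init (&b, 0) triggers several reallocations, while a buffer created with sbuf_init (&b, 65) never grows, because the input length is a known lower bound on the output length.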

Refactoring

bson_utf8_escape_for_json was also lightly refactored to make it more readable. The function is now structured as follows:

  1. Iterate until we find a special character or end of string. If end of string, append all remaining characters and return.
  2. Append all the non-special characters up to that point.
  3. Check if current character is a NULL terminator. Handle and append if so. Go to 1.
  4. Check if current character is a non-ASCII Unicode character. Handle and append if so. Go to 1.
  5. We have reached a special ASCII character. Handle and append. Go to 1.
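The five steps above can be sketched as a single loop. This is an illustrative approximation, not the PR's implementation: escape_for_json_sketch and its helpers are hypothetical names, the output buffer is simply allocated at the worst-case size instead of using bson_string_t, the UTF-8 check only inspects lead and continuation bit patterns, and the choice of short escapes (\b, \f, \n, \r, \t) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed special set: C0 controls, '"', '\\', and non-ASCII bytes. */
static bool
is_special (unsigned char c)
{
   return c < 0x20 || c == '"' || c == '\\' || c >= 0x80;
}

/* Two-character escape for c, or NULL if c needs the \uXXXX form. */
static const char *
short_escape (unsigned char c)
{
   switch (c) {
   case '"':  return "\\\"";
   case '\\': return "\\\\";
   case '\b': return "\\b";
   case '\f': return "\\f";
   case '\n': return "\\n";
   case '\r': return "\\r";
   case '\t': return "\\t";
   default:   return NULL;
   }
}

/* Byte length of the UTF-8 sequence at s, or 0 if malformed or truncated.
 * Simplified: does not reject overlong forms or surrogates. */
static size_t
utf8_seq_len (const unsigned char *s, size_t avail)
{
   size_t n, i;
   if ((s[0] & 0xE0) == 0xC0) n = 2;
   else if ((s[0] & 0xF0) == 0xE0) n = 3;
   else if ((s[0] & 0xF8) == 0xF0) n = 4;
   else return 0;
   if (n > avail) return 0;
   for (i = 1; i < n; i++) {
      if ((s[i] & 0xC0) != 0x80) return 0;
   }
   return n;
}

/* Returns a malloc'd JSON-escaped copy of in[0..len), or NULL on bad UTF-8. */
static char *
escape_for_json_sketch (const char *in, size_t len)
{
   const unsigned char *s = (const unsigned char *) in;
   char *out = malloc (len * 6 + 1); /* worst case: "\u00XX" per input byte */
   size_t i = 0, o = 0;
   assert (out != NULL);

   while (i < len) {
      /* Step 1: scan to the next special byte or the end of the input. */
      size_t start = i;
      while (i < len && !is_special (s[i])) i++;
      /* Step 2: append the whole run of ordinary bytes in one copy. */
      memcpy (out + o, s + start, i - start);
      o += i - start;
      if (i == len) break;

      if (s[i] == 0) {
         /* Step 3: embedded NUL byte. */
         memcpy (out + o, "\\u0000", 6);
         o += 6;
         i++;
      } else if (s[i] >= 0x80) {
         /* Step 4: non-ASCII; check the sequence, then copy it through. */
         size_t n = utf8_seq_len (s + i, len - i);
         if (n == 0) {
            free (out);
            return NULL;
         }
         memcpy (out + o, s + i, n);
         o += n;
         i += n;
      } else {
         /* Step 5: special ASCII character. */
         const char *esc = short_escape (s[i]);
         if (esc) {
            memcpy (out + o, esc, 2);
            o += 2;
         } else {
            o += (size_t) sprintf (out + o, "\\u%04x", (unsigned) s[i]);
         }
         i++;
      }
   }
   out[o] = '\0';
   return out;
}
```

The caller owns the returned string and releases it with free(). Each pass through the outer loop performs at most one bulk copy plus one special-character append, which is the main difference from a per-character loop.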

Testing

Unit tests and docs were added for the new public API functions bson_string_alloc and bson_string_append_ex. In addition to those and the existing utf8 tests, the fuzz tester for bson_utf8_escape_for_json ran without issues. Fuzz tester failures were the initial reason for backlogging the draft PR; this difference in behavior is likely due to the fix from this PR.

Performance Improvements

To test performance, these profiling steps were followed using both a bson_utf8_escape_for_json benchmark and a bson_as_json benchmark.

| Benchmark | Time Before (s) | Time After (s) | Improvement |
|---|---|---|---|
| bson_utf8_escape_for_json | 35.94 | 13.16 | 63.38% |
| bson_as_json | 56.04 | 35.54 | 36.58% |
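For reference, the Improvement figures above are consistent with the relative reduction in run time. The equivalent speedup factors shown alongside are derived from the same timings and are not separately measured.

```latex
\text{Improvement} = \frac{t_{\text{before}} - t_{\text{after}}}{t_{\text{before}}}

\text{bson\_utf8\_escape\_for\_json:}\quad
\frac{35.94 - 13.16}{35.94} \approx 63.38\%,
\qquad \frac{35.94}{13.16} \approx 2.73\times

\text{bson\_as\_json:}\quad
\frac{56.04 - 35.54}{56.04} \approx 36.58\%,
\qquad \frac{56.04}{35.54} \approx 1.58\times
```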

What's New

  • _bson_string_alloc() implementation and tests
  • bson_string_append_ex() implementation, documentation, and tests
  • bson_utf8_escape_for_json() optimizations (see above) and refactor
  • Helper functions _is_special_char() and _bson_utf8_handle_special_char()

joshbsiegel, Sep 19 '24 16:09