doris icon indicating copy to clipboard operation
doris copied to clipboard

[Enhancement](stream-load-json) using simdjson to parse json

Open eldenmoon opened this issue 2 years ago • 0 comments

Proposed changes

Currently we use rapidjson to parse json document, It's fast but not fast enough compare to simdjson.And I found that the simdjson has a parsing front-end called simdjson::ondemand which will parse json when accessing fields and could strip the field token from the original document, using this feature we could reduce the cost of string copy(eg. we convert everthing to a string literal in _write_data_to_column by sprintf, I saw a hotspot from the flamegrame in this function, using simdjson::to_json_string will strip the token(a string piece) which is std::string_view and this is exactly we need).And second in _set_column_value we could iterate through the json document by for (auto field: object_val) {xxx}, this is much faster than looking up a field by it's field name like objectValue.FindMember("k1").The third optimization is the at_pointer interface simdjson provided, this could directly get the json field from original document.

Issue Number: close #11663

Problem summary

using simdjson to parse in vjsonscanner.

Checklist(Required)

  1. Does it affect the original behavior:
    • [ ] Yes
    • [x] No
    • [ ] I don't know
  2. Has unit tests been added:
    • [x] Yes
    • [ ] No
    • [ ] No Need
  3. Has document been added or modified:
    • [ ] Yes
    • [x] No
    • [x] No Need
  4. Does it need to update dependencies:
    • [ ] Yes
    • [x] No
  5. Are there any changes that cannot be rolled back:
    • [ ] Yes (If Yes, please explain WHY)
    • [x] No

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

eldenmoon avatar Aug 11 '22 02:08 eldenmoon