doris
[Enhancement] Support simdjson for parsing JSON documents during load
Search before asking
- [X] I had searched in the issues and found no similar issues.
Description
Provide simdjson as a parser for JSON documents in the JSON scanner.
Solution
Currently we use rapidjson to parse JSON documents. It is fast, but not as fast as simdjson. simdjson also has a parsing front-end called `simdjson::ondemand`, which parses JSON lazily as fields are accessed and can return a field's token straight from the original document. This opens up three optimizations:

- Fewer string copies. Today we convert everything to a string literal in `_write_data_to_column` with `sprintf`, and the flame graph shows a hotspot in this function. `simdjson::to_json_string` returns the token (a string piece) as a `std::string_view`, which is exactly what we need.
- Faster field access. In `_set_column_value` we can iterate through the JSON document with `for (auto field : object_val) {xxx}`, which is much faster than looking up each field by its name, as in `objectValue.FindMember("k1")`.
- Direct access by pointer. The `at_pointer` interface that simdjson provides can fetch a JSON field directly from the original document.
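The string-copy saving can be illustrated without simdjson itself. The following is a minimal standard-library sketch (C++17), not Doris or simdjson code: `copy_token` and `view_token` are hypothetical helpers, and the token's offset and length are assumed to have been found by a parser already. It contrasts formatting a token into a fresh buffer, as a `sprintf`-style path does, with returning a `std::string_view` that points into the original document.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: copies a token out of the document the way a
// sprintf-style path would, allocating and filling a new string.
std::string copy_token(const std::string& doc, std::size_t pos, std::size_t len) {
    std::string out(len, '\0');
    // "%.*s" writes at most `len` bytes of the token, then a terminating NUL.
    std::snprintf(out.data(), len + 1, "%.*s", static_cast<int>(len), doc.data() + pos);
    return out;
}

// Hypothetical helper: returns a view into the original buffer.
// No bytes are copied; the view is only valid while `doc` is alive.
std::string_view view_token(const std::string& doc, std::size_t pos, std::size_t len) {
    return std::string_view(doc).substr(pos, len);
}
```

Both helpers yield the same characters, but only `copy_token` allocates and writes a new buffer per field. The view's `data()` pointer lies inside the document buffer, so the document must outlive every view taken from it.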
Below are the performance results from my benchmark using stream load:
Set `config::enable_simdjson_reader=true` to turn on the simdjson reader for parsing.
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Why is simdjson slower on httplogs? Are there any differences between those benchmarks?
In your stream load test, is the JSON one big array object, or 9000 lines of JSON (with read_json_by_line turned on)?
@Gabriel39 Sorry, the order is wrong. I'll fix this.
@caiconghui I turned on read_json_by_line.