json
json copied to clipboard
Feature Request: Parse numbers as strings
When working with various counterparties dealing with monetary systems, we found that, quite often than not, we'd recieve json strings like [1.2345] instead of ["1.2345"]. If we parse that as a double, then we might loose precisions in some cases. In order to preserve precision, we have to parse that number as a string. rapidjson offers a solution by providing kParseNumbersAsStringsFlag: https://rapidjson.org/namespacerapidjson.html#a81379eb4e94a0386d71d15fda882ebc9a13981c0b803803f59d7a01aef3dfc987. Interesting enough, Python standard json library also offers the capability to parse numbers as strings: https://docs.python.org/3/library/json.html#json.load (see parse_float and parse_int parameters). We are looking into migrating to boost json library. Parsing numbers as strings is a key thing for us to preserve monetary precision. Thank you.
possible in theory, if we add it to the parse options. they will come in as strings. However, we can consider adding a flag to json::string
somewhere (if we can find a spare bit) which indicates that the string contains a valid number. This should not affect performance if the option is not set.
We technically already support this for parsing. Just use basic_parser
with a custom handler. The caveat is that this is way more complicated than it should have been. We could make detail::handler
public, and document how to override its functions to achieve custom handling of only a subset of parsing events.
The more complicated part of the eqation is serialisation. We don't have a customisable serialiser. On the other hand, custom serialisation is very easy to implement with iostreams.
So, no special bit for "this is actually a number" is required. BTW, I am sceptical that such change would not affect performance, even if only in a minor way.
@cryptochassis do you only need this special handling for parsing? Is using basic_parser with a custom handler enough for you?
Here's an example of what I meant: https://godbolt.org/z/KE7YK7h97
@grisumbras Very sorry for the late reply. I completely missed your previous messages. Yes, we only need this special handling for parsing. Using basic_parser with a custom handler seems to be sufficient. Thanks a lot for providing a concrete example. One question: for the example, when the parser encounters a number, say, a double, will it still call std::stod behind the scene? Because we are a high-frequency-trading code provider, performance is of utmost importance to us. Without calling std::stod, I'd guess it'd save lots of CPU time.
The number will still be parsed. But our parser doesn't call std::strod
(or any other standard number parsing utility for that matter). We use custom number parsing functions, so maybe it will be fast enough for you.
Also, this made me think we might want a parser option that disables number parsing outright.
We parse about millions of json messages per second and therefore skipping string to number conversion would probably have visible impact on our system's performance. We'd appreciate if there could be provided a parser option that disables number parsing. Many thanks!
if you want the highest performance why don't you use simdjson? Do you need the ability to modify the JSON values?
We don't need the ability to modify the JSON values. At the time that we first started our library development in 2019 and published its first version, simdjson wasn't available. Based on the best judgement at that time, we picked rapidjson. We ourselves is a library rather than an end-user application. The reason that we are now aiming at migrating to boost json instead of simdjson is because a sizable part of our current users (or those who are thinking about using our library) comes from a Python background and therefore are intermediate to beginner levels in C++. They need a simple way of getting started to build their applications using our library. The simplest way is to only rely on the header-only components of boost but nothing else. And we are getting closer to that: currently we only depend on boost, websocketpp, and rapidjson. We are almost there of moving away from websocketpp by using your beast websocket. So now the only thing to trim is rapidjson after which our only dependency are the header-only components of boost. To sum up, the reason is to achieve a good balance between performance and usability aiming at a wide array of audience having vastly different C++ proficiencies.
Wow... that rationale is actually rather perfect :)
The number will still be parsed. But our parser doesn't call
std::strod
(or any other standard number parsing utility for that matter). We use custom number parsing functions, so maybe it will be fast enough for you.Also, this made me think we might want a parser option that disables number parsing outright.
Let me know whether we can have such a parser option. Thanks a lot.
An option to disable number parsing outright? I have a PR for that (#901). IIRC, my benchmarking shows that for number-heavy inputs the speed of parsing increases by 80% (but don't quote me on that). @vinniefalco should I pursue it?
To be clear, it still sort of does number validation (we need it to know when the number ends and the parser should start parsing another value), it just doesn't convert the characters into a number.
its an interesting mode
An option to disable number parsing outright? I have a PR for that (#901). IIRC, my benchmarking shows that for number-heavy inputs the speed of parsing increases by 80% (but don't quote me on that). @vinniefalco should I pursue it?
Perfect. Looking forward to the finalization. Thanks a lot.
#901 has been merged into develop
Local benchmarking results:
imprecise | precise | none
Parse gcc apache_builds.json 754 | 753 -0,13%| 753 -0,13%
Parse gcc canada.json 587 | 400 -31,86%|1064 81,26%
Parse gcc citm_catalog.json 1231|1232 0,08%|1344 9,18%
Parse gcc github_events.json 837 | 845 0,96%| 850 1,55%
Parse gcc gsoc-2018.json 975 | 977 0,21%| 974 -0,10%
Parse gcc instruments.json 630 | 640 1,59%| 659 4,60%
Parse gcc marine_ik.json 531 | 404 -23,92%| 654 23,16%
Parse gcc mesh.json 532 | 402 -24,44%| 690 29,70%
Parse gcc mesh.pretty.json 996 | 758 -23,90%|1370 37,55%
Parse gcc numbers.json 818 | 494 -39,61%|1814 121,76%
Parse gcc random.json gcc 383 | 384 0,26%| 385 0,52%
Parse gcc twitter.json 521 | 524 0,58%| 530 1,73%
Parse gcc twitterescaped.json 478 | 474 -0,84%| 488 2,09%
Parse gcc update-center.json 660 | 664 0,61%| 663 0,45%
Parse clang apache_builds.json 757| 750 -0,92%| 751 -0,79%
Parse clang canada.json 613| 378 -38,34%| 905 47,63%
Parse clang citm_catalog.json 1225|1196 -2,37%|1234 0,73%
Parse clang github_events.json 800| 793 -0,88%| 807 0,88%
Parse clang gsoc-2018.json 721| 721 0,00%| 717 -0,55%
Parse clang instruments.json 674| 653 -3,12%| 664 -1,48%
Parse clang marine_ik.json 532| 400 -24,81%| 607 14,10%
Parse clang mesh.json 557| 418 -24,96%| 708 27,11%
Parse clang mesh.pretty.json 1086| 771 -29,01%|1373 26,43%
Parse clang numbers.json 854| 524 -38,64%|1742 103,98%
Parse clang random.json 377| 371 -1,59%| 372 -1,33%
Parse clang twitter.json 556| 558 0,36%| 557 0,18%
Parse clang twitterescaped.json 463| 470 1,51%| 468 1,08%
Parse clang update-center.json 594| 597 0,51%| 594 0,00%
canada.json is +81% on GCC and +48% on clang, numbers.json is +122% on GCC and +104% on clang.