jsmn
jsmn copied to clipboard
RFC 8259, bitwise checks, next siblings, doxygen comments, and more
This PR does a number of things, but the main points are:
- it is fully RFC-8259 compliant
- enables strict mode by default and adds a permissive mode
- extendeds
jsmntype_t
to:- allow bitwise checks for better json validation while parsing
- add more information in post processing
- define the rules for both strict and permissive modes
- adds extra validation when
tokens == NULL
- the parser keeps track of depth and type of container down to 16 or 32 levels deep depending on the
JSMN_SHORT_TOKENS
flag
- the parser keeps track of depth and type of container down to 16 or 32 levels deep depending on the
- adds siblings to traverse through a single level
- if there are more than one json string in the buffer, the first element of the first json string links to the first element of the second json string through the sibling
- allows three different modes for treating multiple json objects in the same buffer
- the default mode stops at the end of a json object allowing the user to parse the next json object if there is one
- a new function is available to query if there is anything left in the buffer
-
JSMN_MULTIPLE_JSON
parses multiple json objects in a single buffer -
JSMN_MULTIPLE_JSON_FAIL
parses one object will fail if there is any non-whitespace character following it in the buffer- this is probably the mode that follows the intent of RFC-8259 the best
- the default mode stops at the end of a json object allowing the user to parse the next json object if there is one
-
JSMN_PERMISSIVE
flags have been extended to be more modular-
JSMN_PERMISSIVE_KEY
allows PRIMITIVEs to be OBJECT KEYs -
JSMN_PERMISSIVE_PRIMITIVE
allows PRIMITIVEs to be any contiguous value
-
- adds doxygen compatible comments and doxygen.cfg
Code-wise, the opening and closing of containers, colons, and commas have been pulled out of jsmn_parse
and put into their own functions (jsmn_parse_
prefix with container_open
, container_close
, colon
, and comma
respectively). jsmn_parse
is extremely clean. Overall, the code is quite verbose but most of it reduces to bit checks and comparisons.
For post processing, every token has multiple flags set.
-
JSMN_KEY
orJSMN_VALUE
-
JSMN_OBJECT
,JSMN_ARRAY
,JSMN_STRING
, orJSMN_PRIMITIVE
These base flags are combined for complex types such as:
-
JSMN_OBJ_VAL
for an object that is also a value, or -
JSMN_STR_KEY
for a string that is also a key.
Primitives are further flagged:
-
JSMN_PRI_LITERAL
for true, false, or null -
JSMN_PRI_MINUS
if it is a negative number -
JSMN_PRI_DECIMAL
if it is a float or double -
JSMN_PRI_EXPONENT
if it has an exponent
This and a lot of validation is done with an enum that can be packed into two bytes. With the enum, vanilla jsmntok_t, parent, and sibling, the final size of the jsmntok_t can be reduced to 12 bytes, or 3 ints, per token.
I have also added over 300 validation tests courtesy of JSONTestSuite. The tests are set up in three groups: the first are the tests that should pass, second are the tests that shouldn't pass, and third are the tests that may pass based on the json parser implementation.
This PR should resolve #118, resolve #131, resolve #137, resolve #158, resolve #163, resolve #177, resolve #178, and resolve #179. I believe this can also resolve #164 by being able to traverse the tokens by sibling.
This should be able to close PR #93, PR #102, PR #108, PR #166, PR #168, PR #172, PR #192, and PR #198.
I also plan on getting utf8 into here with a flag to enable or disable it as to avoid jsmn ballooning in size for small devices.
A strong set of permissive rules would be welcome to further refine my rules.
Hello,
There seems to be an issue in this PR where a string like
char s[] = "{\"array\":[1,2]}";
fails, when an array is inside an object.
It can be fixed by changing
if (type & (JSMN_OBJECT | JSMN_INSD_OBJ))
to
if (type & (JSMN_OBJECT | JSMN_INSD_OBJ) && !(type & JSMN_ARRAY))
in jsmn_parse_comma
@atomsnc - Sorry, I saw your post and then life hit me for a bit.
Very good catch. I don't know how I wasn't testing for that case previously. I think your fix fixes the symptom but not the cause, so I solved the problem a little further up by ignoring JSMN_INSD_OBJ
altogether.
Thank you, and I am also pleased that someone could get through my code.
A massive pr
I really like how strict mode is the default here with possibility of enabling specific relaxations. Especially ones that help cut corners just to squeeze extra performance by reducing conformance. I also think introduction of the sibling feature is a good idea, helps a lot when traversing a more complex JSON.
Myself I mostly was thinking about trying to fix issues with JSMN without introducing additional fields to parser's state. But honesty, by just saving what's expected in the future should just result in more performant code, as bitwise checks ought to be faster and cleaner than deriving what should come next from current tokens and input string.
I think I will try to mix your PR with @dominickpastore to deal with certain parts that just personally felt ugly to me. Throughout the next week or two I will try to battle with git in an attempt to do it properly and push the results onto a separate branch.
@pt300 I appreciate the comments.
A massive pr
It certainly is.
I swiped the next_sibling idea from #68 and shorter tokens and unsigned ints from #93 years ago when I first found jsmn.
Myself I mostly was thinking about trying to fix issues with JSMN without introducing additional fields to parser's state. But honesty, by just saving what's expected in the future should just result in more performant code, as bitwise checks ought to be faster and cleaner than deriving what should come next from current tokens and input string.
I made sure to keep the original types and added new types to be able to check for after parsing, and these led into adding extra states into the same enum for deeper checks by knowing what is legal as the next types. I worked hard to keep everything in a 32-bit uint.
I like the enums and bitwise checks because it is calculated at compile time but makes the code easier to read in places. expected = JSMN_PRI_INTEGER | JSMN_PRI_DECIMAL | JSMN_PRI_EXPONENT | JSMN_CLOSE;
lets you be able to compare the code to the json rfc standard easily and it is all compile time code.
I think I will try to mix your PR with @dominickpastore to deal with certain parts that just personally felt ugly to me. Throughout the next week or two I will try to battle with git in an attempt to do it properly and push the results onto a separate branch.
I look forward to it. I hope you can feel the spirit of jsmn's current release in my PR because it was a great starting point and I tried to keep as much of it as I could while make some significant changes. I am curious though; what parts feel ugly to you? I know I can write some incredibly dense code that isn't the easiest to follow.
After the last sweep through your code it's been mostly few smaller things like the use of defined() in preprocessor macros instead of #ifdef and #ifndef. There was something else but I cannot remember at this time. I will be looking thoroughly once more and will let you know if I find any issues. Also, just a reminder that I am no expert. Feel free to say so if what I say feels wrong.
@themobiusproject I feel like a worthwhile addition would be to have a coalesce strategy in the utils
for open tokens when the tokens array cannot be large.
For tokens which have token.end != JSMN_NEG
, they can either by dropped or moved out with a callback.
next_sibling
will need to be computed again.
Something like
int coalesce_open_tokens(jsmntok_t *tokens, jsmn_parser *parser) {
jsmnint_t i, index = 0;
for (i = 0; i < parser->count; i++) {
if (tokens[i].end == JSMN_NEG) {
tokens[index++] = tokens[i];
}
#if defined(JSMN_NEXT_SIBLING)
else {
/* Relink siblings if any */
for (jsmnint_t j = 0; j < i; j++) {
if (tokens[j].next_sibling == i) {
tokens[j].next_sibling = tokens[i].next_sibling;
break;
}
}
}
#endif
}
/* Return 0 if all the tokens are being worked on. We truly cannot add more. */
if (index == parser->count) return 0;
parser->count = index;
parser->toknext = index;
return 1;
}
Above code is untested.
@pt300
After the last sweep through your code it's been mostly few smaller things like the use of defined() in preprocessor macros instead of #ifdef and #ifndef. There was something else but I cannot remember at this time. I will be looking thoroughly once more and will let you know if I find any issues. Also, just a reminder that I am no expert. Feel free to say so if what I say feels wrong.
I actually also prefer #ifdef
and #ifndef
, but the reason I went with #if defined()
and #if !defined()
is because of the occasional instance of #if defined() && defined()
which looked horrible when surrounded by #if{,n}def
s.
@atomsnc
I feel like a worthwhile addition would be to have a coalesce strategy in the
utils
for open tokens when the tokens array cannot be large.
Could you better describe this use case? I am trying to figure out where it fits in so I can better understand the code snippet.
@themobiusproject When the tokens array for storing tokens while parsing cannot be too big (especially in embedded devices). In that case when the array gets full, user can evict a few completed tokens to parse a bit more.
I have picked most of the changes you've done in this PR and those are now visible in experimental branch. For now I have left out the utilities and the example utilising them.
Definitely there are more changes coming to what is on the branch right. Mostly cleaning up the code and actually making sure there's no space for nasty surprises. Then, it will go back to adding features again.
@pt300 @themobiusproject I just checked out the experimental branch that contains this patch. However, now example/simple.c doesn't parse the JSON in it anymore. This here still works:
{"user": "johndoe", "admin": false, "uid": 1000, "groups": ["users"]}
but this here won't parse anymore:
{"user": "johndoe", "admin": false, "uid": 1000, "groups": ["users", "wheel", "audio", "video"]}
This are the warnings during compilation and the (invalid) result of the run:
holger@holger:~/jsmn$ gcc -Wall -Wextra -oa example/simple.c
example/simple.c: In function ‘jsoneq’:
example/simple.c:16:50: warning: comparison of integer expressions of different signedness: ‘int’ and ‘jsmnint’ {aka ‘unsigned int’} [-Wsign-compare]
16 | if (tok->type == JSMN_STRING && (int)strlen(s) == tok->end - tok->start &&
| ^~
example/simple.c: In function ‘main’:
example/simple.c:66:21: warning: comparison of integer expressions of different signedness: ‘int’ and ‘jsmnint’ {aka ‘unsigned int’} [-Wsign-compare]
66 | for (j = 0; j < t[i + 1].size; j++) {
| ^
holger@holger:~/jsmn$ ./a
Failed to parse JSON: -3
I also tried then PRs 194-196 on top of master. And with that, the example/simple.c still runs.
@holgerschurig I just added the json from the simple.c to my tests and it parses. You can check 0560339. When @pt300 merged my code, he grabbed it before 9f7e9de1 which did leave an error in parsing. That commit needs to be applied to the experimental branch and it should fix the problem.
As for the files in the example, I only ever touch explode.c because I wrote that one along with the jsmn_explodeJSON
function associated with it. You can copy JSON_STRING
from simple.c (or your unescaped version in your previous comment) into library.json in the root directory and run explode
from the examples folder, it will parse and show more information than you will probably want to see.
Sorry again for the inactivity.
This looks odd, as I remember the tests were passing. I will make sure to look into this tomorrow and put the patch in.
I also really wanted to simplify some parts of the parsing, such as reparsing some tokens from the beginning instead of trying to continue parsing. Should cut down complexity without major performance hit + make the code easier to maintain and more difficult to break. But, you can see how well it's going. Hope you can forgive me.
I have applied the changes, but there seem to be still an issue with parsing. I will have to revisit it some other time.