jsmn
jsmn copied to clipboard
Comply with RFC 8259 in strict mode
This PR brings strict mode in compliance with RFC 8259 (as well as some other changes):
- Various invalid inputs that were accepted before are now correctly rejected (Fixes #41, fixes #52, fixes #117, fixes #131, fixes #158)
- Primitives and strings are fully validated according to RFC 8259 (Fixes #178). (And in non-strict mode, both are now fully relaxed.)
- Strings and primitives are allowed at the root level, not just objects and arrays. (As part of this, fixed #179, which started affecting strict mode with this change)
There is an important caveat with primitives at the root level related to #179: If a number (or any primitive in non-strict mode) is split across incremental parses at the root level, an incomplete token will likely be produced on the first pass instead of
JSMN_ERROR_PART
. The parser has no way of knowing the token is incomplete. It will be corrected on the second pass when the full input string is available. - Arrays and objects disallowed as keys due to issues they cause (Fixes #193). This isn't related to strict mode at all, but while digging in to implement the above, it made sense to address this as well. More on why this change made sense later.
- Calling
jsmn_parse()
withtokens
set toNULL
skips a lot of the validation, even in strict mode. (The assumption being that this is almost always called right before ajsmn_parse()
withtokens
allocated, so for almost all use cases, this saves from doing the validations twice.)
At the same time, this led to some significant changes to non-strict mode (consistent with comments I previously made on issue #154). In summary, non-strict mode is now limited to the following three deviations from RFC 8259: 1) Primitives can be made of any character that doesn't have special meaning in JSON (whitespace and any of {}[],:"
), 2) Strings can contain any character and invalid backslash escapes, and 3) Primitives can be object keys, not just strings. Reasoning for these changes is below. (But see also comment later.)
Unenclosed objects, missing values, and missing commas
Non-strict mode previously allowed the following examples to parse:
"server": "127.0.0.1"
"port": 80
"message": "Hello world"
{"a":1 "b":2}
{
"a":1,
"b"
}
I believe this was intentional; however, this relaxed validation carried over into strict mode in several places (see issues #52, #131, #158). I added a simple state machine to the parser to fix the issues in strict mode, but fixing strict mode without affecting non-strict mode would have been difficult. As a result, non-strict mode no longer allows the above examples. I can think of ways we could make it work, but the added complexity didn't seem worth it.
Arrays and objects as keys
Non-strict mode no longer allows arrays and objects as object keys. This eliminates issue #193, but breaks backward compatibility for some non-RFC-8259-compliant JSON. Here is the reasoning:
- When we add a value to a key, we add one to the key's size and the key becomes the value's parent. This does a couple nice things:
- Keys can be identified by checking for non-arrays and non-objects with
size
1 - The total number of tokens in any JSON object (or array) is the recursive sum of that object's
size
and its children'ssize
s, plus one for the object itself (very convenient for skipping array items or members of an object)
- Keys can be identified by checking for non-arrays and non-objects with
- When arrays and objects are keys, not only are those promises broken, but implementation is non-trivial, since
size
is already used for their own contents. This is the cause of issue #193. I could think of three potential fixes:- Don't increment
size
for a key when it's an array or object. This fixes #193 but still breaks the promises above, and adds noticeable complexity to the code. I don't like this solution. - Differentiate between container-child relationship and key-value relationship. Add a way to identify keys without using
size
: either a newkey
field injsmntok_t
or modifyjsmntype_t
:
and then add atypedef enum { JSMN_UNDEFINED = 0x0, JSMN_OBJECT = 0x1, JSMN_ARRAY = 0x2, JSMN_STRING = 0x4, JSMN_PRIMITIVE = 0x8, JSMN_OBJECT_KEY = 0x11, JSMN_ARRAY_KEY = 0x12, JSMN_STRING_KEY = 0x14, JSMN_PRIMITIVE_KEY = 0x18, } jsmntype_t;
keytok
field alongsidesupertok
injmsn_parser
. This could be enabled only whenJSMN_STRICT
is not defined. I don't hate this idea, but it is extra work for what is ultimately non-RFC-8259-compliant JSON. It still doesn't seem with the effort. - Don't allow objects and arrays to be keys, sidestepping the issue entirely. This minimizes the amount of extra code for non-strict mode. This is the solution I chose. If really missed, we could add array/object keys back as a separate change.
- Don't increment
Conclusion
These are some pretty significant changes to jsmn, but seeing the number of issues about invalid input being accepted, I hope they are well received. I have some ideas for future improvements also, but I will discuss that on the Roadmap issue (#159).
Pull request is in draft mode for now as I need to make a couple changes to fix parsing multiple JSON objects at once and improve validation for primitives, but I wanted to go ahead and get it out there in case there is any potential discussion.
The remaining changes are now complete, and I've added test cases for the new functionality. Marking as "Ready to review."
This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser.
My jsmn_t ends up being:
typedef enum {
JSMN_UNDEFINED = 0x00,
JSMN_OBJECT = 0x01, //!< Object
JSMN_ARRAY = 0x02, //!< Array
JSMN_STRING = 0x04, //!< String
JSMN_PRIMITIVE = 0x08, //!< Other primitive: number, boolean (true/false) or null
#ifdef JSMN_STRICT
JSMN_ANY_TYPE = 0x0F, //!< (OBJECT | ARRAY | STRING | PRIMITIVE)
JSMN_CLOSE = 0x10, //!< Close OBJECT '}' or ARRAY ']'
JSMN_DELIMITER = 0x20, //!< Colon ':' after KEY, Comma ',' after VALUE
JSMN_KEY = 0x40, //!< is a key
JSMN_VALUE = 0x80, //!< is a value
#endif
} jsmntype_t;
and the jsmn_parser ends up as
typedef struct jsmn_parser {
jsmnint_t pos; //!< offset in the JSON string
jsmnint_t toknext; //!< next token to allocate
jsmnint_t toksuper; //!< superior token node, e.g. parent object or array
#ifdef JSMN_STRICT
jsmnenumtype_t expected; //!< Expected jsmn type(s)
#endif
} jsmn_parser;
With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:
if (token->type & JSMN_OBJECT) {
parser->expected = (JSMN_STRING | JSMN_CLOSE);
} else {
parser->expected = (JSMN_ANY_TYPE | JSMN_CLOSE);
}
and then check parser->expected against what the next item actually is
if ((parser->expected & JSMN_PRIMITIVE) == 0) {
// error handling
}
// continue with parsing
This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what.
I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.
typedef enum {
JSMN_UNDEFINED = 0x00,
JSMN_OBJECT = 0x01, //!< Object
JSMN_ARRAY = 0x02, //!< Array
JSMN_STRING = 0x04, //!< String
JSMN_PRIMITIVE = 0x08, //!< Other primitive: number, boolean (true/false) or null
#ifdef JSMN_STRICT
JSMN_ANY_TYPE = 0x0F, //!< (OBJECT | ARRAY | STRING | PRIMITIVE)
JSMN_CLOSE = 0x10, //!< Close OBJECT '}' or ARRAY ']'
JSMN_DELIMITER = 0x20, //!< Colon ':' after KEY, Comma ',' after VALUE
JSMN_KEY = 0x40, //!< is a key
JSMN_VALUE = 0x80, //!< is a value
JSMN_OBJECT_KEY = (JSMN_KEY | JSMN_OBJECT),
JSMN_ARRAY_KEY = (JSMN_KEY | JSMN_ARRAY),
JSMN_STRING_KEY = (JSMN_KEY | JSMN_STRING),
JSMN_PRIMITIVE_KEY = (JSMN_KEY | JSMN_PRIMITIVE),
#endif
} jsmntype_t;
This is pretty close to what I have come up with myself. I went about the problem from two sides, one being to extend jsmn_t, the second adding an extra field to the jsmn_parser.
My jsmn_t ends up being:
typedef enum { JSMN_UNDEFINED = 0x00, JSMN_OBJECT = 0x01, //!< Object JSMN_ARRAY = 0x02, //!< Array JSMN_STRING = 0x04, //!< String JSMN_PRIMITIVE = 0x08, //!< Other primitive: number, boolean (true/false) or null #ifdef JSMN_STRICT JSMN_ANY_TYPE = 0x0F, //!< (OBJECT | ARRAY | STRING | PRIMITIVE) JSMN_CLOSE = 0x10, //!< Close OBJECT '}' or ARRAY ']' JSMN_DELIMITER = 0x20, //!< Colon ':' after KEY, Comma ',' after VALUE JSMN_KEY = 0x40, //!< is a key JSMN_VALUE = 0x80, //!< is a value #endif } jsmntype_t;
and the jsmn_parser ends up as
typedef struct jsmn_parser { jsmnint_t pos; //!< offset in the JSON string jsmnint_t toknext; //!< next token to allocate jsmnint_t toksuper; //!< superior token node, e.g. parent object or array #ifdef JSMN_STRICT jsmnenumtype_t expected; //!< Expected jsmn type(s) #endif } jsmn_parser;
With this, after the opening of an object '{' or an array '[' I can set the set the next item that is expected:
if (token->type & JSMN_OBJECT) { parser->expected = (JSMN_STRING | JSMN_CLOSE); } else { parser->expected = (JSMN_ANY_TYPE | JSMN_CLOSE); }
and then check parser->expected against what the next item actually is
if ((parser->expected & JSMN_PRIMITIVE) == 0) { // error handling } // continue with parsing
This allows for bitwise checks for error handling and while being able to put in strong rules for what can follow what.
That is an interesting approach. Do you plan to share the implementation in a pull request?
I also like leaving the combinations (minus JSMN_ALL_TYPE) implicit, but I could also extend jsmntype_t to include the explicit classes you have, though I would suggest letting bitwise logic set the values for you.
typedef enum { JSMN_UNDEFINED = 0x00, JSMN_OBJECT = 0x01, //!< Object JSMN_ARRAY = 0x02, //!< Array JSMN_STRING = 0x04, //!< String JSMN_PRIMITIVE = 0x08, //!< Other primitive: number, boolean (true/false) or null #ifdef JSMN_STRICT JSMN_ANY_TYPE = 0x0F, //!< (OBJECT | ARRAY | STRING | PRIMITIVE) JSMN_CLOSE = 0x10, //!< Close OBJECT '}' or ARRAY ']' JSMN_DELIMITER = 0x20, //!< Colon ':' after KEY, Comma ',' after VALUE JSMN_KEY = 0x40, //!< is a key JSMN_VALUE = 0x80, //!< is a value JSMN_OBJECT_KEY = (JSMN_KEY | JSMN_OBJECT), JSMN_ARRAY_KEY = (JSMN_KEY | JSMN_ARRAY), JSMN_STRING_KEY = (JSMN_KEY | JSMN_STRING), JSMN_PRIMITIVE_KEY = (JSMN_KEY | JSMN_PRIMITIVE), #endif } jsmntype_t;
It seems you have already noticed, but for the benefit of other readers, I use a new jsmnstate_t
enum in the same way as your extended jsmntype_t
:
typedef enum {
JSMN_STATE_ROOT = 0x0, /* At root */
JSMN_STATE_OBJ_NEW = 0x16, /* Expecting object key or } */
JSMN_STATE_OBJ_KEY = 0x12, /* Expecting object key */
JSMN_STATE_OBJ_COLON = 0x13, /* Expecting object colon */
JSMN_STATE_OBJ_VAL = 0x10, /* Expecting object value */
JSMN_STATE_OBJ_COMMA = 0x15, /* Expecting object comma or } */
JSMN_STATE_ARRAY_NEW = 0x24, /* Expecting array item or ] */
JSMN_STATE_ARRAY_ITEM = 0x20, /* Expecting array item */
JSMN_STATE_ARRAY_COMMA = 0x25 /* Expecting array comma or ] */
} jsmnstate_t;
I probably could have used explicit constants for individual bits to make the code more readable and maintainable (in fact, I probably should add it still):
typedef enum {
JSMN_DELIMITER = 0x1, /* Expecting colon if JSMN_KEY, comma if not */
JSMN_KEY = 0x2, /* Expecting an object key */
JSMN_CAN_CLOSE = 0x4, /* Object or array can be closed */
JSMN_CONTAINER_OBJ = 0x10, /* Inside an object */
JSMN_CONTAINER_ARRAY = 0x20, /* Inside an array */
JSMN_STATE_ROOT = 0x0, /* At root */
JSMN_STATE_OBJ_NEW = (JSMN_CONTAINER_OBJ | JSMN_KEY | JSMN_CAN_CLOSE),
JSMN_STATE_OBJ_KEY = (JSMN_CONTAINER_OBJ | JSMN_KEY),
JSMN_STATE_OBJ_COLON = (JSMN_CONTAINER_OBJ | JSMN_KEY | JSMN_DELIMITER),
JSMN_STATE_OBJ_VAL = (JSMN_CONTAINER_OBJ),
JSMN_STATE_OBJ_COMMA = (JSMN_CONTAINER_OBJ | JSMN_DELIMITER | JSMN_CAN_CLOSE),
JSMN_STATE_ARRAY_NEW = (JSMN_CONTAINER_ARRAY | JSMN_CAN_CLOSE),
JSMN_STATE_ARRAY_ITEM = (JSMN_CONTAINER_ARRAY),
JSMN_STATE_ARRAY_COMMA = (JSMN_CONTAINER_ARRAY | JSMN_DELIMITER | JSMN_CAN_CLOSE)
} jsmnstate_t;
The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time.
One other thing I forgot to put in the description: calling jsmn_parse()
with tokens
set to NULL
skips a lot of the validation that would otherwise happen, for the sake of performance. (The assumption being that such a call would almost always be followed by a call where tokens
has been allocated, so no need to do all the validations twice.)
I will add an extra bullet point to the description for this.
@themobiusproject I made some improvements based on your suggestion to use bitwise logic for the enum values. I think it helps readability a lot.
I've added a branch with a fourth non-strict extension in my own fork: valueless-keys
. It adds the ability to parse objects with missing values, e.g.:
{
"a": 1,
"b",
"c": 3
}
I already have a bunch of pull requests pending here, so I'm not going to submit another one right now, but I thought I should at least mention it since this was originally allowed under non-strict mode.
Let me finish up cleaning up a few bits so all of the previous tests pass again and I will at least put in a link to my fork. I have a lot of changes beyond this (cmake, visual studio, xcode projects, doxygen comments, tests in cmocka, extra json.{c,h} for wrapping jsmn) that don't necessarily go with just this issue and would have to pull the correct pieces for a clean pull request.
@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests.
master, right now, only has two tests left as FIXME and they are non-strict tests which I haven't been looking into. I cleared out some of the original strict type checks as the parser->expected versions covered them.
I also like your idea of setting a flag to avoid strict checks a second time if it has already been done once to find the number of tokens.
The reason for the explicit combinations, though, is those are the actual values that the parser state can have at any given time.
Also, regardless of how long the enum is, it won't touch change compile size or performance, so having each and every option does seem like a good idea; I'll just ignore looking at it in the code.
@dominickpastore I'm forked over at https://github.com/themobiusproject/jsmn. I am going to start moving pieces to their own branches to try some pull requests.
Just want to clarify: if you weren't already planning to create pull requests, don't feel like you have to just because I brought it up. I was just curious.
I was looking at it previously because I like what I have come up with and would be happy to share it with others, especially through the official repo. Since this pushed me to finish the bugs that I had, I was going to get back to it.
I have a quick question regarding the improvements to strict mode: do these changes also fix the following?
{
"array":
[
"missing a comma"
"between these items"
]
}
Even with strict mode enabled, this results in the same tokens as if a comma had been present between the two items. jsonlint.com rejects the JSON above as not being valid. If it's fixed in this PR then great; otherwise, I'll make a separate issue.
@x6herbius, this PR should fix that. That said, I'm not sure if this repository has been abandoned. The maintainers seem to have disappeared. Thus, while I had some other ideas for improvements and even submitted PRs for two of them (#195, #196), I've given up on making further contributions.
I do have a couple more changes I made on my own fork (see valueless-keys
and unsigned-types
branches there), but I don't plan to submit PRs for them unless these existing PRs get accepted.
PR #197 may also be worth checking out. They've made similar sorts of improvements (including proper RFC compliance, I believe) through a different style of implementation.
@dominickpastore I'm still here. I haven't done much as it's been difficult for me due to how big the changes feel. While JSMN is not in the best shape at the moment, it is being dependent on by quite some people and I honestly fear breaking anything just because I did not analyse the changes properly.
To give this repo a bit of life I will try to make a new experimental branch and possibly merge some of the PRs there. Maybe that way it will be possible to get better feedback from community about the changes. And I hope you won't feel like your contributions are a waste.
@pt300 That's good to hear. I was a bit concerned when @themobiusproject and I added some discussion to some of the roadmap issues (#159 and #154) and it seemed to go unnoticed.
I don't want to assume my PRs are the best for the project, either (especially now that there is also PR #197 for a lot of the same stuff). I made the changes because they were what I needed. But they did align with some parts of the roadmap, so I figured I'd submit them to help kickstart the discussion again.
Other thing I would say: This is just my opinion, but it sounds like you are putting too much pressure on yourself. I'd bet, chances are, as long as JSMN can parse standard JSON, 90% of people will be happy. For the rest, it's up to them to speak up if they have a problem with the roadmap. And even if they don't speak up, nothing is forcing them to stop using the current version--that's the beauty of open source.
@dominickpastore @pt300 I am still hanging around here as well.
Both @dominickpastore's PR #194, #195, and #196 and my PR #197 are all valid options. I had once tried to break my PR into smaller changes but continued having chicken and egg problems so all of the changes are lumped into one. I also will comment, @pt300, dominick's and my changes won't merge together as we have both made many fundamental changes that won't jive together (though I know I have swiped some ideas from dominick's code to fix issues in mine).
I certainly don't feel like my contributions have been a waste. There was someone who posted in my PR with a problem that he was having so it seems like someone has been able to benefit what I have done.
it may be useful to note which branch is the most up to date and recommended for new projects, on whatever repository/fork, while issues of backwards compatibility and breaking get resolved.