
Suspected bug with complex strings

Open nashborges opened this issue 6 years ago • 17 comments

This works fine:

$ echo '{"key":"{blah} blah"}' | genson
{"$schema": "http://json-schema.org/schema#", "required": ["key"], "type": "object", "properties": {"key": {"type": "string"}}}

but if there are repeated curly braces within a string, it bombs:

$ echo '{"key":"{blah} {blah}"}' | genson
Traceback (most recent call last):
  File "/usr/local/bin/genson", line 9, in <module>
    load_entry_point('genson==1.0.1', 'console_scripts', 'genson')()
  File "/usr/local/lib/python2.7/dist-packages/genson/cli.py", line 25, in main
    add_json_from_file(builder, object_file, args.delimiter)
  File "/usr/local/lib/python2.7/dist-packages/genson/cli.py", line 97, in add_json_from_file
    method(json.loads(json_string))
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 8 (char 7)

nashborges avatar Feb 26 '18 22:02 nashborges

Yes, this thwarts the fairly naïve object boundary detection that genson deploys by default. It just checks for curly braces with their backs to each other. To override this, pass a --delimiter option:

echo '{"key":"{blah} {blah}"}' | genson -d newline

It occurs to me that there's no actual way to turn the delimiter off completely, short of passing it something that doesn't appear in your input text. Perhaps I should add that feature, and perhaps I should turn delimiting off by default. Or perhaps I should just find a more intelligent way to parse the input. Votes for your preferred option are appreciated.
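
To illustrate the problem (this is illustrative only, not the actual genson source), the default splitting behaves roughly like this, with no awareness of string context:

```python
import re

# Hypothetical sketch of the back-to-back-brace heuristic: split
# wherever a closing brace is followed (optionally by whitespace)
# by an opening one, regardless of whether we're inside a string.
def naive_split(text):
    return re.split(r'(?<=})\s*(?={)', text)

naive_split('{"a": 1} {"b": 2}')
# correctly yields two object strings

naive_split('{"key":"{blah} {blah}"}')
# wrongly splits inside the quoted string, producing two
# fragments that are no longer valid JSON
```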

wolverdude avatar Feb 27 '18 01:02 wolverdude

@wolverdude I had to look at the code to see what exactly this was referring to, and I understand now: you use a heuristic to pull multiple object definitions from a single not-JSON file. Personally, I'd deprecate this functionality unless there's some standard that supports it (it seems very fragile and non-standard). YAML actually does have support for this (via the '---' indicator), so it should be fine to support in that format once I implement it.

What I did for my use-case was create a wrapper that instead generates the schema based on "subobjects": you pass a valid JSON list of objects along with a flag to generate a schema matching each element instead of the whole list. That way you can just use json.parse.

It would be handy and much more robust if this were built in.
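
A rough stdlib-only sketch of that wrapper idea (the add_object callback is a stand-in for whatever schema builder you feed, e.g. genson's SchemaBuilder.add_object):

```python
import json

def schema_per_item(json_text, add_object):
    """Parse a top-level JSON list and feed each element to a
    schema-builder callback, so the schema describes the items
    rather than the list as a whole."""
    items = json.loads(json_text)
    if not isinstance(items, list):
        raise ValueError("expected a top-level JSON list")
    for item in items:
        add_object(item)
    return len(items)
```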

DylanYoung avatar Feb 12 '20 19:02 DylanYoung

To start off the deprecation, you could make --delimiter required for this functionality.

DylanYoung avatar Feb 12 '20 19:02 DylanYoung

I understand this is marked wontfix, and maybe I'm wrong, but I don't actually think it's so hard to fix, even in a streaming fashion.

I would do it with a LALR parser.

You can say e.g. (untested lark format):

object = '{' ~ internals* ~ '}'
internals = quoted_string | bytes+
bytes = /[^{}]/

(Lark has a quoted_string thing already)

By testing for quoted strings first you just drop those

And then you can use a streaming Lark parser, and only consider top-level objects.
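
The same string-first idea can even be sketched by hand without a parser generator (a quick, simplified splitter that tracks brace depth and only emits top-level objects):

```python
def split_objects(stream_text):
    """Split concatenated top-level JSON objects, ignoring braces
    that occur inside quoted strings (strings are tested first)."""
    chunks, depth, start = [], 0, None
    in_string = escape = False
    for i, ch in enumerate(stream_text):
        if in_string:
            if escape:
                escape = False          # char after backslash is literal
            elif ch == '\\':
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == '{':
            if depth == 0:
                start = i               # beginning of a top-level object
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                chunks.append(stream_text[start:i + 1])
    return chunks
```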

What am I missing @wolverdude ?

ctrlcctrlv avatar Jun 28 '22 17:06 ctrlcctrlv

@ctrlcctrlv, thanks for the suggestion! I'm unfamiliar with LALR, but I'll take your word for it. I'd be open to a PR for this, but some things need to be worked out first:

  1. lark would be a sizeable dependency for a niche feature that most people won't use. We should probably support it only with extras_require.
  2. 1, true, "str", and [1, 2, 3] are all valid JSON even though they don't contain curly braces. Should we handle these cases? Is there a way to do this with LALR? (Maybe your example does this already; I don't know enough to tell.)

There is no standard way that I'm aware of to pack multiple JSON objects into a single file, so support for that beyond define-your-own-delimiter isn't a strong priority. My current plan is to soft-deprecate the current functionality by making GenSON assume there's only one object in the file unless a --delimiter is passed, and then add "guess" as a possible delimiter value that would use the old, buggy default.
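
For the record, the stdlib can already consume a stream of concatenated top-level values (scalars included) one at a time via json.JSONDecoder.raw_decode — a minimal sketch:

```python
import json

def iter_json_values(text):
    """Yield each top-level JSON value from a concatenated stream,
    using raw_decode to find where each value ends."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # raw_decode doesn't skip leading whitespace, so do it here
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end
```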

All that said, if you do in fact have a good, simple way to do this, then I see no reason why we can't at least optionally support it. Just be sure to include tests.

wolverdude avatar Jun 29 '22 04:06 wolverdude

Why is lark a sizable dependency? It has itself no requirements.

ctrlcctrlv avatar Aug 06 '22 02:08 ctrlcctrlv

I've actually tested it this time, and it meets even your requirement №2. I had a long comment written here, but I thought a proof of concept was better: https://github.com/ctrlcctrlv/jsonl-streaming-parser

Input

{"alph{a": 3, "king": {"queen": 3}} {"2": []}  {"id": 1} 23 23 "cap'n crunch" [1,2, 3]

Output (on console)

DEBUG: Buffer status: len 0, removed in last iteration 0, have parsed 0 objects; contents are ``
DEBUG: Buffer status: len 10, removed in last iteration 0, have parsed 0 objects; contents are `{"alph{a":`
DEBUG: Buffer status: len 20, removed in last iteration 0, have parsed 0 objects; contents are `{"alph{a": 3, "king"`
DEBUG: Buffer status: len 30, removed in last iteration 0, have parsed 0 objects; contents are `{"alph{a": 3, "king": {"queen"`
DEBUG: Buffer status: len 40, removed in last iteration 0, have parsed 0 objects; contents are `{"alph{a": 3, "king": {"queen": 3}} {"2"`
INFO: Got {"alph{a": 3, "king": {"queen": 3}}
INFO: Got  
DEBUG: Buffer status: len 14, removed in last iteration 36, have parsed 2 objects; contents are `{"2": []}  {"i`
INFO: Got {"2": []}
INFO: Got   
DEBUG: Buffer status: len 13, removed in last iteration 11, have parsed 4 objects; contents are `{"id": 1} 23 `
INFO: Got {"id": 1}
INFO: Got  
INFO: Got 23
INFO: Got  
DEBUG: Buffer status: len 10, removed in last iteration 13, have parsed 8 objects; contents are `23 "cap'n `
INFO: Got 23
DEBUG: Buffer status: len 18, removed in last iteration 2, have parsed 9 objects; contents are ` "cap'n crunch" [1`
INFO: Got  
INFO: Got "cap'n crunch"
INFO: Got  
WARNING: Last iteration!
DEBUG: Buffer status: len 8, removed in last iteration 16, have parsed 12 objects; contents are `[1,2, 3]`
INFO: Got [1,2, 3]
INFO: Final list of objects:
		{"alph{a": 3, "king": {"queen": 3}},
		{"2": []},
		{"id": 1},
		23,
		23,
		"cap'n crunch",
		[1,2, 3]

There are several changes @erezsh could consider making in Lark to make streaming parsers easier to handle. At present I just tear down the parser repeatedly. A non-naive implementation should rotate JsonInterpreter.json_split as well.

ctrlcctrlv avatar Aug 06 '22 09:08 ctrlcctrlv

Also, you seem to rely on the default json library a lot, so I didn't bother writing a full JSON parser; some invalid strings will pass through, so you should still run the results of JsonInterpreter.json_objects() through it. (Cf. https://github.com/ctrlcctrlv/jsonl-streaming-parser/commit/20f22f0f8c0e88ab55615a92e7126898c81ceeab)
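
Concretely, that validation pass is trivial (a sketch):

```python
import json

def validated(fragments):
    # Re-parse each split-out fragment with the stdlib json module,
    # so anything the permissive splitter let through is rejected
    # here with a normal ValueError/JSONDecodeError.
    for frag in fragments:
        yield json.loads(frag)
```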

The parser isn't quite naïve, just not correct since its job is splitting more than it is parsing:

start: jsonl+
jsonl: internals
object: "{" internals* "}"
list: "[" internals* "]"
internals: (ESCAPED_STRING | BYTES | WS | object | list)
BYTES: _BYTE+
_BYTE: /[^{}\[\]\s"]/

%import common.ESCAPED_STRING
%import common.WS

ctrlcctrlv avatar Aug 06 '22 09:08 ctrlcctrlv

@ctrlcctrlv Not sure why you pinged me. Can you sum it up for me?

erezsh avatar Aug 06 '22 09:08 erezsh

@erezsh Oh, yes.

https://github.com/ctrlcctrlv/jsonl-streaming-parser/blob/20f22f0f8c0e88ab55615a92e7126898c81ceeab/main.py#L34

In summary, if there were a way for me to tell Lark that I am only interested in the results my transformer is storing (which I can e.g. auto-rotate), then Lark could natively support streaming parsers via its transformer mechanism.

At present I have to use its ability to cache the grammar and continually recreate the parser (but not the transformer):

https://github.com/ctrlcctrlv/jsonl-streaming-parser/blob/20f22f0f8c0e88ab55615a92e7126898c81ceeab/jsonl.py#L19

ctrlcctrlv avatar Aug 06 '22 09:08 ctrlcctrlv

There is no standard way that I'm aware of to pack multiple JSON objects into a single file, so support for that beyond define-your-own-delimiter isn't a strong priority. My current plan is to soft-deprecate the current functionality by making GenSON assume there's only one object in the file unless a --delimiter is passed, and then add "guess" as a possible delimiter value that would use the old, buggy default.

The good thing about my implementation is that --delimiter can be removed entirely; it's no longer needed at all. Consider the change I just made in https://github.com/ctrlcctrlv/jsonl-streaming-parser/commit/96e9d9fd29b60e7d7b869e712efb2bf6172f6b18, which allows this JSON:

{}{"alph{a": 3, "king": {"queen": 3}} [][]""{} {"2": []}  {"id": 1} 23 23 "cap\"n crunch" [1,2, 3] []""3{}2.2""null

To be parsed as:

[
		{},
		{"alph{a": 3, "king": {"queen": 3}},
		[],
		[],
		"",
		{},
		{"2": []},
		{"id": 1},
		23,
		23,
		"cap\"n crunch",
		[1,2, 3],
		[],
		"",
		3,
		{},
		2.2,
		"",
		null
]
DEBUG: Buffer status: len 0, removed in last iteration 0, have parsed 0 objects; contents are ``
DEBUG: Buffer status: len 10, removed in last iteration 0, have parsed 0 objects; contents are `{}{"alph{a`
INFO: Got {}
DEBUG: Buffer status: len 18, removed in last iteration 2, have parsed 1 objects; contents are `{"alph{a": 3, "kin`
DEBUG: Buffer status: len 28, removed in last iteration 2, have parsed 1 objects; contents are `{"alph{a": 3, "king": {"quee`
DEBUG: Buffer status: len 38, removed in last iteration 2, have parsed 1 objects; contents are `{"alph{a": 3, "king": {"queen": 3}} []`
INFO: Got {"alph{a": 3, "king": {"queen": 3}}
INFO: Got  
INFO: Got []
DEBUG: Buffer status: len 10, removed in last iteration 38, have parsed 4 objects; contents are `[]""{} {"2`
INFO: Got []
INFO: Got ""
INFO: Got {}
INFO: Got  
DEBUG: Buffer status: len 13, removed in last iteration 7, have parsed 8 objects; contents are `{"2": []}  {"`
INFO: Got {"2": []}
INFO: Got   
DEBUG: Buffer status: len 12, removed in last iteration 11, have parsed 10 objects; contents are `{"id": 1} 23`
INFO: Got {"id": 1}
INFO: Got  
INFO: Got 23
DEBUG: Buffer status: len 10, removed in last iteration 12, have parsed 13 objects; contents are ` 23 "cap\"`
INFO: Got  
INFO: Got 23
DEBUG: Buffer status: len 17, removed in last iteration 3, have parsed 15 objects; contents are ` "cap\"n crunch" `
INFO: Got  
INFO: Got "cap\"n crunch"
INFO: Got  
DEBUG: Buffer status: len 10, removed in last iteration 17, have parsed 18 objects; contents are `[1,2, 3] [`
INFO: Got [1,2, 3]
INFO: Got  
DEBUG: Buffer status: len 11, removed in last iteration 9, have parsed 20 objects; contents are `[]""3{}2.2"`
INFO: Got []
INFO: Got ""
INFO: Got 3
INFO: Got {}
WARNING: Last iteration!
DEBUG: Buffer status: len 9, removed in last iteration 7, have parsed 24 objects; contents are `2.2""null`
INFO: Got 2.2
INFO: Got ""
INFO: Got null
INFO: Final list of objects:
		{},
		{"alph{a": 3, "king": {"queen": 3}},
		[],
		[],
		"",
		{},
		{"2": []},
		{"id": 1},
		23,
		23,
		"cap\"n crunch",
		[1,2, 3],
		[],
		"",
		3,
		{},
		2.2,
		"",
		null,

ctrlcctrlv avatar Aug 06 '22 09:08 ctrlcctrlv

This is great! Since you already created a general-purpose JSON-parsing library, go ahead and package it up, and we can add it as a dependency. We can make the default delimiter None unless your Lark library has been installed, in which case it will use that by default.

wolverdude avatar Aug 06 '22 23:08 wolverdude

I haven't seen any movement on this. @ctrlcctrlv, I think your library could be useful as an independent PyPI package. Just add docs, manifests, and preferably some tests. https://packaging.python.org/en/latest/tutorials/packaging-projects/

Once that is done, I'll add it as an optional dependency. Failing that, I will change the default behavior in the next version, but it will just always default to no delimiter. It won't have this nice optional feature.

wolverdude avatar Dec 20 '22 16:12 wolverdude

Sorry. I know how to package Python code and maintain several PyPI packages, so that's not the issue. The issue is that I lost interest in the problem, because I probably want to rewrite GenSON in Rust some day and extend it to generate SQL, which is what I use it for anyway, since better/jsonschema2db is kinda awful and hacky and very hard for me to extend. And it's unmaintained (apparently "Better" engineering does not include maintenance, har har).

I even have a cute name for it: NoNoSQL. I asked the @surrealdb guys about it, specifically @tobiemh, but he was a bit too busy to engage w/the idea lol. I can understand why, replacing PostgreSQL is a big task and my silly NoNoSQL idea probably below the bottom of the priority list.

You don't want to use my code in production as it stands because it never pops objects out of JsonInterpreter._json_objects. Meaning, the entire JSON stream read on stdin gets split into JSON strings which remain resident in memory forever. Now, your existing implementation may be doing the same and if so that's bad, but I wouldn't give you such PoC code in production haha.

This is probably fixable. But I won't be the one to do it in Python as Python is not even the right language for this very computationally expensive problem (schema generation and SQL conversion) anyway, which also happens to be a strongly-typed problem when you bring it into SQL land thus why Rust seems attractive to me.

Nevertheless you may edit my PoC as you wish. And if I ever finish NoNoSQL I'll let you know know ba dum tiss

ctrlcctrlv avatar Dec 21 '22 07:12 ctrlcctrlv

Oh, also, to make this not quite a hard "no": if @erezsh considers doing what I said above to make Lark better for streaming parsers, so I don't have to constantly tear down and recreate the grammar (another blocking issue performance-wise), then I will consider packaging my JSON-L parser for real, because it'll then be much easier (and more logical) for me to auto-rotate the parsed objects, keeping only a count of what we've seen. Even if I end up replacing GenSON for the things I use it for (which is really just a symptom of needing to replace jsonschema2db), I can see how a fast streaming JSON-L parser would be useful to projects long-term and would prevent many hacky implementations such as the one in GenSON.

I just don't like encouraging people to use code I know has serious issues however well it may work on small examples.

ctrlcctrlv avatar Dec 21 '22 07:12 ctrlcctrlv

@ctrlcctrlv Have you opened an issue in Lark to discuss these changes that you're proposing?

Also, are you aware of this solution? https://github.com/lark-parser/lark/issues/488#issuecomment-1235162911

erezsh avatar Dec 21 '22 14:12 erezsh

I wasn't, but it seems like it would work :)

ctrlcctrlv avatar Dec 21 '22 14:12 ctrlcctrlv