JSON.jl icon indicating copy to clipboard operation
JSON.jl copied to clipboard

Reading past end of JSON data in a stream

Open ScottPJones opened this issue 7 years ago • 8 comments

While investigating an issue with my changes for better performance and to fix numeric parsing issues, I realized that parsing a number can "eat" the next character in a stream. The way that the async.jl unit test works, by writing out a stream of JSON values, and then reading them on another process, breaks if there is not another character to separate the values, and just a number is being read. I attempted to fix this using Base.peek, however on v0.6.2, it doesn't work for the type TCPStream. I'm not sure what the behavior of a similar test in JavaScript or Python would be.

ScottPJones avatar Feb 09 '18 16:02 ScottPJones

We should put back any characters we are eating. Is there a reason peek doesn't work on TCPStream?

TotalVerb avatar Feb 10 '18 02:02 TotalVerb

For one thing, peek isn't even exported, and it is not implemented for TCPStream in v0.6.2. I haven't tried yet on master.

(The peek implementation looked like it might hurt performance a lot, also, at least one of the methods, that does a try / finally and saves / restores the position in the stream, after reading a single byte) A better approach in that case would be to only restore the position in the case of reading a JSON number when finding something that isn't a digit (after a -, ., e, E or + (in an exponent), one or more digits are always required).

ScottPJones avatar Feb 10 '18 16:02 ScottPJones

What I mean is, is there a reason why peek is not implemented (in Base), as it seems this would be a really useful feature overall?

TotalVerb avatar Feb 11 '18 04:02 TotalVerb

I'm not sure why it was left unexported in master, but at least there it is better supported (instead of just 2 methods, it has 7, including peek(s::Base.LibuvStream) in Base at stream.jl:1182, which should handle TCPSocket. (I misremembered the name before, it's TCPSocket instead of TCPStream)

ScottPJones avatar Feb 11 '18 11:02 ScottPJones

doesn't work for the type TCPStream.

You'll likely also run into trouble with wrappers like SSLContext (almost every HTTP connection is a HTTPS connection nowdays).

To me having a peek that only returns one character is a bit arbitrary. I think it would be more useful to have a peek that returns a view of however many bytes are available in whatever buffer the IO implementation has. e.g. it would be handy to peek in the buffer for presence of e.g. \r\n\r\n before reading.

samoconnor avatar Feb 18 '18 09:02 samoconnor

One character (really, 1 byte) look-ahead is all that the JSON format needs, and that is only for numbers (everything else is terminated by a byte that is known in advance, i.e. '}', ']', ':'or '"'). Having a function that could read the entire buffer of ready to be read bytes would also be quite useful, for other things.

ScottPJones avatar Feb 19 '18 21:02 ScottPJones

I hit this bug today while accidentally reading some SOAP responses. Having the response wiped while doing a JSON read feels like a bug. In this example the original string is preserved, but in my debugging session, I was dealing with the array output initially not a string.

julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"

julia> array = Vector{UInt8}(body)
28-element Array{UInt8,1}:
 0x3c
 0x58
 0x4d
 0x4c
 0x3e
 0x54
 0x68
 0x69
 0x73
 0x20
 0x69
 0x73
 0x20
 0x6e
 0x6f
 0x74
 0x20
 0x4a
 0x53
 0x4f
 0x4e
 0x3c
 0x2f
 0x58
 0x4d
 0x4c
 0x3e
 0x0a

julia> JSON.parse(String(array))
ERROR: Unexpected character
Line: 0
Around: ...<XML>This is not JSON<...
            ^

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] _error(::String, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:140
 [3] parse_jsconstant(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:193
 [4] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:170
 [5] #parse#1(::Type, ::Type{Int64}, ::Bool, ::Nothing, ::typeof(JSON.Parser.parse), ::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:460
 [6] parse(::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:458
 [7] top-level scope at REPL[19]:1

julia> array
0-element Array{UInt8,1}

julia> body
"<XML>This is not JSON</XML>\n"

Basically, reading off the end of the array causes the array to get dropped. I would rather it not drop the array.

It looks like this thread here is related.

https://github.com/JuliaLang/julia/issues/24388

peteristhegreat avatar Dec 04 '20 20:12 peteristhegreat

How is this related to JSON.jl?

julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"

julia> array = Vector{UInt8}(body);

julia> String(array);

julia> array
UInt8[]

KristofferC avatar Dec 05 '20 10:12 KristofferC