Reading past end of JSON data in a stream
While investigating an issue with my changes for better performance and to fix numeric parsing issues, I realized that parsing a number can "eat" the next character in a stream.
The way that the async.jl unit test works, by writing out a stream of JSON values, and then reading them on another process, breaks if there is not another character to separate the values, and just a number is being read.
I attempted to fix this using Base.peek, however on v0.6.2, it doesn't work for the type TCPStream.
I'm not sure what the behavior of a similar test in JavaScript or Python would be.
We should put back any characters we are eating. Is there a reason peek doesn't work on TCPStream?
For one thing, peek isn't even exported, and it is not implemented for TCPStream in v0.6.2.
I haven't tried yet on master.
(The peek implementation looked like it might hurt performance a lot, also, at least one of the methods, that does a try / finally and saves / restores the position in the stream, after reading a single byte)
A better approach in that case would be to only restore the position in the case of reading a JSON number when finding something that isn't a digit (after a -, ., e, E or + (in an exponent), one or more digits are always required).
What I mean is, is there a reason why peek is not implemented (in Base), as it seems this would be a really useful feature overall?
I'm not sure why it was left unexported in master, but at least there it is better supported (instead of just 2 methods,
it has 7, including peek(s::Base.LibuvStream) in Base at stream.jl:1182, which should handle TCPSocket.
(I misremembered the name before, it's TCPSocket instead of TCPStream)
doesn't work for the type TCPStream.
You'll likely also run into trouble with wrappers like SSLContext (almost every HTTP connection is a HTTPS connection nowdays).
To me having a peek that only returns one character is a bit arbitrary. I think it would be more useful to have a peek that returns a view of however many bytes are available in whatever buffer the IO implementation has. e.g. it would be handy to peek in the buffer for presence of e.g. \r\n\r\n before reading.
One character (really, 1 byte) look-ahead is all that the JSON format needs, and that is only for numbers (everything else is terminated by a byte that is known in advance, i.e. '}', ']', ':'or '"').
Having a function that could read the entire buffer of ready to be read bytes would also be quite useful, for other things.
I hit this bug today while accidentally reading some SOAP responses. Having the response wiped while doing a JSON read feels like a bug. In this example the original string is preserved, but in my debugging session, I was dealing with the array output initially not a string.
julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"
julia> array = Vector{UInt8}(body)
28-element Array{UInt8,1}:
0x3c
0x58
0x4d
0x4c
0x3e
0x54
0x68
0x69
0x73
0x20
0x69
0x73
0x20
0x6e
0x6f
0x74
0x20
0x4a
0x53
0x4f
0x4e
0x3c
0x2f
0x58
0x4d
0x4c
0x3e
0x0a
julia> JSON.parse(String(array))
ERROR: Unexpected character
Line: 0
Around: ...<XML>This is not JSON<...
^
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] _error(::String, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:140
[3] parse_jsconstant(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:193
[4] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:170
[5] #parse#1(::Type, ::Type{Int64}, ::Bool, ::Nothing, ::typeof(JSON.Parser.parse), ::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:460
[6] parse(::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:458
[7] top-level scope at REPL[19]:1
julia> array
0-element Array{UInt8,1}
julia> body
"<XML>This is not JSON</XML>\n"
Basically, reading off the end of the array causes the array to get dropped. I would rather it not drop the array.
It looks like this thread here is related.
https://github.com/JuliaLang/julia/issues/24388
How is this related to JSON.jl?
julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"
julia> array = Vector{UInt8}(body);
julia> String(array);
julia> array
UInt8[]