fix iterator_input_adapter consuming input before it should
Before this PR, iterator_input_adapter was indiscriminately consuming/advancing the input iterator in get_character() before actually knowing whether it should.
Example cases:
- iterator_input_adapter over a custom input (e.g. an input adapter over a stream) that can yield on iterator advance: even with strict=false, the iterator_input_adapter tried to consume one more item even if the parser or the SAX implementation had returned false
- iterator_input_adapter over a custom input (e.g. a vector) with multiple JSON documents chained together: even with strict=false, the iterator_input_adapter consumed one char more than it should, even if the parser or the SAX implementation had returned false
This implementation fixes the issue: it advances the iterator only if the current character has already been consumed; otherwise it waits for the next get_character() call to actually advance the iterator.
Pull request checklist
Read the Contribution Guidelines for detailed information.
- [x] Changes are described in the pull request
- [x] The test suite compiles and runs without error.
- [ ] Code coverage is 100%. Test cases can be added by editing the test suite.
- [x] The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header file single_include/nlohmann/json.hpp. The whole process is described here.
Coverage decreased (-0.03%) to 99.967% when pulling 2614b81092e8afdc1c3eedc2fa53cd2c0ecf3e47 on imwhocodes:iterator_input_adapter_fix into 87cda1d6646592ac5866dc703c8e1839046a6806 on nlohmann:develop.
I assume this is at least somewhat related to this discussion? #3411
Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)
I assume this is at least somewhat related to this discussion? https://github.com/nlohmann/json/discussions/3411
It could be, I'm going to check later
Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)
Yes, I was thinking of adding some unit tests afterwards too.
Should I create another fork of develop, add the test (that will fail), push, and then add the fixes on that branch, or should the fixes stay on this branch/PR?
Thanks
For small things like this you can always use Compiler Explorer.
A starting point: https://godbolt.org/z/c83xMcc77
Note that you can #include http(s) URLs, i.e., use #include <https://raw.githubusercontent.com/nlohmann/json/develop/single_include/nlohmann/json.hpp> for json.hpp from the develop branch.
Edit: To clarify: Add your test here. This is more for you to be sure your test actually demonstrates how your changes fixed something that would previously fail.
@falbrechtskirchinger Here is a link to Compiler Explorer with an example of the failure case
Edit: extended test
Thanks! That demonstrates the issue clearly enough.
@falbrechtskirchinger @nlohmann should strict now be exposed as a parameter (defaulted to true) via static basic_json parse( ..... , const bool strict = true)?
Does this allow to finally parse JSON lines (https://json.nlohmann.me/features/parsing/json_lines/)?
Sorry, can you expand on that? What is not working about json_lines?
If you are talking about something like this, the problem there is multi-fold and is not directly corrected by this PR.
The problems with friend std::istream& operator>>(std::istream& i, basic_json& j) are that:
- the newline \n, if present, is not consumed at the end of parsing a JSON object
- parsing of an empty string fails
If you want, I can push a PR to address that problem too.
[edit] I prepared a fix, but it is going to break unit-regression1.cpp; the NDJSON behaviour is not compatible with these tests.
[edit2] Proposed solution here
@imwhocodes Can you please check the comments?
Just came back from holiday; I'm going to address it in the coming days.
@imwhocodes Is there anything I can do to support this?
@imwhocodes Do you have a test case that demonstrates this error?
See quote below. I can share my test case which fails to reproduce the issue later.
@falbrechtskirchinger Here is a link to Compiler Explorer with an example of failure case
Edit: extended test
Thanks. I thought that since you couldn't reproduce it in a test case, we didn't have a working repro yet.
Sorry everybody for the delay. I added a simple test case for the fix itself: it is almost the same as the one used for the "Compiler Explorer" examples.
Thanks for your patience
@imwhocodes I could not independently write a test that would trigger your issue. Guess I should make that available so we can compare and figure out who's right.
Edit: To be crystal clear: I don't believe there's an issue to fix.
Can you show me your test case?
The situation where the current (not fixed) implementation fails is an input iterator where the act of reading actually removes data from the underlying buffer, e.g.:
- recv on a socket
- Arduino Stream
- a generic input stream
The current (not fixed) implementation of json::details::iterator_input_adapter actually consumes from the underlying buffer one char ahead of the one returned to the upper layer, ending up having already consumed one char past the end of the JSON when parsing stops.
In the link you can see this happening; you can change #define USE_FIX false at the top to see the difference.
I'm here for any questions, Luca
Ok, I understand the problem now. It's about advancing the iterator vs. dereferencing it. You can look at my test case in PR #3531. Either way, your fix breaks my use case.
I'll need to look at it a bit closer when I have some spare time, meaning not right now.
The logic problem with your test is that you are dereferencing a pointer to the last read char, and given that you are passing strict = true to the sax_parser, it is correct that the pointer/iterator points to the last char parsed.
When you dereference your iterator again to check where you are, you are consuming the same char again (the last one parsed by the parser).
Here lies the problem: in your test the iterator ends up where it actually should be; if it ended up one char after, the parameter strict = true would simply be broken.
The logic problem with your test is that you are dereferencing a pointer to the last read char, and given that you are passing strict = true to the sax_parser, it is correct that the pointer/iterator points to the last char parsed. When you dereference your iterator again to check where you are, you are consuming the same char again (the last one parsed by the parser)
Dereferencing the same iterator more than once should yield the same value, shouldn't it? After all "iterators are generalized pointers" (to quote the standard) and should have similar semantics.
Here lies the problem: in your test the iterator ends up where it actually should be; if it ended up one char after, the parameter strict = true would simply be broken
I agree, it ends up where it should end up unless I apply this patch. ;-)
Your code can easily be modified to work correctly with the current iterator_input_adapter implementation: https://godbolt.org/z/zvsxvPPs5
Why is the current iterator_input_adapter implementation wrong?
I'm interested in a 4.0 parse() function that returns the iterator where processing stopped. My test uses the proxy_iterator class to achieve this behavior with the current API. Your "fix" breaks both the workaround as well as any (future) proper implementation returning the internal iterator from the iterator_input_adapter. Thoughts?
If there's something to fix, it's maybe in the parser, and I'd argue that with strict = false you can achieve your desired behavior either way.
Also, I can't shake my initial impression of this being an unnecessary pessimization for most users (hopefully negligible thanks to modern branch predictors).
Got it, yes, moving the "consumed" logic to the upper layer is a solution too, but beware that this problem (internally advancing the pointer before consuming) affects the https://json.nlohmann.me/api/basic_json/operator_gtgt/ operator too (but there is another bug in it as well)
I'm aware of the operator>> issue and unfortunately, your proposed fix for that has similar problems and breaks existing code.
There are certainly improvements (requiring breaking changes) to be made and I hope 4.0 will do so.
Maybe the operators should avoid strict = true.
Yes, the operator for sure should not use strict, but the underlying problem remains: while parsing, one char too many is consumed.
Now I'm trying to write a std::iostream example, but it seems like the input_adapter constructor for that type was marked delete recently.
When I was doing my testing before this PR, iterating over a std::iostream had the same behaviour of consuming one char more than it should (even if you internally change strict = false, one char too many is consumed because the internal pointer advances before it should).
(I have a local branch where I was addressing that issue too)
@falbrechtskirchinger sorry, but this example that you gave me is broken too (undefined behaviour)
You are getting the value when you dereference the iterator, but it could already be iterator::end, so technically it is UB by dereferencing the end iterator.
Here is a failing test for your example. I had to move from std::string to std::vector, otherwise the hidden null terminator of the string would mask the bug; as you can see, the program crashes.
Edit: Also, with your proposed solution, even strict = true with correct JSON input is broken
Well, this is your code. :-)
I assumed you knew that your operator!= was broken. There's more work needed, but I'm confident it's possible.
I've taken a closer look and even though I'd recommend changes to both operator!= and operator++, the code works just fine with correct JSON input. Did you remove those last few characters on purpose?
...,'u','e','_','B' };
After completing the input as follows the assertion succeeds.
...,'u','e','_','B','"','}' };
And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.
I'm honestly starting to think that you are trolling me; the broken code is taken from your message, with your proposed solution, this message exactly:
Your code can easily be modified to work correctly with the current iterator_input_adapter implementation: https://godbolt.org/z/zvsxvPPs5
Code example where you moved the ConsumeChar logic from EmulateStream & operator++() to reference operator*() const
reference operator*() const {
    if(!consumed) {
        c = ConsumeChar(target);
        consumed = true;
    }
    return c;
}
And you proposed it as a solution, a solution where the end iterator is dereferenced
As you can see in my initial test case (here edited to be clearer), the operator!= is not broken and fully working
I've taken a closer look and even though I'd recommend changes to both operator!= and operator++, the code works just fine with correct JSON input. Did you remove those last few characters on purpose?
...,'u','e','_','B' };
After completing the input as follows the assertion succeeds.
...,'u','e','_','B','"','}' };
Yes, I removed them on purpose to show you that on incomplete input your solution is UB
About that:
And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.
This is not a valid solution: on a STREAM-like input you don't have an empty() or size() function, so you CANNOT know ahead of time how much data you have left. It is a stream, a flow of information, not a container
I modified the pertinent bits of your code to make the unaltered example work. operator!= is broken in the sense that I can construct two instances of EmulateStream that return an unexpected result for operator!=. You can call that UB if you want.
Code example where you moved the ConsumeChar logic from EmulateStream & operator++() to reference operator*() const
reference operator*() const { if(!consumed) { c = ConsumeChar(target); consumed = true; } return c; }
And you proposed it as a solution, a solution where the end iterator is dereferenced
I still don't see that happening, sorry. iterator_input_adapter does not dereference the end iterator (EmulateStream).
As you can see in my initial test case (here edited to be clearer) the operator!= is not broken and fully working
Yes I removed on purpose to show you that on incomplete input your solution is UB
How? The program is crashing on your assertion, not because the end iterator is dereferenced?
This is not a valid solution: on a STREAM-like input you don't have an empty() or size() function, so you CANNOT know ahead of time how much data you have left. It is a stream, a flow of information, not a container
Well, the iterator_input_adapter needs to know if it's reached the end…
@falbrechtskirchinger ok, let's start fresh:
This is your proposed solution: obtaining the char value during reference operator*() const. But as you can see here, this is broken because you can dereference a ::end pointer; to make it clear, I changed str->front() (UB) to str->at(0) (throw).
This is my original solution: obtaining the char value during EmulateStream & operator++() and then returning it in reference operator*() const. As you can see here, like this you can learn whether the underlying stream is finished without needing non-existent functions like empty() or size(); in the example you can see a try/catch to manage/detect the end of the stream (this is an example; it could be a socket returning -1 or EOF on recv, or whatever the underlying stream's technique is of telling you it is finished).
That said, my solution can only work if the upper layer (the parser) doesn't try to read one char more than needed. If on the top line you change #define USE_FIX true, this will use my repo (with false, current master), and you will notice in the terminal that more chars are parsed than there should be when using current master.
- This happens because in current master the input_adapter is advanced before actually consuming the data, so you end up with it pointing one char before the last char read.
- If the underlying data is a STREAM, which is NOT a container (so you have neither .size(), .empty(), nor a dereference operator), advancing the iterator means consuming one char from the stream itself.
If you read the last two sentences you can understand where that one char is lost.
So, I finally reproduced the dereference you were talking about. You're not using the last code I posted, which includes tweaked operator== and operator!= which avoid the dereference of the end iterator.
This is why we were talking past one another. We were using different code.
Here's the quote with the update:
And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.
Yes, I have seen the code at https://godbolt.org/z/Ee1Pr3Gse and this is why I'm telling you that it is not valid: if we want this to work with a stream (even simply std::iostream), we can NOT use the .empty() or .size() member functions because they do NOT exist on a stream-like input. It is not a container, it is a stream; it can be either open or closed, it has no size.
Please read this message carefully, everything is explained there.
I've read your message.
You're trying to shoehorn a stream into the iterator interface. If you insist on parsing JSON directly from a stream, it's on you to find a way to make it work. My point is that the library code is working as intended and your "fix" breaks other valid code. I'm sorry I didn't catch this right at the beginning.
If you want help with your iterator code for a specific stream API (beyond the contrived examples using std::string and std::vector), open a discussion. I'm happy to help whenever I have time. There's always some mechanism to tell if a stream is at the end. As a last resort, you can try reading from it, but there's probably a better way.
Let me address one more point of your argument directly:
(...) you will notice in the terminal that more chars are parsed than there should be when using current master
- This happens because in current master the input_adapter is advanced before actually consuming the data, so you end up with it pointing one char before the last char read.
Advancing the iterator to the next character should never be a problem. I'll quote the standard to you again: "Iterators are generalized pointers". They are supposed to follow similar semantics. Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.
@nlohmann @gregmarr We've reached a stalemate and need someone else to chime in.
The real-scenario problem is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead. This means that if you receive one JSON document on a socket, you need the first char of the second to finish parsing the first, and parsing the second one will fail because it is missing its own first char.
IMHO reading on std::advance is the best way (for sure not during dereferencing in the custom adapter). I don't see any problem with it: if the value is valid you keep running, otherwise you return ::end and the parser will manage it.
About breaking the API, I don't think that is true: all ctest tests pass correctly and the final behaviour is the same for all common functions.
My idea is that it is better to have the input_adapter pointing to the last consumed iterator/value, leaving the door open to any stream implementation (and in your proxy_adapter you simply do a ++ to get the position to start parsing again from, or whatever), rather than closing that door; at least like this you have both options at your disposal.
Giving some context: personally I started using nlohmann::json for the very powerful combo of the sax_parser with a stream as input, giving me the ability to parse JSON and MessagePack that are bigger than the available RAM with a custom sax_parser, keeping only the data I'm actually interested in (and also without any heap allocations, mostly on MCUs). This is why I keep pushing on it; it is a real (and mostly hidden) killer feature of this library.
Anyway, let's see what @nlohmann @gregmarr think, Luca
The real-scenario problem is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead, (...)
You can avoid that with strict = false. See https://godbolt.org/z/Ee1Pr3Gse.
About breaking the API, I don't think that is true: all ctest tests pass correctly and the final behaviour is the same for all common functions
It's a breaking change and I've demonstrated why. Passing the unit tests does not invalidate that assertion.
My idea is that it is better to have the input_adapter pointing to the last consumed iterator/value, leaving the door open to any stream implementation (and in your proxy_adapter you simply do a ++ to get the position to start parsing again from, or whatever), rather than closing that door; at least like this you have both options at your disposal
You're breaking expected iterator semantics. The past-the-end iterator points past the end because otherwise, we'd have no way to represent an empty range. Your suggestion to "simply do a ++" would not be sound because of this.
Giving some context: personally I started using nlohmann::json for the very powerful combo of the sax_parser with a stream as input, giving me the ability to parse JSON and MessagePack that are bigger than the available RAM with a custom sax_parser, keeping only the data I'm actually interested in (and also without any heap allocations, mostly on MCUs). This is why I keep pushing on it; it is a real (and mostly hidden) killer feature of this library
Nothing prevents you from doing that. You can implement your own iterator to behave as you describe. Just don't break iterator semantics for everyone else because it's convenient for one use case out of many. Any C++ developer, understanding common iterator semantics, can implement your behavior on top of the current implementation without special knowledge about the implementation. Your "fix" changes semantics in a way where that'd no longer be true, as the past-the-end iterator no longer points past the end, but at the end (which is impossible in case of an empty range, creating another problem).
In summary: C++ developers understand iterators, they expect them to follow established rules. The concept of the past-the-end iterator is fundamental, don't mess with that. What you want can be done without modifying the library code. At most, we should improve documentation around the subject.
I rest my case.
The real-scenario problem is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead, (...)
You can avoid that with strict = false. See https://godbolt.org/z/Ee1Pr3Gse.
No, this doesn't fix the problem; it is not about strict, it is about the parsing that happens before the strict check.
About breaking the API, I don't think that is true: all ctest tests pass correctly and the final behaviour is the same for all common functions
It's a breaking change and I've demonstrated why. Passing the unit tests does not invalidate that assertion.
My idea is that it is better to have the input_adapter pointing to the last consumed iterator/value, leaving the door open to any stream implementation (and in your proxy_adapter you simply do a ++ to get the position to start parsing again from, or whatever), rather than closing that door; at least like this you have both options at your disposal
You're breaking expected iterator semantics. The past-the-end iterator points past the end because otherwise, we'd have no way to represent an empty range. Your suggestion to "simply do a ++" would not be sound because of this.
I never said that I want to break the past-the-end iterator, simply that the value pointed to by the iterator when parsing finishes should be the last parsed char and not the one after it
Giving some context: personally I started using nlohmann::json for the very powerful combo of the sax_parser with a stream as input (...)
Nothing prevents you from doing that. You can implement your own iterator to behave as you describe. Just don't break iterator semantics for everyone else because it's convenient for one use case out of many. (...)
Yes, the fact is that the upper layer is advancing the iterator one position ahead of the one actually used.
In summary: C++ developers understand iterators, they expect them to follow established rules. The concept of the past-the-end iterator is fundamental, don't mess with that. What you want can be done without modifying the library code. At most, we should improve documentation around the subject.
I rest my case.
I reiterate, it is not about breaking iterators; as this library itself does, I want to stick to this specification: https://en.cppreference.com/w/cpp/named_req/InputIterator
Explaining it as a simple loop where `bool parse(char)` returns `false` when the last `}` in the JSON is encountered:
- Current implementation is:

```cpp
auto it = data.begin();
while (it != data.end()) {
    auto value = *it;
    it++;
    if (!parse(value)) {
        break;
    }
}
// Now the iterator points to one item after the last actually used
```
- Proposed implementation is:

```cpp
auto it = data.begin();
while (it != data.end()) {
    auto value = *it;
    if (!parse(value)) {
        break;
    }
    it++;
}
// Now the iterator points to the last item that was actually used
```
I hope it is clear now where the problem lies.
I've looked at the discussion, and thought a lot about this, and done some research, and I'm still not sure if it's possible to "fix" this use case at this level. I looked at https://en.cppreference.com/w/cpp/iterator/istream_iterator and it seems like it should be possible for the stream iterator itself, rather than the json adapter, to handle this. However, the `EmulateStream` in the example seems to do exactly what I expect the stream iterator does.
I don't know how to reconcile the "leaving the iterator pointing at the last thing read" with "the iterator needs to be pointing to the next thing to read in order to start parsing".
On the other hand, there is no visible change as far as the caller of `iterator_input_adapter::get_character()` is concerned, and that's the only API that class has.
~There are two callers of that function. One has `unget()` support, and the other doesn't. Could the real problem be the lack of `unget()` support in `binary_reader.hpp`?~
That's JSON vs CBOR, not regular parse vs sax parse, got the two confused, never mind.
@falbrechtskirchinger

> Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

That's what `std::istream_iterator` does too.

https://en.cppreference.com/w/cpp/iterator/istream_iterator

> std::istream_iterator is a single-pass input iterator that reads successive objects of type T from the std::basic_istream object for which it was constructed, by calling the appropriate operator>>. The actual read operation is performed when the iterator is incremented, not when it is dereferenced. The first object is read when the iterator is constructed. Dereferencing only returns a copy of the most recently read object.
The real problem scenario is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead. This means that if you receive one JSON document on a socket, you need the first char of the second document to finish parsing the first, and the second one will fail because it's missing its own first char.
The `unget()` support I mention would help with "the second one will fail missing its own first char" but not with "you need the first char of the second to finish parsing the first" if the read blocks waiting for the next char.

Having a newline or other whitespace after the last char of the first would actually fix both issues, but that doesn't help if you can't change the source.
> @falbrechtskirchinger
>
> > Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.
>
> That's what `std::istream_iterator` does too.
The difference being that `std::istream_iterator` is designed to be an adapter between streams and iterators. `iterator_input_adapter` is an adapter between the parser interface (i.e., `get_character()`) and iterators (i.e., NOT streams).

If you want a `stream_like_input_adapter`, build one, and let's add that. I've no objections to a solution that doesn't break `iterator_input_adapter`'s current behavior.
@falbrechtskirchinger I just went looking through the discussions in this PR for it and couldn't find it. Can you remind me again which case is broken with this change?
> @falbrechtskirchinger I just went looking through the discussions in this PR for it and couldn't find it. Can you remind me again which case is broken with this change?
See #3548.
Thanks, that makes sense. I hadn't been considering the underlying iterator surviving beyond the adapter.
> @falbrechtskirchinger
>
> > Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.
>
> That's what `std::istream_iterator` does too.
Another thought: Combining an `EmulateStream` that reads on advance with the `proxy_iterator` from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.
Once again, there are several (external) solutions to this problem. Ideally, we should work towards making `input_adapter` part of the public API and allow users to pass an `input_adapter` to `parse()` directly.
In some way I cannot understand why we should preserve an internal behaviour that is not defined by the API; on the other side, it is possible that people are relying on that hidden/implicit behaviour.
What I'm proposing now is to split `input_adapter` like this, with the use of some SFINAE and/or `std::iterator_traits`:

- Keep the current implementation of `input_adapter` but mark it as deprecated
- If `user_iterator` satisfies the forward-iterator category, use the current implementation
- If `user_iterator` is an input-only iterator like `std::istream_iterator`, use the proposed implementation

What do you think about it? @gregmarr @falbrechtskirchinger
> Combining an EmulateStream that reads on advance with the proxy_iterator from https://github.com/nlohmann/json/pull/3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.
I took a `stringstream`, created an `istream_iterator` on it, wrapped it in a minor modification to the `proxy_iterator` that takes the iterator type instead of the container type, and that works. If `sax_parse` returned the iterator at the end, that would eliminate the need for the `proxy_iterator`. Since it doesn't, you need the disposable wrapper so you can keep the `istream_iterator` that potentially contains the next character around to use it for the next `sax_parse`.

https://godbolt.org/z/M5cYfbcbW
> > Combining an EmulateStream that reads on advance with the proxy_iterator from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.
>
> I took a `stringstream`, created an `istream_iterator` on it, wrapped it in a minor modification to the `proxy_iterator` that takes the iterator type instead of the container type, and that works. If `sax_parse` returned the iterator at the end, that would eliminate the need for the `proxy_iterator`. Since it doesn't, you need the disposable wrapper so you can keep the `istream_iterator` that potentially contains the next character around to use it for the next `sax_parse`.
>
> https://godbolt.org/z/M5cYfbcbW
Got it, but now you always need to keep the `istream_iterator` around in scope, otherwise when it gets destructed you lose a char.

See https://godbolt.org/z/1f9sGW6cE (btw, funny that losing one char makes the JSON still valid but with a different meaning)
I'm still more convinced that what I proposed previously could be a good solution covering all scenarios:
> In some way I cannot understand why we should preserve an internal behaviour that is not defined by the API; on the other side, it is possible that people are relying on that hidden/implicit behaviour.
>
> What I'm proposing now is to split `input_adapter` like this, with the use of some SFINAE and/or `std::iterator_traits`:
>
> - Keep the current implementation of `input_adapter` but mark it as deprecated
> - If `user_iterator` satisfies the forward-iterator category, use the current implementation
> - If `user_iterator` is an input-only iterator like `std::istream_iterator`, use the proposed implementation
>
> What do you think about it? @gregmarr @falbrechtskirchinger
How should we proceed here?
As is, this PR would conflict with #3548. Unless @imwhocodes wants to try another solution, this can be closed.