fix iterator_input_adapter consuming input before it should
Before this PR, iterator_input_adapter was indiscriminately consuming/advancing the input iterator in get_character() before actually knowing whether it should.
Example cases:
- iterator_input_adapter over a custom input (e.g. an input adapter over a stream) that can yield on iterator advance: even with strict=false, the iterator_input_adapter tried to consume one more item even if the parser or the SAX implementation had returned false
- iterator_input_adapter over a custom input (e.g. a vector) with multiple JSON documents chained together: even with strict=false, the iterator_input_adapter consumed one char more than it should, even if the parser or the SAX implementation had returned false
This implementation fixes the issue: it advances the iterator only if the current character has already been consumed; otherwise it waits for the next get_character() call to actually advance the iterator.
Pull request checklist
Read the Contribution Guidelines for detailed information.
- [x] Changes are described in the pull request
- [x] The test suite compiles and runs without error.
- [ ] Code coverage is 100%. Test cases can be added by editing the test suite.
- [x] The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header file single_include/nlohmann/json.hpp. The whole process is described here.
Coverage decreased (-0.03%) to 99.967% when pulling 2614b81092e8afdc1c3eedc2fa53cd2c0ecf3e47 on imwhocodes:iterator_input_adapter_fix into 87cda1d6646592ac5866dc703c8e1839046a6806 on nlohmann:develop.
I assume this is at least somewhat related to this discussion? #3411
Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)
I assume this is at least somewhat related to this discussion? https://github.com/nlohmann/json/discussions/3411
It could be, I'm going to check later
Can you construct a unit test that fails with develop and succeeds after your PR? (You should probably add a unit test anyway.)
Yes, I was thinking of adding some unit tests afterwards too.
Should I create another fork of develop, add the test (that will fail), push, and then add the fixes on that branch, or should the fixes stay on this branch/PR?
Thanks
For small things like this you can always use Compiler Explorer.
A starting point: https://godbolt.org/z/c83xMcc77
Note that you can #include http(s) URLs, i.e., use #include <https://raw.githubusercontent.com/nlohmann/json/develop/single_include/nlohmann/json.hpp> for json.hpp from the develop branch.
Edit: To clarify: Add your test here. This is more for you to be sure your test actually demonstrates how your changes fixed something that would previously fail.
@falbrechtskirchinger Here is a link to Compiler Explorer with an example of the failure case
Edit: extended test
Thanks! That demonstrates the issue clearly enough.
@falbrechtskirchinger @nlohmann should strict now be exposed as a parameter (defaulted to true) via static basic_json parse( ..... , const bool strict = true)?
Does this allow to finally parse JSON lines (https://json.nlohmann.me/features/parsing/json_lines/)?
Sorry, can you expand on that? What is not working about json_lines?
If you are talking about something like this, the problem there is multi-fold and is not directly corrected by this PR.
The problems with friend std::istream& operator>>(std::istream& i, basic_json& j) are that:
- the newline \n, if present, is not consumed at the end of parsing a JSON object
- parsing of an empty string fails
If you want, I can push a PR to address that problem too.
[edit] I prepared a fix, but it is going to break unit-regression1.cpp; the NDJSON behaviour is not compatible with these tests.
[edit2] Proposed solution here
@imwhocodes Can you please check the comments?
Just came back from holiday; I'm going to address it in the coming days.
@imwhocodes Is there anything I can do to support this?
@imwhocodes Do you have a test case that demonstrates this error?
See quote below. I can share my test case which fails to reproduce the issue later.
@falbrechtskirchinger Here is a link to Compiler Explorer with an example of failure case
Edit: extended test
Thanks. I thought that since you couldn't reproduce it in a test case, we didn't have a working repro yet.
Sorry everybody for the delay. I added a simple test case for the fix itself: it is almost the same as the one used for the "Compiler Explorer" examples.
Thanks for your patience
@imwhocodes I could not independently write a test that would trigger your issue. Guess I should make that available so we can compare and figure out who's right.
Edit: To be crystal clear: I don't believe there's an issue to fix.
Can you show me your test case?
The situation where the current (not fixed) implementation fails is an input iterator where the act of reading actually removes data from the underlying buffer, e.g.:
- recv on a socket
- Arduino Stream
- a generic input stream
The current (not fixed) implementation of json::details::iterator_input_adapter actually consumes from the underlying buffer one char ahead of the one returned to the upper layer, ending up having already consumed one char past the end of the JSON when parsing stops.
In the link you can see this happening; you can change #define USE_FIX false at the top to see the difference.
I'm here for any questions, Luca
Ok, I understand the problem now. It's about advancing the iterator vs. dereferencing it. You can look at my test case in PR #3531. Either way, your fix breaks my use case.
I'll need to look at it a bit closer when I have some spare time, meaning not right now.
The logic problem with your test is that you are dereferencing a pointer to the last read char, and given that you are passing strict = true to the sax_parser, it is correct that the pointer/iterator points to the last char parsed.
When you dereference your iterator again to check where you are, you are consuming the same char again (the last one parsed by the parser).
Here lies the problem: in your test the iterator ends up where it actually should be; if it ended up one char after, the parameter strict = true would simply be broken.
The logic problem with your test is that you are dereferencing a pointer to the last read char, and given that you are passing strict = true to the sax_parser, it is correct that the pointer/iterator points to the last char parsed. When you dereference your iterator again to check where you are, you are consuming the same char again (the last one parsed by the parser)
Dereferencing the same iterator more than once should yield the same value, shouldn't it? After all "iterators are generalized pointers" (to quote the standard) and should have similar semantics.
Here lies the problem: in your test the iterator ends up where it actually should be; if it ended up one char after, the parameter strict = true would simply be broken
I agree, it ends up where it should end up unless I apply this patch. ;-)
Your code can easily be modified to work correctly with the current iterator_input_adapter implementation: https://godbolt.org/z/zvsxvPPs5
Why is the current iterator_input_adapter implementation wrong?
I'm interested in a 4.0 parse() function that returns the iterator where processing stopped. My test uses the proxy_iterator class to achieve this behavior with the current API. Your "fix" breaks both the workaround as well as any (future) proper implementation returning the internal iterator from the iterator_input_adapter. Thoughts?
If there's something to fix, it's maybe in the parser, and I'd argue that with strict = false you can achieve your desired behavior either way.
Also, I can't shake my initial impression of this being an unnecessary pessimization for most users (hopefully negligible thanks to modern branch predictors).
Got it, yes, moving the "consumed" logic to the upper layer is a solution too, but beware that this problem (internally advancing the pointer before consuming) affects the https://json.nlohmann.me/api/basic_json/operator_gtgt/ operator too (but there is another bug in it as well)
I'm aware of the operator>> issue and unfortunately, your proposed fix for that has similar problems and breaks existing code.
There are certainly improvements (requiring breaking changes) to be made and I hope 4.0 will do so.
Maybe the operators should avoid strict = true.
Yes, the operator for sure should not use strict, but the underlying problem remains: while parsing, one char too many is consumed.
Now I'm trying to write a std::iostream example, but it seems like the input_adapter constructor for that type was marked delete recently.
When I was doing my testing before this PR, iterating over a std::iostream had the same behaviour of consuming one char more than it should (even if you internally change strict = false, one char too many is consumed because the internal pointer advances before it should).
(I have a local branch where I was addressing that issue too)
@falbrechtskirchinger sorry, but this example that you gave me is broken too (undefined behaviour)
You are getting the value when you dereference the iterator, but it could already be iterator::end, so technically it is UB by dereferencing the end iterator.
Here is a failing test for your example. I had to move from std::string to std::vector, otherwise the hidden null terminator of the string would mask the bug; as you can see, the program crashes.
Edit: Also, with your proposed solution, even strict = true with correct JSON input is broken
Well, this is your code. :-)
I assumed you knew that your operator!= was broken. There's more work needed, but I'm confident it's possible.
I've taken a closer look and even though I'd recommend changes to both operator!= and operator++, the code works just fine with correct JSON input. Did you remove those last few characters on purpose?
...,'u','e','_','B' };
After completing the input as follows the assertion succeeds.
...,'u','e','_','B','"','}' };
And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.
I'm honestly starting to think that you are trolling me; the broken code is taken from your message, with your proposed solution, this message exactly:
Your code can easily be modified to work correctly with the current iterator_input_adapter implementation: https://godbolt.org/z/zvsxvPPs5
Code example where you moved the ConsumeChar logic from EmulateStream & operator++() to reference operator*() const
reference operator*() const {
    if(!consumed) {
        c = ConsumeChar(target);
        consumed = true;
    }
    return c;
}
And you proposed it as a solution, a solution where the end iterator is dereferenced
As you can see in my initial test case (here edited to be clearer), the operator!= is not broken and fully working
I've taken a closer look and even though I'd recommend changes to both operator!= and operator++, the code works just fine with correct JSON input. Did you remove those last few characters on purpose?
...,'u','e','_','B' };
After completing the input as follows the assertion succeeds.
...,'u','e','_','B','"','}' };
Yes, I removed them on purpose to show you that on incomplete input your solution is UB
About that:
And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.
This is not a valid solution: on a STREAM-like input you don't have an empty() or size() function, so you CANNOT know ahead of time how much data you have left. It is a stream, a flow of information, not a container
I modified the pertinent bits of your code to make the unaltered example work. operator!= is broken in the sense that I can construct two instances of EmulateStream that return an unexpected result for operator!=. You can call that UB if you want.
Code example where you moved the ConsumeChar logic from EmulateStream & operator++() to reference operator*() const
reference operator*() const { if(!consumed) { c = ConsumeChar(target); consumed = true; } return c; }
And you proposed it as a solution, a solution where the end iterator is dereferenced
I still don't see that happening, sorry. iterator_input_adapter does not dereference the end iterator (EmulateStream).
As you can see in my initial test case (here edited to be clearer) the operator!= is not broken and fully working
Yes I removed on purpose to show you that on incomplete input your solution is UB
How? The program is crashing on your assertion, not because the end iterator is dereferenced?
This is not a valid solution: on a STREAM-like input you don't have an empty() or size() function, so you CANNOT know ahead of time how much data you have left. It is a stream, a flow of information, not a container
Well, the iterator_input_adapter needs to know if it's reached the end…
@falbrechtskirchinger ok, let's start fresh:
This is your proposed solution: obtaining the char value during reference operator*() const. But as you can see here, this is broken because you can dereference a ::end pointer; to make it clear, I changed str->front() (UB) to str->at(0) (throw).
This is my original solution: obtaining the char value during EmulateStream & operator++() and then returning it in reference operator*() const. As you can see here, like this you can learn whether the underlying stream is finished without needing non-existent functions like empty() or size(); in the example you can see a try/catch to manage/detect the end of the stream (this is an example; it could be a socket returning -1 or EOF on recv, or whatever the underlying stream's technique is of telling you it is finished).
That said, my solution can only work if the upper layer (the parser) doesn't try to read one char more than needed. If on the top line you change #define USE_FIX true, this will use my repo (with false, current master), and you will notice in the terminal that more chars are parsed than there should be when using current master.
- This happens because in current master the input_adapter is advanced before actually consuming the data, so you end up with it pointing one char before the last char read.
- If the underlying data is a STREAM, which is NOT a container (so you have neither .size(), .empty(), nor a dereference operator), advancing the iterator means consuming one char from the stream itself.
If you read the last two sentences you can understand where that one char is lost.
So, I finally reproduced the dereference you were talking about. You're not using the last code I posted, which includes tweaked operator== and operator!= which avoid the dereference of the end iterator.
This is why we were talking past one another. We were using different code.
Here's the quote with the update:
And here's the updated code with some improvements to the operators: https://godbolt.org/z/Ee1Pr3Gse
Possibly still not 100% correct but at least closer.
Yes, I have seen the code at https://godbolt.org/z/Ee1Pr3Gse and this is why I'm telling you that it is not valid: if we want this to work with a stream (even simply std::iostream), we can NOT use the .empty() or .size() member functions because they do NOT exist on a stream-like input. It is not a container, it is a stream; it can be either open or closed, it has no size.
Please read this message carefully, everything is explained there.
I've read your message.
You're trying to shoehorn a stream into the iterator interface. If you insist on parsing JSON directly from a stream, it's on you to find a way to make it work. My point is that the library code is working as intended and your "fix" breaks other valid code. I'm sorry I didn't catch this right at the beginning.
If you want help with your iterator code for a specific stream API (beyond the contrived examples using std::string and std::vector), open a discussion. I'm happy to help whenever I have time. There's always some mechanism to tell if a stream is at the end. As a last resort, you can try reading from it, but there's probably a better way.
Let me address one more point of your argument directly:
(...) you will notice in the terminal that more chars are parsed than there should be when using current master
- This happens because in current master the input_adapter is advanced before actually consuming the data, so you end up with it pointing one char before the last char read.
Advancing the iterator to the next character should never be a problem. I'll quote the standard to you again: "Iterators are generalized pointers". They are supposed to follow similar semantics. Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.
@nlohmann @gregmarr We've reached a stalemate and need someone else to chime in.
The real-scenario problem is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead. This means that if you receive one JSON document on a socket, you need the first char of the second to finish parsing the first, and parsing the second one will fail because it is missing its own first char.
IMHO reading on std::advance is the best way (for sure not during dereferencing in the custom adapter). I don't see any problem with it: if the value is valid you keep running, otherwise you return ::end and the parser will manage it.
About breaking the API, I don't think that is true: all ctest tests pass correctly and the final behaviour is the same for all common functions.
My idea is that it is better to have the input_adapter pointing to the last consumed iterator/value, leaving the door open to any stream implementation (and in your proxy_adapter you simply do a ++ to get the position to start parsing again from, or whatever), rather than closing that door; at least like this you have both options at your disposal.
Giving some context: personally I started using nlohmann::json for the very powerful combo of the sax_parser with a stream as input, giving me the ability to parse JSON and MessagePack that are bigger than the available RAM with a custom sax_parser, keeping only the data I'm actually interested in (and also without any heap allocations, mostly on MCUs). This is why I keep pushing on it; it is a real (and mostly hidden) killer feature of this library.
Anyway, let's see what @nlohmann @gregmarr think, Luca
The real-scenario problem is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead, (...)
You can avoid that with strict = false. See https://godbolt.org/z/Ee1Pr3Gse.
About breaking the API, I don't think that is true: all ctest tests pass correctly and the final behaviour is the same for all common functions
It's a breaking change and I've demonstrated why. Passing the unit tests does not invalidate that assertion.
My idea is that it is better to have the input_adapter pointing to the last consumed iterator/value, leaving the door open to any stream implementation (and in your proxy_adapter you simply do a ++ to get the position to start parsing again from, or whatever), rather than closing that door; at least like this you have both options at your disposal
You're breaking expected iterator semantics. The past-the-end iterator points past the end because otherwise, we'd have no way to represent an empty range. Your suggestion to "simply do a ++" would not be sound because of this.
Giving some context: personally I started using nlohmann::json for the very powerful combo of the sax_parser with a stream as input, giving me the ability to parse JSON and MessagePack that are bigger than the available RAM with a custom sax_parser, keeping only the data I'm actually interested in (and also without any heap allocations, mostly on MCUs). This is why I keep pushing on it; it is a real (and mostly hidden) killer feature of this library
Nothing prevents you from doing that. You can implement your own iterator to behave as you describe. Just don't break iterator semantics for everyone else because it's convenient for one use case out of many. Any C++ developer, understanding common iterator semantics, can implement your behavior on top of the current implementation without special knowledge about the implementation. Your "fix" changes semantics in a way where that'd no longer be true, as the past-the-end iterator no longer points past the end, but at the end (which is impossible in case of an empty range, creating another problem).
In summary: C++ developers understand iterators, they expect them to follow established rules. The concept of the past-the-end iterator is fundamental, don't mess with that. What you want can be done without modifying the library code. At most, we should improve documentation around the subject.
I rest my case.
The real-scenario problem is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead, (...)
You can avoid that with strict = false. See https://godbolt.org/z/Ee1Pr3Gse.
No, this doesn't fix the problem; it is not about strict, it is about the parsing that happens before the strict check.
About breaking the API, I don't think that is true: all ctest tests pass correctly and the final behaviour is the same for all common functions
It's a breaking change and I've demonstrated why. Passing the unit tests does not invalidate that assertion.
My idea is that it is better to have the input_adapter pointing to the last consumed iterator/value, leaving the door open to any stream implementation (and in your proxy_adapter you simply do a ++ to get the position to start parsing again from, or whatever), rather than closing that door; at least like this you have both options at your disposal
You're breaking expected iterator semantics. The past-the-end iterator points past the end because otherwise, we'd have no way to represent an empty range. Your suggestion to "simply do a ++" would not be sound because of this.
I never said that I want to break the past-the-end iterator, simply that the value pointed to by the iterator when parsing finishes should be the last parsed char and not the one after it
Giving some context: personally I started using nlohmann::json for the very powerful combo of the sax_parser with a stream as input (...)
Nothing prevents you from doing that. You can implement your own iterator to behave as you describe. Just don't break iterator semantics for everyone else because it's convenient for one use case out of many. (...)
Yes, the fact is that the upper layer is advancing the iterator one position ahead of the one actually used.
In summary: C++ developers understand iterators, they expect them to follow established rules. The concept of the past-the-end iterator is fundamental, don't mess with that. What you want can be done without modifying the library code. At most, we should improve documentation around the subject.
I rest my case.
I reiterate, it is not about breaking iterators; as this library itself does, I want to stick to this specification: https://en.cppreference.com/w/cpp/named_req/InputIterator
Explaining it as a simple loop where `bool parse(char)` returns `false` when the last `}` in the JSON is encountered:
- Current implementation is:

```cpp
auto it = data.begin();
while (it != data.end()) {
    auto value = *it;
    it++;
    if (!parse(value)) {
        break;
    }
}
// Now the iterator points to one item after the last actually used
```
- Proposed implementation is:

```cpp
auto it = data.begin();
while (it != data.end()) {
    auto value = *it;
    if (!parse(value)) {
        break;
    }
    it++;
}
// Now the iterator points to the last item that was actually used
```
I hope it is clear now where the problem lies.
I've looked at the discussion, and thought a lot about this, and done some research, and I'm still not sure if it's possible to "fix" this use case at this level. I looked at https://en.cppreference.com/w/cpp/iterator/istream_iterator and it seems like it should be possible for the stream iterator itself, rather than the json adapter, to handle this. However, the `EmulateStream` in the example seems to do exactly what I expect the stream iterator does.
I don't know how to reconcile the "leaving the iterator pointing at the last thing read" with "the iterator needs to be pointing to the next thing to read in order to start parsing".
On the other hand, there is no visible change as far as the caller of `iterator_input_adapter::get_character()` is concerned, and that's the only API that class has.
~There are two callers of that function. One has `unget()` support, and the other doesn't. Could the real problem be the lack of `unget()` support in `binary_reader.hpp`?~
That's JSON vs CBOR, not regular parse vs sax parse, got the two confused, never mind.
@falbrechtskirchinger

> Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.

That's what `std::istream_iterator` does too.

https://en.cppreference.com/w/cpp/iterator/istream_iterator

> std::istream_iterator is a single-pass input iterator that reads successive objects of type T from the std::basic_istream object for which it was constructed, by calling the appropriate operator>>. The actual read operation is performed when the iterator is incremented, not when it is dereferenced. The first object is read when the iterator is constructed. Dereferencing only returns a copy of the most recently read object.
The real problem scenario is not about knowing whether the stream is closed; it is that in the current input_adapter implementation one char is always read ahead. This means that if you receive one JSON document on a socket, you need the first char of the second document to finish parsing the first, and the second one will fail because it's missing its own first char.
The `unget()` support I mention would help with "the second one will fail missing its own first char" but not with "you need the first char of the second to finish parsing the first" if the read blocks waiting for the next char.

Having a newline or other whitespace after the last char of the first would actually fix both issues, but that doesn't help if you can't change the source.
> @falbrechtskirchinger
>
> > Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.
>
> That's what `std::istream_iterator` does too.
The difference being that `std::istream_iterator` is designed to be an adapter between streams and iterators. `iterator_input_adapter` is an adapter between the parser interface (i.e., `get_character()`) and iterators (i.e., NOT streams).

If you want a `stream_like_input_adapter`, build one, and let's add that. I've no objections to a solution that doesn't break `iterator_input_adapter`'s current behavior.
@falbrechtskirchinger I just went looking through the discussions in this PR for it and couldn't find it. Can you remind me again which case is broken with this change?
> @falbrechtskirchinger I just went looking through the discussions in this PR for it and couldn't find it. Can you remind me again which case is broken with this change?
See #3548.
Thanks, that makes sense. I hadn't been considering the underlying iterator surviving beyond the adapter.
> @falbrechtskirchinger
>
> > Because your implementation reads on the advance, you'll inevitably run into problems. You're violating the expected semantics and complain that the result isn't what you want.
>
> That's what `std::istream_iterator` does too.
Another thought: Combining an `EmulateStream` that reads on advance with the `proxy_iterator` from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.
Once again, there are several (external) solutions to this problem. Ideally, we should work towards making `input_adapter` part of the public API and allow users to pass an `input_adapter` to `parse()` directly.
In some way I cannot understand why we should preserve an internal behaviour that is not defined by the API; on the other side, it is possible that people are relying on that hidden/implicit behaviour.
What I'm proposing now is to split `input_adapter` like this, with the use of some SFINAE and/or `std::iterator_traits`:

- Keep the current implementation of `input_adapter` but mark it as deprecated
- If `user_iterator` satisfies the forward-iterator category, use the current implementation
- If `user_iterator` is an input-only iterator like `std::istream_iterator`, use the proposed implementation

What do you think about it? @gregmarr @falbrechtskirchinger
> Combining an EmulateStream that reads on advance with the proxy_iterator from https://github.com/nlohmann/json/pull/3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.
I took a `stringstream`, created an `istream_iterator` on it, wrapped it in a minor modification to the `proxy_iterator` that takes the iterator type instead of the container type, and that works. If `sax_parse` returned the iterator at the end, that would eliminate the need for the `proxy_iterator`. Since it doesn't, you need the disposable wrapper so you can keep the `istream_iterator` that potentially contains the next character around to use it for the next `sax_parse`.

https://godbolt.org/z/M5cYfbcbW
> > Combining an EmulateStream that reads on advance with the proxy_iterator from #3548 should also work. The problem is that since normally you lose access to the iterator, the character that has already been read from the underlying stream is lost.
>
> I took a `stringstream`, created an `istream_iterator` on it, wrapped it in a minor modification to the `proxy_iterator` that takes the iterator type instead of the container type, and that works. If `sax_parse` returned the iterator at the end, that would eliminate the need for the `proxy_iterator`. Since it doesn't, you need the disposable wrapper so you can keep the `istream_iterator` that potentially contains the next character around to use it for the next `sax_parse`.
>
> https://godbolt.org/z/M5cYfbcbW
Got it, but now you always need to keep the `istream_iterator` around in scope, otherwise when it gets destructed you lose a char.

See https://godbolt.org/z/1f9sGW6cE (btw, funny that losing one char makes the JSON still valid but with a different meaning)
I'm still more convinced that what I proposed previously could be a good solution covering all scenarios:
> In some way I cannot understand why we should preserve an internal behaviour that is not defined by the API; on the other side, it is possible that people are relying on that hidden/implicit behaviour.
>
> What I'm proposing now is to split `input_adapter` like this, with the use of some SFINAE and/or `std::iterator_traits`:
>
> - Keep the current implementation of `input_adapter` but mark it as deprecated
> - If `user_iterator` satisfies the forward-iterator category, use the current implementation
> - If `user_iterator` is an input-only iterator like `std::istream_iterator`, use the proposed implementation
>
> What do you think about it? @gregmarr @falbrechtskirchinger
How should we proceed here?
As is, this PR would conflict with #3548. Unless @imwhocodes wants to try another solution, this can be closed.