quick-xml A way to get the row/column of a Reader

A way to get the row/column of a Reader

Open jonas-schievink opened this issue 6 years ago • 17 comments

xml-rs provides TextPosition for this, while quick-xml only has Reader::buffer_position() to get the byte position. It would be useful for error reporting if there was a way to compute the row and column from this (preferably without overhead if unused).

Jan 23 '18 23:01 jonas-schievink

This could be implemented somewhere along these lines:

impl<B: Seek + BufRead> Reader<B> {
    fn row_col(&self) -> (usize, usize) {
        // seek to 0, count lines until `buffer_position` is reached
    }
}

(add a proper return type and a better name to taste)

Of course, this is pretty slow as it re-reads the entire content.

Jan 24 '18 21:01 jonas-schievink

This looks like a good solution. With probably a function renaming to show that this is very expensive.

Jan 30 '18 05:01 tafia

Out of curiousity, @jonas-schievink do you plan on doing a PR?

Mar 01 '18 08:03 tafia

Not right now, no. If anyone wants to implement this, feel free!

Mar 01 '18 15:03 jonas-schievink

I implemented a PositionRead wrapper type that builds a table for mapping byte offsets to line:column positions transparently, then added some extra methods to Reader for doing the lookup. That way, you don't pay anything if you don't use it.

That works fine; the problem is that buffer_position (which I was using to do the lookup) is rubbish for this. Depending on the exact state of the parser, it can end up pointing to before the event, at the start of the event, or even inside of it. That's to say nothing of errors where you just don't have a byte offset to do lookup on.

As such, I've had to abandon quick-xml for my current project. I need to be able to report back to users on where problems are, and I don't have time to do open-heart surgery on quick-xml to improve offset tracking.

I've uploaded my work-in-progress if anyone has a use for it.

Mar 16 '18 05:03 DanielKeep

Thanks for the report. The buffer_position may indeed need some rework. Do you have some particular tests that are failing at hand?

Mar 16 '18 06:03 tafia

test_multiline_pos in tests/unit_tests.rs was where I more or less threw up my hands. Before the second event, the Reader reports a buffer position of 10, which puts it after the opening < of the <?xml ...

The test_start_end_comment_pos test also has a fudge in it, where the position of the third event should be 1:27, but gets reported as 1:26 because the parser is still at the start of the whitespace before the <a /> element.

At a bare minimum, I think Reader would have to start tracking "last event position" separately from buffer (parsing) position. Maybe also use it for the position of errors. To do it properly, it would also have to start attaching correct positions to things that don't get distinct events (i.e. if there's an error in an attribute value, where is it?), but at that point, we're talking possibly significant API changes.

Mar 16 '18 06:03 DanielKeep

Great. Thank you

Mar 16 '18 06:03 tafia

@DanielKeep I have created #123 which fixes an offset by one for buffer_position. On your branch, the 2 basic pos tests are passing but not the multiline_pos. I have tried to fix it but I think the errors come either from your test or from your changes.

I understand that you may not have time to spend on it but in case, I'll let it open for the moment and merge it later.

Thanks again for your feedback!

EDIT: All your comments about a proper way to manage buffer position are still valid. This is just a fix for the current situation. I'll try to find a way to explicitly manage positions (probably before managing row/columns).

Mar 22 '18 05:03 tafia

Any progress on this?

Oct 04 '18 21:10 bbigras

I've begun work on something similar, which can possibly be used as a building block for this - adding Span objects to the different quick_xml::event::Bytes* structs to get where they are located in the underlying data source, if applicable. This might be usable as a base to compute the line and column of that event from.

Feb 11 '23 00:02 Themayu

That sounds like a decent improvement. Separately on the question of calculating a line + column, it may not be terribly expensive to do SIMD-accelerated passes over the buffer continuously to keep track of how many lines have been seen previously. That would enable calculating the line + column without needing to "Seek" or have the entire document buffered, as you could start counting from the beginning of the current buffer and add it to what has been seen previously.

One issue to look out for is encodings. Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,

Feb 11 '23 05:02 dralley

Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Spans in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.

Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,

I do not think that this will be a problem because although we work with bytes, we ensure that our boundaries are correct (=not within a character). I think, that users of spans in any case will expect that this numbers are indexes in the original byte stream.

Feb 11 '23 15:02 Mingun

Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Spans in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.

To clarify, does this mean that my work in #552 is unnecessary? Or would you like me to continue with it?

Feb 11 '23 16:02 Themayu

Yes, as I can see, the parts that you've done in #552 I already implemented and I have feelings that they are worked right (but I didn't create tests for that yet, that's the one reason why I didn't make a PR).

I think, it would be better to use my approach, namely that parts:

span is used in PartialEq implementation but ignored in tests where it is not important
Option not needed, span 0..0 is used instead (this should decrease size of span field, and because we can expect that there would be many spans it is better to use as little space as we can)

You are free to finish my work if you like (you can rebase your branch over mine and update your PR if you wish). I'm not ready to finish it myself for now.

I do not insist on using feature-flag, I introduced it because spans can occupy some space and if you does not use them it is just waste of memory, but maybe the savings are not so great.

So the things that's should be implemented:

spans in attributes
spans in errors
tests
benchmark it to see if we need to hide it under feature-flag

Feb 11 '23 17:02 Mingun

(you can rebase your branch over mine and update your PR if you wish).

How would I go about rebasing a branch on my fork over a branch on a different fork? Sorry, I'm not entirely well-versed in git usage; I've sort of been learning it as I go along.

span is used in PartialEq implementation but ignored in tests where it is not important

The primary reason for ignoring span in PartialEq was so that the addition of spans to the crate wouldn't break previously-working comparisons between Bytes* instances that came from different locations, such as comparing one from a reader with one constructed via new(), in transitive dependencies (if used in the dependent crate / other transitive dependencies) or even the dependent crate (if used in transitive dependencies). That it also made the tests work was a happy side-effect, but not the main intention.

Feb 11 '23 19:02 Themayu

If you on Windows, then you're probably using TortoiseGit, then just use RMB-click on my commit in commit log and select Rebase "parser_position_tracking" onto this...(G). The other GUI clients should have the similar ways of doing that. If your prefer command-line, then I think you already known how to find required information.

wouldn't break previously-working comparisons between Bytes* instances that came from different locations,

I cannot imagine a reason why someone would be need to compare two events to compare their content or at all. I think we should not worry about that. PartialEq implemented mostly for our own purposes for tests. Of course, if you known the reason...

Feb 11 '23 19:02 Mingun

quick-xml quick-xml copied to clipboard

A way to get the row/column of a Reader

quick-xml
quick-xml copied to clipboard