quick-xml
quick-xml copied to clipboard
A way to get the row/column of a Reader
xml-rs provides TextPosition
for this, while quick-xml only has Reader::buffer_position()
to get the byte position. It would be useful for error reporting if there was a way to compute the row and column from this (preferably without overhead if unused).
This could be implemented somewhere along these lines:
impl<B: Seek + BufRead> Reader<B> {
fn row_col(&self) -> (usize, usize) {
// seek to 0, count lines until `buffer_position` is reached
}
}
(add a proper return type and a better name to taste)
Of course, this is pretty slow as it re-reads the entire content.
This looks like a good solution. With probably a function renaming to show that this is very expensive.
Out of curiousity, @jonas-schievink do you plan on doing a PR?
Not right now, no. If anyone wants to implement this, feel free!
I implemented a PositionRead
wrapper type that builds a table for mapping byte offsets to line:column positions transparently, then added some extra methods to Reader
for doing the lookup. That way, you don't pay anything if you don't use it.
That works fine; the problem is that buffer_position
(which I was using to do the lookup) is rubbish for this. Depending on the exact state of the parser, it can end up pointing to before the event, at the start of the event, or even inside of it. That's to say nothing of errors where you just don't have a byte offset to do lookup on.
As such, I've had to abandon quick-xml
for my current project. I need to be able to report back to users on where problems are, and I don't have time to do open-heart surgery on quick-xml
to improve offset tracking.
I've uploaded my work-in-progress if anyone has a use for it.
Thanks for the report.
The buffer_position
may indeed need some rework. Do you have some particular tests that are failing at hand?
test_multiline_pos
in tests/unit_tests.rs
was where I more or less threw up my hands. Before the second event, the Reader
reports a buffer position of 10
, which puts it after the opening <
of the <?xml ..
.
The test_start_end_comment_pos
test also has a fudge in it, where the position of the third event should be 1:27, but gets reported as 1:26 because the parser is still at the start of the whitespace before the <a />
element.
At a bare minimum, I think Reader
would have to start tracking "last event position" separately from buffer (parsing) position. Maybe also use it for the position of errors. To do it properly, it would also have to start attaching correct positions to things that don't get distinct events (i.e. if there's an error in an attribute value, where is it?), but at that point, we're talking possibly significant API changes.
Great. Thank you
@DanielKeep I have created #123 which fixes an offset by one for buffer_position
. On your branch, the 2 basic pos tests are passing but not the multiline_pos. I have tried to fix it but I think the errors come either from your test or from your changes.
I understand that you may not have time to spend on it but in case, I'll let it open for the moment and merge it later.
Thanks again for your feedback!
EDIT: All your comments about a proper way to manage buffer position are still valid. This is just a fix for the current situation. I'll try to find a way to explicitly manage positions (probably before managing row/columns).
Any progress on this?
I've begun work on something similar, which can possibly be used as a building block for this - adding Span
objects to the different quick_xml::event::Bytes*
structs to get where they are located in the underlying data source, if applicable. This might be usable as a base to compute the line and column of that event from.
That sounds like a decent improvement. Separately on the question of calculating a line + column, it may not be terribly expensive to do SIMD-accelerated passes over the buffer continuously to keep track of how many lines have been seen previously. That would enable calculating the line + column without needing to "Seek" or have the entire document buffered, as you could start counting from the beginning of the current buffer and add it to what has been seen previously.
One issue to look out for is encodings. Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,
Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing Span
s in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.
Byte positions may not perfectly match up depending on how they are calculated in relation to decoding of the contents,
I do not think that this will be a problem because although we work with bytes, we ensure that our boundaries are correct (=not within a character). I think, that users of spans in any case will expect that this numbers are indexes in the original byte stream.
Actually, I have a prototype implementation for several mouths in my private branch which I've pushed right now. I've used approach with a feature that enables storing
Span
s in events. I would like to benchmark it to see if feature flag is needed at all. Also more tests needed and implementation for attributes.
To clarify, does this mean that my work in #552 is unnecessary? Or would you like me to continue with it?
Yes, as I can see, the parts that you've done in #552 I already implemented and I have feelings that they are worked right (but I didn't create tests for that yet, that's the one reason why I didn't make a PR).
I think, it would be better to use my approach, namely that parts:
-
span
is used inPartialEq
implementation but ignored in tests where it is not important -
Option
not needed, span0..0
is used instead (this should decrease size ofspan
field, and because we can expect that there would be many spans it is better to use as little space as we can)
You are free to finish my work if you like (you can rebase your branch over mine and update your PR if you wish). I'm not ready to finish it myself for now.
I do not insist on using feature-flag, I introduced it because spans can occupy some space and if you does not use them it is just waste of memory, but maybe the savings are not so great.
So the things that's should be implemented:
- spans in attributes
- spans in errors
- tests
- benchmark it to see if we need to hide it under feature-flag
(you can rebase your branch over mine and update your PR if you wish).
How would I go about rebasing a branch on my fork over a branch on a different fork? Sorry, I'm not entirely well-versed in git usage; I've sort of been learning it as I go along.
span
is used inPartialEq
implementation but ignored in tests where it is not important
The primary reason for ignoring span
in PartialEq
was so that the addition of spans to the crate wouldn't break previously-working comparisons between Bytes*
instances that came from different locations, such as comparing one from a reader with one constructed via new()
, in transitive dependencies (if used in the dependent crate / other transitive dependencies) or even the dependent crate (if used in transitive dependencies). That it also made the tests work was a happy side-effect, but not the main intention.
If you on Windows, then you're probably using TortoiseGit, then just use RMB-click on my commit in commit log and select Rebase "parser_position_tracking" onto this...(G). The other GUI clients should have the similar ways of doing that. If your prefer command-line, then I think you already known how to find required information.
wouldn't break previously-working comparisons between
Bytes*
instances that came from different locations,
I cannot imagine a reason why someone would be need to compare two events to compare their content or at all. I think we should not worry about that. PartialEq
implemented mostly for our own purposes for tests. Of course, if you known the reason...