html5ever
Does it supply line&column numbers for the parsed tokens?
I searched the sources and found _line_numer in a few places, but overall I had the impression that this info is not available to client code. Am I wrong?
Changes in line numbers are available to client code in the tree builder (https://github.com/servo/html5ever/blob/98d3c0cd01471af997cd60849a38da45a9414dfd/markup5ever/interface/tree_builder.rs#L237-L238). We haven't had a reason to expose column number data in Servo so far, so we didn't bother looking into it.
Similarly, the tokenizer receives a line number with each token: https://github.com/servo/html5ever/blob/98d3c0cd01471af997cd60849a38da45a9414dfd/html5ever/src/tokenizer/interface.rs#L97-L98
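To make the "line number with each token" idea concrete for the link-checking case, here is a minimal sketch. It does not use html5ever itself: MiniToken and LinkCollector are hypothetical stand-ins that only mirror the shape of a sink receiving a token together with a line number. The real trait and token types are at the links above and may differ between versions.

```rust
/// Hypothetical, simplified token type; html5ever's real token type is richer.
pub enum MiniToken {
    StartTag { name: String, attrs: Vec<(String, String)> },
    Other,
}

/// Collects (line_number, href) for every <a href=...> start tag it sees.
#[derive(Default)]
pub struct LinkCollector {
    pub links: Vec<(u64, String)>,
}

impl LinkCollector {
    /// Assumed shape: each token arrives together with the current line number.
    pub fn process_token(&mut self, token: MiniToken, line_number: u64) {
        if let MiniToken::StartTag { name, attrs } = token {
            if name == "a" {
                for (key, value) in attrs {
                    if key == "href" {
                        // Remember where the link was seen so it can be reported.
                        self.links.push((line_number, value));
                    }
                }
            }
        }
    }
}
```

This only gives line granularity, which matches what the tokenizer interface provides today; column information would still have to come from somewhere else.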
thank you @jdm ! :-) I am working on some code that checks links in documents, and tells the user which ones are valid and which not (anymore). For this, I have to be able to tell the user where exactly these links are in the document, so they can fix them. I am currently using some very shady, ueber-simple, self-made HTML parser, because none of the libraries for HTML parsing seem to supply line&column info. I understand, it makes no sense to track these for each little detail, in 99% of use-cases for these libraries, so I am not suggesting to add this. Would be glad for some hints about how to go about this. Will I need to maintain a fork of one of these libraries (eg. html5ever)?
Yeah the line number on its own is kind of useless for certain applications. For my own project I'm having to resort to https://github.com/y21 just to get the exact byte positions of each DOM node.
Positions for DOM nodes were also recently added to JSoup and also seem to be available in HTML parsers in other major languages, so I think it would make sense if we could figure out a way for html5ever to provide the same. There have also been several issues over the years asking for similar features.
One thing that I was trying to make work but couldn't quite yet is to provide a byte stream that I can read the offset from as tokens are emitted from html5ever, however since tokens are actually consumed ahead of time it doesn't quite give the right positions. This could maybe be fixed by providing something that's Peekable, but tbh. I didn't really like the direction anyways.
Are there any better ideas of how this could potentially be added in such a way that it's an opt-in performance penalty?
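Once byte positions are available from whatever parser is used, turning them into line and column numbers can be done outside the parser, so only callers who want it pay for it. A minimal sketch, assuming the whole source is in memory as a &str and that columns are counted in characters; LineIndex is a hypothetical helper, not part of html5ever or tl:

```rust
/// Maps byte offsets in a source string to 1-based (line, column) pairs.
/// Hypothetical helper for illustration only.
pub struct LineIndex {
    /// Byte offset at which each line starts; the first entry is always 0.
    line_starts: Vec<usize>,
}

impl LineIndex {
    /// Scans the source once and records where every line begins.
    pub fn new(source: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in source.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    /// Returns (line, column), both 1-based; the column counts chars, not bytes.
    /// Returns None if the offset is out of range or not on a char boundary.
    pub fn line_col(&self, source: &str, offset: usize) -> Option<(usize, usize)> {
        if offset > source.len() || !source.is_char_boundary(offset) {
            return None;
        }
        // Find the last line start that is <= offset.
        let line = match self.line_starts.binary_search(&offset) {
            Ok(i) => i,
            Err(i) => i - 1, // never underflows: line_starts[0] == 0
        };
        let start = self.line_starts[line];
        let column = source[start..offset].chars().count() + 1;
        Some((line + 1, column))
    }
}
```

The index is built once per document and each lookup is a binary search, so the cost is only incurred by code that asks for line and column information.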
hey @RXminuS :-) ... you resorted to https://github.com/y21/tl? why is it not optimal?
It's not actively maintained, and you need to do some hacky things such as replacing script/style/noscript content, otherwise the ranges will be off, since it still matches on tokens inside those elements (i.e. no state switching).
For anyone else running into this problem, in https://github.com/whatwg/html-build/pull/291 I'm creating a RcDomWithLineNumbers which overrides the two methods necessary to at least track line numbers in the errors recorded. I'm very much a Rust beginner so it's just kind of been a process of flailing around until I got something working, and the fact that Rust makes you delegate all methods of TreeSink just to override set_current_line (to record the current line) and parse_error (to augment the recorded error with the current line) seems bonkers. But it seems to work so far.
Column numbers, of course, are not so easy.
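For readers who want to see the shape of that wrapper, here is a rough sketch of the delegation pattern. It is not the code from the pull request and does not use the real TreeSink: MiniSink is a hypothetical cut-down trait with three methods, and PlainSink stands in for an inner sink such as RcDom. The real trait has many more methods, all of which would need the same pass-through treatment as append_text below.

```rust
use std::borrow::Cow;

/// Hypothetical cut-down stand-in for a tree sink trait.
pub trait MiniSink {
    fn set_current_line(&mut self, line_number: u64);
    fn parse_error(&mut self, msg: Cow<'static, str>);
    fn append_text(&mut self, text: &str);
}

/// Stand-in for the inner sink: records errors without any position info.
#[derive(Default)]
pub struct PlainSink {
    pub errors: Vec<Cow<'static, str>>,
    pub text: String,
}

impl MiniSink for PlainSink {
    fn set_current_line(&mut self, _line_number: u64) {}
    fn parse_error(&mut self, msg: Cow<'static, str>) {
        self.errors.push(msg);
    }
    fn append_text(&mut self, text: &str) {
        self.text.push_str(text);
    }
}

/// Wrapper that remembers the current line and prefixes it to each error.
pub struct WithLineNumbers<S> {
    inner: S,
    current_line: u64,
}

impl<S> WithLineNumbers<S> {
    pub fn new(inner: S) -> Self {
        WithLineNumbers { inner, current_line: 1 }
    }
    pub fn into_inner(self) -> S {
        self.inner
    }
}

impl<S: MiniSink> MiniSink for WithLineNumbers<S> {
    // Overridden: record the line, then pass it on.
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
        self.inner.set_current_line(line_number);
    }
    // Overridden: augment the message with the recorded line.
    fn parse_error(&mut self, msg: Cow<'static, str>) {
        let msg = format!("line {}: {}", self.current_line, msg);
        self.inner.parse_error(Cow::Owned(msg));
    }
    // Pure delegation; every other method would look like this.
    fn append_text(&mut self, text: &str) {
        self.inner.append_text(text);
    }
}
```

Rust has no built-in way to forward a whole trait implementation to a field, which is why every non-overridden method has to be written out by hand.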