html5ever

Does it supply line&column numbers for the parsed tokens?

Open hoijui opened this issue 2 years ago • 7 comments

I searched the sources and found _line_number in a few places, but overall I had the impression that this info is not available to client code. Am I wrong?

hoijui avatar Jan 17 '23 17:01 hoijui

Changes in line numbers are available to client code in the tree builder (https://github.com/servo/html5ever/blob/98d3c0cd01471af997cd60849a38da45a9414dfd/markup5ever/interface/tree_builder.rs#L237-L238). We haven't had a reason to expose column number data in Servo so far, so we didn't bother looking into it.

jdm avatar Jan 18 '23 05:01 jdm

Similarly, the tokenizer receives a line number with each token: https://github.com/servo/html5ever/blob/98d3c0cd01471af997cd60849a38da45a9414dfd/html5ever/src/tokenizer/interface.rs#L97-L98
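As a rough illustration of what a line number per token allows, here is a minimal sketch of a sink that records the line of every link it sees. The `Token` enum, the `LineAwareSink` trait and `collect_links` are hypothetical stand-ins invented for this example. They only mirror the general shape (a token plus a line number per call) and are not html5ever's actual types or signatures.

```rust
// Stand-in types for illustration only: NOT html5ever's real API.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    StartTag { name: String, attrs: Vec<(String, String)> },
    Text(String),
}

/// Hypothetical sink trait: each token arrives together with the current line.
trait LineAwareSink {
    fn process_token(&mut self, token: Token, line_number: u64);
}

/// Collects (line, href) pairs for every <a href="..."> start tag.
#[derive(Default)]
struct LinkCollector {
    links: Vec<(u64, String)>,
}

impl LineAwareSink for LinkCollector {
    fn process_token(&mut self, token: Token, line_number: u64) {
        // Only start tags named "a" are interesting; everything else is ignored.
        if let Token::StartTag { name, attrs } = token {
            if name == "a" {
                for (key, value) in attrs {
                    if key == "href" {
                        self.links.push((line_number, value));
                    }
                }
            }
        }
    }
}

/// Feeds a pre-tokenized stream into the collector and returns the links found.
fn collect_links(tokens: Vec<(Token, u64)>) -> Vec<(u64, String)> {
    let mut sink = LinkCollector::default();
    for (token, line) in tokens {
        sink.process_token(token, line);
    }
    sink.links
}
```

This yields the line only. Pinpointing the link within a long line would still need column or byte-offset data.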

jdm avatar Jan 18 '23 05:01 jdm

Thank you @jdm! :-) I am working on some code that checks links in documents and tells the user which ones are valid and which are not (anymore). For this, I have to be able to tell the user where exactly these links are in the document, so they can fix them. I am currently using a very shady, ueber-simple, self-made HTML parser, because none of the libraries for HTML parsing seem to supply line&column info. I understand that it makes no sense to track these for each little detail in 99% of the use-cases for these libraries, so I am not suggesting to add this. I would be glad for some hints about how to go about this. Will I need to maintain a fork of one of these libraries (e.g. html5ever)?

hoijui avatar Jan 18 '23 09:01 hoijui

Yeah the line number on its own is kind of useless for certain applications. For my own project I'm having to resort to https://github.com/y21 just to get the exact byte positions of each DOM node.

Positions for DOM nodes were also recently added to JSoup and also seem to be available in HTML parsers in other major languages, so I think it would make sense if we could figure out a way for html5ever to provide the same. There have also been several issues over the years asking for similar features.

One thing that I was trying to make work, but couldn't quite yet, is to provide a byte stream that I can read the offset from as tokens are emitted from html5ever. However, since tokens are actually consumed ahead of time, it doesn't quite give the right positions. This could maybe be fixed by providing something that's Peekable, but tbh. I didn't really like the direction anyways.

Are there any better ideas of how this could potentially be added in such a way that it's an opt-in performance penalty?
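For what it's worth, once a byte offset is known, turning it into a line and column is the easy part. Below is a minimal sketch, assuming the offsets refer to the original UTF-8 source exactly as it was fed to the parser. `LineIndex` and `line_col` are names made up for this example, and the column counts bytes rather than characters.

```rust
/// Maps byte offsets into a source string to 1-based (line, column) pairs.
struct LineIndex {
    /// Byte offset at which each line starts; line_starts[0] is always 0.
    line_starts: Vec<usize>,
}

impl LineIndex {
    /// Scans the source once and remembers where every line begins.
    fn new(source: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in source.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    /// Returns (line, column), both 1-based; the column is a byte count.
    fn line_col(&self, offset: usize) -> (usize, usize) {
        // partition_point returns how many line starts are <= offset,
        // which is exactly the 1-based line number (binary search).
        let line = self.line_starts.partition_point(|&start| start <= offset);
        let col = offset - self.line_starts[line - 1] + 1;
        (line, col)
    }
}
```

The index is built once per document, so the lookup cost is only paid by callers who actually ask for positions. That would fit an opt-in design, though it does not solve the harder problem of getting correct offsets out of the parser in the first place.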

RXminuS avatar Jul 28 '23 11:07 RXminuS

hey @RXminuS :-) ... you resorted to https://github.com/y21/tl? why is it not optimal?

hoijui avatar Jul 28 '23 18:07 hoijui

It's not actively maintained, and you need to do some hacky things such as replacing script/style/noscript content, otherwise the ranges will be off since it still matches on those tokens inside (e.g. no state switching).

RXminuS avatar Jul 28 '23 19:07 RXminuS

For anyone else running into this problem, in https://github.com/whatwg/html-build/pull/291 I'm creating a RcDomWithLineNumbers which overrides the two methods necessary to at least track line numbers in the errors recorded. I'm very much a Rust beginner so it's just kind of been a process of flailing around until I got something working, and the fact that Rust makes you delegate all methods of TreeSink just to override set_current_line (to record the current line) and parse_error (to augment the recorded error with the current line) seems bonkers. But it seems to work so far.
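To show the shape of that delegation pattern, here is a stripped-down sketch. The `Sink` trait below is a hypothetical three-method stand-in, not html5ever's real TreeSink (which has many more methods and different signatures), and `PlainSink` and `WithLineNumbers` are invented names, not the code from the linked PR.

```rust
/// Stand-in for a tree sink trait; NOT html5ever's real `TreeSink`.
trait Sink {
    fn append_text(&mut self, text: &str);
    fn parse_error(&mut self, msg: String);
    fn set_current_line(&mut self, line: u64);
}

/// A plain sink that records text and errors but ignores line numbers.
#[derive(Default)]
struct PlainSink {
    text: String,
    errors: Vec<String>,
}

impl Sink for PlainSink {
    fn append_text(&mut self, text: &str) {
        self.text.push_str(text);
    }
    fn parse_error(&mut self, msg: String) {
        self.errors.push(msg);
    }
    fn set_current_line(&mut self, _line: u64) {}
}

/// Wrapper that remembers the current line and prefixes it onto each error.
struct WithLineNumbers<S: Sink> {
    inner: S,
    current_line: u64,
}

impl<S: Sink> WithLineNumbers<S> {
    fn new(inner: S) -> Self {
        WithLineNumbers { inner, current_line: 1 }
    }
}

impl<S: Sink> Sink for WithLineNumbers<S> {
    // Pure delegation: there is no inheritance, so this is forwarded by hand.
    fn append_text(&mut self, text: &str) {
        self.inner.append_text(text);
    }

    // Overridden: augment the error with the line recorded most recently.
    fn parse_error(&mut self, msg: String) {
        self.inner
            .parse_error(format!("line {}: {}", self.current_line, msg));
    }

    // Overridden: record the line, then still pass it on to the inner sink.
    fn set_current_line(&mut self, line: u64) {
        self.current_line = line;
        self.inner.set_current_line(line);
    }
}
```

With only three methods the forwarding is cheap to write. The pain described above comes from having to repeat it for every method of a much larger trait.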

Column numbers, of course, are not so easy.

domenic avatar Feb 15 '24 06:02 domenic