tools icon indicating copy to clipboard operation
tools copied to clipboard

wip: lint: Add line numbers to error messages

Open neilpa opened this issue 6 months ago • 6 comments

See #837

This is very much a work-in-progress, but I wanted to share so you could see how it's shaping up. I've structured it so that line numbers can be added incrementally to the different lint checks. And so it's easy to grep for what is/isn't updated by grepping for LintMessage and checking for presence/absence of LintSubmessage.

It works with both the xpath and regex approaches for detecting errors. For the latter, I've added a new XhtmlSourceFile class that manages the line splitting and regex searching per-line. I had also intended to expose the xpath helpers on that as well, but I'm less sure and may generalize for the CSS source files.

I've only updated the tabular output for now, I'm leaving the plain output alone for now. I want the unit tests to continue working as-is while I work through these changes.

Here's an example of what it looks like for now. There's probably a nicer way to format it (and sort by line number)

Screenshot 2025-06-17 at 12 34 46 PM

Remaining errors with sub-messages that likely need line numbers

CSS

  • [ ] c-001
  • [ ] c-002
  • [ ] c-005
  • [ ] c-006
  • [ ] c-008
  • [ ] c-009

METADATA

  • [ ] m-010
  • [ ] m-012
  • [ ] m-013
  • [ ] m-014
  • [ ] m-018
  • [ ] m-020
  • [ ] m-024
  • [ ] m-044
  • [ ] m-048
  • [ ] m-050
  • [x] ~~m-051~~ Missing metadata elements don't have a line number to reference
  • [ ] m-052
  • [ ] m-058
  • [ ] m-070

SEMANTICS & CONTENT

  • [x] s-001
  • [x] s-004
  • [ ] s-006
  • [x] s-029
  • [x] s-031
  • [ ] s-032
  • [ ] s-034
  • [ ] s-053
  • [x] s-085
  • [ ] s-087
  • [x] s-103

TYPOGRAPHY

  • [ ] t-003
  • [x] t-004
  • [x] t-005
  • [ ] t-008
  • [ ] t-010
  • [x] t-017
  • [x] t-019
  • [ ] t-021
  • [x] t-025
  • [x] t-026
  • [x] t-028
  • [ ] t-029
  • [ ] t-041
  • [ ] t-047
  • [ ] t-049
  • [ ] t-058
  • [ ] t-063
  • [ ] t-064
  • [ ] t-076
  • [ ] t-077

XHTML

  • [ ] x-003
  • [ ] x-013
  • [ ] x-014
  • [ ] x-017
  • [ ] x-018
  • [ ] x-019

TYPOS

  • [ ] y-001
  • [ ] y-006
  • [x] y-007
  • [ ] y-015
  • [ ] y-019
  • [ ] y-030

Errors that don't use sub-messages currently but should now that line numbers can reference specific errors

CSS

  • [ ] c-004 - border colors
  • [ ] c-007 - hyphens
  • [ ] c-016 - text align left => initial
  • [ ] c-022 - rem => em
  • [ ] c-023 - illegal font-size unit
  • [ ] c-024 - illegal line-height unit
  • [ ] c-025 - illegal percantage
  • [ ] c-027 - font size < 1

METADATA

  • [ ] m-008
  • [ ] m-009
  • [ ] m-011
  • [ ] m-015 - make the LXML exception the submessage
  • [ ] m-016
  • [ ] m-017
  • [ ] m-023
  • [ ] s-036 - link to the body tag
  • [ ] s-037 - link to the body tag
  • [ ] m-038
  • [ ] m-039
  • [ ] m-046 - link to the ignore tag
  • [ ] m-047 - link to the path=* elements
  • [ ] m-053 - link to the first
  • [ ] m-055
  • [ ] m-056
  • [ ] m-057
  • [ ] m-064
  • [ ] m-065
  • [ ] m-066
  • [ ] m-067
  • [ ] m-069

SEMANTICS & CONTENT

  • [ ] s-005
  • [ ] s-012
  • [ ] s-013
  • [ ] s-021 - find and link the existing title
  • [ ] s-023
  • [ ] s-025
  • [ ] s-030
  • [ ] s-042 - link to offending tables
  • [ ] s-055
  • [ ] s-069
  • [ ] s-075
  • [ ] s-076
  • [ ] s-096

TYPOGRAPHY

  • [x] t-002
  • [ ] t-046
  • [ ] t-055
  • [ ] t-072
  • [ ] t-073

XHTML

  • [ ] x-001
  • [ ] x-004
  • [ ] x-005
  • [ ] x-006
  • [x] x-008

neilpa avatar Jun 17 '25 19:06 neilpa

I'm making good progress getting line numbers attached to lint errors. I've converted 197 of the 259 errors that use sub-messages to reference the specific problem in the source text. This covers most of what I'd classify as the "easy" fixes that fell into a few different find/replace patterns. The remaining 62 errors (checklist in first message) will need varying degrees of additional changes to make work. The trickiest will be the errors that mutate the raw text before running regex checks.

The other ~110 errors aren't using sub-messages. Some could be improved to point to the specific instances, like x-008. Others really are global and don't need additional line number context, like the filesystem f-### errors. My preference would be to leave these errors as is and make any improvements in a follow-up PR (none are in the checklist).

neilpa avatar Jun 18 '25 22:06 neilpa

Great work!

A few preliminary notes:

  • Do we need the XhtmlSourceFile class? It looks like we only instantiate it once, and use one of its functions, search, only twice.
  • In XhtmlSourceFile.search, is performance affected by running a regex once per line line, instead of once per file?
  • Please go ahead and make direct improvements like your suggestion for x-008.
  • We should try to make this a complete PR and not put off other work for later.
  • For c-001 it might be worth checking if lxml now supports those selectors, as much of this code is pretty old. Or maybe we can use cssselect in se build? If there's potential then this would be a separate issue/PR, but then we could remove c-001.
  • I think we should rename the LintError class to LintSubmessage because that's what it is/what it is replacing.
  • We can hyperlink URLs in the console output using the Rich library, but can we hyperlink URLs to open a specific line number? Is there a URL scheme for that, maybe something like file:///path/to/file.xhtml:11-14?

Very good progress, I'm excited to see this become a reality.

acabal avatar Jun 19 '25 16:06 acabal

  • Do we need the XhtmlSourceFile class? It looks like we only instantiate it once, and use one of its functions, search, only twice.

Yes. I have more local changes where it's used much more extensively. I'm going to push it up the stack so it's instantiated once per file and replace the trio of (filename, dom, file_contents) arguments passed to the _lint_* functions. I also plan to encapsulate the helpers necessary to mutate text before running checks while maintaining line numbers.

  • In XhtmlSourceFile.search, is performance affected by running a regex once per line line, instead of once per file?

I'm not sure yet, I plan to profile these changes once I'm closer to done. I haven't noticed any differences running the lint tests, but those aren't representative as most of time is spinning up new processes and test overhead.

It's possible searching a pre-compiled regex many times over small input is about the same as searching once over much larger input. The performance will also be sensitive to the specific regex and input strings. If it's a problem, the alternative is create an index of byte offsets to line numbers and use the regex match offsets to lookup line numbers.

  • Please go ahead and make direct improvements like your suggestion for x-008.
  • We should try to make this a complete PR and not put off other work for later.

Okay. I wasn't sure if you wanted to do a few incremental PRs or one big one. I'll scan through the remaining errors and add the relevant ones to the checklist above.

  • For c-001 it might be worth checking if lxml now supports those selectors, as much of this code is pretty old. Or maybe we can use cssselect in se build? If there's potential then this would be a separate issue/PR, but then we could remove c-001.

Okay, I haven't looked closely at any of the CSS rules yet.

  • I think we should rename the LintError class to LintSubmessage because that's what it is/what it is replacing.

Done.

  • We can hyperlink URLs in the console output using the Rich library, but can we hyperlink URLs to open a specific line number? Is there a URL scheme for that, maybe something like file:///path/to/file.xhtml:11-14?

I'm not sure, my cursory searching suggests the format varies across text editors. Also, at least for VSCode, the existing hyperlinks don't work well with the integrated terminal. Instead of opening inside the current window and editor context it launches a new project window with the file.

neilpa avatar Jun 19 '25 21:06 neilpa

I'm not sure, my cursory searching suggests the format varies across text editors. Also, at least for VSCode, the existing hyperlinks don't work well with the integrated terminal. Instead of opening inside the current window and editor context it launches a new project window with the file.

Looks like there's an open VSCode issue about line numbers in URIs: https://github.com/microsoft/vscode/issues/586 Might be worth checking other popular editors like Sublime, etc.

acabal avatar Jun 19 '25 21:06 acabal

I'll scan through the remaining errors and add the relevant ones to the checklist above.

I've gone through these, about half of the errors would benefit from adding sub-messages and line numbers. I've added these to the checklist and moved it into the main PR message so it's easier to track what's left. It's definitely going to take some time to chip away at what's left.

neilpa avatar Jun 20 '25 17:06 neilpa

Also note that we'll have to update the tests too.

acabal avatar Jun 20 '25 18:06 acabal

I just pushed a commit that replaces almost all of the remaining regex-based tests with xpath tests, so that should make it easier to include line numbers. There are a few regex-based tests remaining (searching for unfilled SE variables in boilerplate files) for which it seems impractical to do an xpath because a basic regex will be much faster.

acabal avatar Jul 02 '25 14:07 acabal

I've been on vacation so haven't made progress in a while, but should have some time to revisit later this week.

I'd actually figured out a way to handle most of the regex cases, including those that further mutate the source text, but hadn't pushed those changes before I left. It reuses the same trick I implemented for tracking line numbers across comment removal.

The xpath queries that target raw text nodes, rather than actual DOM nodes, are the remaining hard cases. The former simply return the raw strings, not a node object that includes sourceline. It should be possible to update those queries to be in terms of the parent node instead. However, some of the queries are quite complex and this is my first meaningful use of xpath.

neilpa avatar Jul 06 '25 13:07 neilpa

The xpath queries that target raw text nodes, rather than actual DOM nodes, are the remaining hard cases. The former simply return the raw strings, not a node object that includes sourceline. It should be possible to update those queries to be in terms of the parent node instead. However, some of the queries are quite complex and this is my first meaningful use of xpath.

Yes it should be possible to convert them to element matches and not text matches. If you give me some examples of ones that you're looking at, I can take a look. Much of lint is covered by tests so you can experiment with changing xpaths and as long as the tests continue to pass, you should be OK.

acabal avatar Jul 06 '25 17:07 acabal

This is one of the more complex ones I recall seeing

https://github.com/standardebooks/tools/blob/7c3c202dcb6b1029474d5cfafd0cb0fba010a34b/se/se_epub_lint.py#L1454-L1457

I think that would need to query for the p directly with something like .../p/[contains(text(), '\n') and normalize-space(text())] but I'm not sure how the second half of the query would need to be updated

neilpa avatar Jul 06 '25 19:07 neilpa

If there are two bracket expressions in a row, I think they can just be replaced with and.

. is a shorthand for text() so you wouldn't have to change that. Otherwise I think you're on the right track.

acabal avatar Jul 06 '25 20:07 acabal

How is this coming along?

acabal avatar Jul 17 '25 21:07 acabal

I’m still chipping away at it as time allows, just don’t have a lot of that currently. If you want to take what’s here and finish it, I won’t be offended. Otherwise, it’ll likely be a few more weeks to work through what’s left.

neilpa avatar Jul 18 '25 03:07 neilpa

I'll leave it to you, just wanted to make sure there is still progress happening for this valuable feature. Lint is updated often, so the longer this takes, the more annoying merge conflicts there will be.

acabal avatar Jul 18 '25 13:07 acabal

If you can get the core functionality completely finished, then we could merge that in and ask for people to help to apply it to the individual messages.

acabal avatar Jul 18 '25 13:07 acabal

I finally had some time to revisit this branch. I rebased and fixed up all the merge conflicts, which was easier than I anticipated

I would say this is ready for review. It has all the core functionality and has line numbers for the majority of errors. It includes the improved regex helpers for tracking line numbers across mutations that I hadn't previously pushed. I haven't tried to cleanup the commit history in the branch since it looks like you normally squash merge.

I still need to update the "what's left" list of errors that could still use line numbers. Figured I'd open an issue with the list that also reflects the recent shuffling of some of the error codes.

neilpa avatar Aug 21 '25 22:08 neilpa

Excellent work, thanks! I think this all looks good and I'll squash merge it in.

Please go ahead and update the list of remaining lint errors. Note that the master list of lint errors has probably changed since you started this project.

I'll use this locally for the next few weeks to shake it out, and if everything looks good we can cut a release sometime around mid-Sept.

acabal avatar Aug 22 '25 18:08 acabal

I've completed a minor refactor, moving the SourceFile class into the se_epub_lint.py file since that's the only place that it's used, and moved some of the helper functions into private class methods.

I've also added line numbers for:

  • m-020
  • t-076
  • t-077
  • y-006

acabal avatar Aug 22 '25 19:08 acabal

Filed #858. I cross-checked my list from above against the code after your updates so it should be up to date.

neilpa avatar Aug 22 '25 21:08 neilpa

Great, thanks!

acabal avatar Aug 22 '25 21:08 acabal