sphinx-lint Check for missing underscores in references

trafficstars

Here are the two reST traps that I find most nasty:

`Missing underscore after hyperlink <https://example.com>`

.. missing-underscore-before-target-name:

Section
=======

Relatedly, there could be a check for the equally nasty `Missing space before angle bracket<https://example.com>`_.

Feb 03 '22 03:02 jeanas

The one I find even more commonly on the PEPs, that is not caught by any tooling and often made even by experienced Sphinx users, is missing the underscore after the target name for indirect targets, i.e.

See the `homepage of the example.com website <example>`__ for more information.

.. example: https://www.example.com

This is very difficult to spot both in the source and in the output without tooling, to the point where I'm going to add another pygrep check for it, assuming it doesn't get added in sphinx-lint first.

Apr 26 '22 04:04 CAM-Gerlach

OMG! I thought I knew reST syntax, and I was totally unaware of that one. Grepping in my projects, I just saw a few broken links ... Thank you CAM!

(I used to love reST, but this comforts me in my belief that it is too quirky.)

Apr 26 '22 05:04 jeanas

Often made even by experienced Sphinx users

Thanks for helping further illustrate my point :laughing:

I had at least 2-3 other highly experienced PEP editors/Python docs team members make comments on my recent PRs asking if the _ was required or if it was a mistake, heh, and I had to explain it to them each time. Lo and behold, though, despite having previously grepped for it and been very aware of looking for it, when rechecking my own PEP 639 that I'd recently rewritten and was the subject of most of those reviews, my further grepping I still discovered and fixed multiple instances where I'd forgotten it. and there were a number of other examples on the PEPs repo.

Apr 26 '22 06:04 CAM-Gerlach

It would be possible to support indirect links like this in reStructuredText natively, I believe. For the time being adding a linting check would be useful (I am one of the aforementioned PEP editors who was caught out by the syntax...)

A

Apr 26 '22 22:04 AA-Turner

OMG! I thought I knew reST syntax, and I was totally unaware of that one. Grepping in my projects, I just saw a few broken links ... Thank you CAM!

(I used to love reST, but this comforts me in my belief that it is too quirky.)

May 17 '22 13:05 jeanas

(Sorry, this wasn't intended.)

May 17 '22 13:05 jeanas

What we have:

https://github.com/sphinx-contrib/sphinx-lint/blob/main/tests/fixtures/xfail/hyperlink-missing-underscore.rst
https://github.com/sphinx-contrib/sphinx-lint/blob/main/tests/fixtures/xfail/hyperlink-missing-space.rst

So we still miss the missing underscore before target name checker.

Oct 19 '22 06:10 JulienPalard

I'm having hard time figuring how to detect missing underscore before target name.

It's hard because missing the underscore make it a valid comment, like:

.. TODO: add a section with background and terminology?
.. ???: Maybe codetags inside packages (__init__.py files) could have
...

To avoid comments-induced false-positives it would be nice to first search for hyperlink references, BUT as references could cross document it would need to first parse the whole project, which I don't like (feels out-of-scope for sphinx-lint).

Also I bet Sphinx do whines if an hyperlink reference points to no hyperlink target. I also bet that if the target is missing the leading _ it's treated as a comment by Sphinx. So Sphinx already spot this issue, (but don't report it in the right place, it'll whine about the refernece not having a target, not about the target being malformed).

Just in case someone want to play with it, I doodled around with the following code:

diff --git a/sphinxlint.py b/sphinxlint.py
index b4843e6..ea171bc 100755
--- a/sphinxlint.py
+++ b/sphinxlint.py
@@ -248,18 +248,30 @@ def _clean_heuristic(paragraph, regex):
         paragraph = paragraph[: candidate.start()] + paragraph[candidate.end() :]
 
 
-def clean_paragraph(paragraph):
+def clean_paragraph(
+    paragraph,
+    keep_inline_literals=False,
+    keep_inline_internal_targets=False,
+    keep_hyperlink_references=False,
+    keep_anonymous_hyperlink_references=False,
+    keep_roles=False,
+):
     """Removes all good constructs, so detectors can focus on bad ones.
 
     It removes all well formed inline literals, inline internal
     targets, and roles.
     """
     paragraph = escape2null(paragraph)
-    paragraph = _clean_heuristic(paragraph, inline_literal_re)
-    paragraph = _clean_heuristic(paragraph, inline_internal_target_re)
-    paragraph = _clean_heuristic(paragraph, hyperlink_references_re)
-    paragraph = _clean_heuristic(paragraph, anonymous_hyperlink_references_re)
-    paragraph = normal_role_re.sub("", paragraph)
+    if not keep_inline_literals:
+        paragraph = _clean_heuristic(paragraph, inline_literal_re)
+    if not keep_inline_internal_targets:
+        paragraph = _clean_heuristic(paragraph, inline_internal_target_re)
+    if not keep_hyperlink_references:
+        paragraph = _clean_heuristic(paragraph, hyperlink_references_re)
+    if not keep_anonymous_hyperlink_references:
+        paragraph = _clean_heuristic(paragraph, anonymous_hyperlink_references_re)
+    if not keep_roles:
+        paragraph = normal_role_re.sub("", paragraph)
     return paragraph.replace("\x00", "\\")
 
 
@@ -391,7 +403,9 @@ ascii_allowed_after_inline_markup = r"""[-.,:;!?/'")\]}>]"""
 unicode_allowed_after_inline_markup = r"[\p{Pe}\p{Pi}\p{Pf}\p{Pd}\p{Po}]"
 
 
-def inline_markup_gen(start_string, end_string, extra_allowed_before=""):
+def inline_markup_gen(
+    start_string, end_string, extra_allowed_before="", contents=".*?"
+):
     """Generate a regex matching an inline markup.
 
     inline_markup_gen('**', '**') geneates a regex matching strong
@@ -419,7 +433,7 @@ def inline_markup_gen(start_string, end_string, extra_allowed_before=""):
                        # The inline markup end-string must be separated by at least one
                        # character from the start-string.
         {QUOTE_PAIRS_NEGATIVE_LOOKBEHIND}
-        .*?
+        {contents}
         (?<=\S)       # Inline markup end-strings must be immediately preceded by non-whitespace.
         {end_string}  # Inline markup end
     )
@@ -909,6 +923,51 @@ def check_bad_dedent(file, lines, options=None):
     yield from errors
 
 
+def find_all_named_hyperlink_references(document):
+    """Finds all named hyperlink references in the given document.
+
+    There's two syntax for named hyperlink references:
+
+    - No start-string, end-string = "_".
+    - Start-string = "`", end-string = "`_". (Phrase references.)
+    """
+    document = clean_paragraph(document, keep_hyperlink_references=True)
+    phrase_reference_re = inline_markup_gen("`", "`_")
+    named_hyperlinks_found_in_phrase_ref = set()
+    while True:
+        phrase_reference = phrase_reference_re.search(document)
+        if not phrase_reference:
+            break
+        phrase_reference_str = phrase_reference[0].replace("\n", " ")
+        with_angle_brackets = re.match("`.*<(.*)>`_", phrase_reference_str)
+        if with_angle_brackets:
+            within_angle_brackets = with_angle_brackets[1]
+            if within_angle_brackets.endswith("_"):
+                named_hyperlinks_found_in_phrase_ref.add(within_angle_brackets[:-1])
+        else:
+            named_hyperlinks_found_in_phrase_ref.add(phrase_reference_str[1:-2])
+        document = (
+            document[: phrase_reference.start()] + document[phrase_reference.end() :]
+        )
+    reference_re = inline_markup_gen("(?!_)", "(?<!_)_", contents=r"\S*")
+    references = {
+        ref[:-1]
+        for ref in reference_re.findall(document)
+        if not ref.startswith("[#")  # We don't want footnotes here.
+    }
+    return references | named_hyperlinks_found_in_phrase_ref
+
+
+def find_all_hyperlink_targets(document):
+    """Finds all hyperlink targets in the given document.
+    It should look like:
+
+    .. _hyperlink-name: link-block
+    """
+    target_re = re.compile(r"^\s*\.\. _([^:]+): ", flags=re.M)
+    return set(target_re.findall(document))
+
+
 def parse_args(argv=None):
     """Parse command line argument."""
     if argv is None:

I played with a test:

from pathlib import Path

import pytest
from sphinxlint import (
    find_all_named_hyperlink_references,
    find_all_hyperlink_targets,
    hide_non_rst_blocks,
)

SMALL_DOC = """
Clicking on this internal hyperlink will take us to the target_
below.

.. _target: https://mdk.fr
"""

FIXTURE_DIR = Path(__file__).resolve().parent / "fixtures"


def test_find_all_named_hyperlink_references():
    found = find_all_named_hyperlink_references(SMALL_DOC)
    assert found == {"target"}


def test_find_all_hyperlink_targets():
    found = find_all_hyperlink_targets(SMALL_DOC)
    assert found == {"target"}


@pytest.mark.parametrize(
    "rst_file", [str(file) for file in FIXTURE_DIR.glob("**/*.rst")]
)
def test_hyperlinks(rst_file):
    rst_file = Path(rst_file)
    rst = rst_file.read_text(encoding="UTF-8")
    rst = "".join(hide_non_rst_blocks(rst.splitlines(keepends=True)))
    assert find_all_named_hyperlink_references(rst) == find_all_hyperlink_targets(rst)

It is particularily slow, it passes many files (don't forget to run sh download-more-tests.sh before pytest), but easily fails as soon as it finds an implicit hyperlink target, such as a title.

Oct 29 '22 09:10 JulienPalard

Maybe there's something I'm missing, but at least to do what this issue says on the tin—to find URI targets and ref target labels missing leading underscores—wouldn't something like the following regex do the trick?

^[ \t]*\.\. ([^_:\[ ][^:]*: (https?://|(ftp|mailto):)|[A-Za-z]([A-Za-z\-_.:+]*[A-Za-z])?:$)

It handles both types of targets, though you'll probably want to break up into two separate checks, one for hyperlink targets and one for ref target labels. Its able to tell intended URI targets nearly unambiguously from comments, as it checks to see if they actually contain the beginning of a URI with a valid protocol (i.e. meaningfully supported by common browsers for external user-facing links); to further reduce any change of false positives, we could add the full URL-matching regex and require the line to terminate after it, Otherwise, it checks if its a ref target label, which can only use simple reference names as defined in the reST spec and must end the line after the trailing :, and also doesn't match footnotes and citations, which must start with [ (and no _).

It works with the m flag, processing one line at a time, which should give a lot better performance than trying to match on the whole file.

I ran this against a basic unit test suite with RegExr with both positive and negative test cases from this thread, and everything passed. I then ran it against the CPython docs, the CPython NEWS, the PEPs repo, the devguide and the Spyder-IDE docs. There were no hits in anything but the CPython docs, as they all use -W and don't generally use ref target labels much (yet) that aren't used internally, which would generate a warning. In the CPython repo, there were two true positives (both ref target labels that were missing an underscore), and one false positive (a TODO comment that matched the same):

CPython grep output

$ git grep -Pn -B 1 -A 3 '^[ \t]*\.\. ([^_:\[ ][^:]*: (https?://|(ftp|mailto):)|[A-Za-z]([A-Za-z\-_.:+]*[A-Za-z])?:$)' Doc/**/*.rst
Doc/howto/logging-cookbook.rst-3779-
Doc/howto/logging-cookbook.rst:3780:.. patterns-to-avoid:
Doc/howto/logging-cookbook.rst-3781-
Doc/howto/logging-cookbook.rst-3782-Patterns to avoid
Doc/howto/logging-cookbook.rst-3783------------------
--
Doc/howto/regex.rst-8-
Doc/howto/regex.rst:9:.. TODO:
Doc/howto/regex.rst-10-   Document lookbehind assertions
Doc/howto/regex.rst-11-   Better way of displaying a RE, a string, and what it matches
Doc/howto/regex.rst-12-   Mention optional argument to match.groups()
--
Doc/installing/index.rst-138-
Doc/installing/index.rst:139:.. installing-per-user-installation:
Doc/installing/index.rst-140-
Doc/installing/index.rst-141-... install packages just for the current user?
Doc/installing/index.rst-142------------------------------------------------

However, that one last remaining class of false positives—comments whose first lines are entirely valid simple reference names (i.e. a single word) and end with a colon—should be able to be essentially entirely eliminated inside of Sphinx-Lint via a Python check (or via a negative lookahead in multiline regex, but that would probably be more complex and expensive) that the first non-whitespace line that follows any matches has the same indent level as the matching line, which a comment would not (unless it is a single line comment, which semantically would be very unlikely to be a single word ending with a colon with no followup text it is referring to).

To avoid comments-induced false-positives it would be nice to first search for hyperlink references, BUT as references could cross document it would need to first parse the whole project, which I don't like (feels out-of-scope for sphinx-lint).

AFAIK, plain reST hyperlink references are only per-file. It is only Sphinx cross-reference target labels (used by the :ref: role) that are cross file. However, I don't think there's any need for this, since in practice it seems a single regex with a bit of heuristics can distinguish them with very high specificity.

Furthermore, the real value in this check over Sphinx (since it indirectly warns for it otherwise) is for ref target labels you're not using right now, e.g. for sections, but create as a matter of course both for use in the future, or use by external Intersphinx users, as well as more robust external linking.

Also I bet Sphinx do whines if an hyperlink reference points to no hyperlink target.

Assuming I'm inferring what you mean to say here correctly, yes, this produces a warning.

I also bet that if the target is missing the leading _ it's treated as a comment by Sphinx.

Yes.

Oct 30 '22 18:10 CAM-Gerlach

sphinx-lint sphinx-lint copied to clipboard

Check for missing underscores in references

sphinx-lint
sphinx-lint copied to clipboard