HTML5 permalinks are not permanent if section header starts with number

Open abitrolly opened this issue 4 years ago • 41 comments

In the pip changelog, the slugs in HTML anchors do not permanently point to the corresponding version. Instead, they are incremental position numbers starting at #id1, so when a new version of pip is released, all anchors shift and point to a different version.

To Reproduce

#!/bin/bash

DOCDIR=testnumslug

rm -rf $DOCDIR
mkdir $DOCDIR

cat <<EOF > $DOCDIR/index.rst
Hi
==

1.2.0
-----

1.1.0
-----

1.0.0
-----
EOF

# Application error:
# config directory doesn't contain a conf.py file (testnumslug)
touch $DOCDIR/conf.py

sphinx-build $DOCDIR $DOCDIR/_html

echo -e "\n-----\n"

grep -R 'Permalink' $DOCDIR/_html/index.html

This gives the following output:

<h1>Hi<a class="headerlink" href="#hi" title="Permalink to this headline">¶</a></h1>
<h2>1.2.0<a class="headerlink" href="#id1" title="Permalink to this headline">¶</a></h2>
<h2>1.1.0<a class="headerlink" href="#id2" title="Permalink to this headline">¶</a></h2>
<h2>1.0.0<a class="headerlink" href="#id3" title="Permalink to this headline">¶</a></h2>
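
The same check can be done without grep. A minimal sketch using only the Python standard library; the class name HeaderlinkCollector is made up for illustration, and the sample HTML is the output shown above.

```python
from html.parser import HTMLParser


class HeaderlinkCollector(HTMLParser):
    """Collect the href of every <a class="headerlink"> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "headerlink" in classes:
            self.hrefs.append(attrs.get("href"))


SAMPLE = """
<h1>Hi<a class="headerlink" href="#hi" title="Permalink to this headline">¶</a></h1>
<h2>1.2.0<a class="headerlink" href="#id1" title="Permalink to this headline">¶</a></h2>
<h2>1.1.0<a class="headerlink" href="#id2" title="Permalink to this headline">¶</a></h2>
<h2>1.0.0<a class="headerlink" href="#id3" title="Permalink to this headline">¶</a></h2>
"""

collector = HeaderlinkCollector()
collector.feed(SAMPLE)
print(collector.hrefs)
```

Any href of the form #idN in the result is a positional anchor that will shift when a heading is added above it.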

Expected behavior

<h1>Hi<a class="headerlink" href="#hi" title="Permalink to this headline">¶</a></h1>
<h2>1.2.0<a class="headerlink" href="#1.2.0" title="Permalink to this headline">¶</a></h2>
<h2>1.1.0<a class="headerlink" href="#1.1.0" title="Permalink to this headline">¶</a></h2>
<h2>1.0.0<a class="headerlink" href="#1.0.0" title="Permalink to this headline">¶</a></h2>

Environment info

  • Python version: 3.9.1
  • Sphinx version: 3.2.1

Additional context

  • https://github.com/pypa/pip/issues/8152

abitrolly avatar Jan 19 '21 21:01 abitrolly

The current behaviour is likely due to the HTML4 spec:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

So id="1.2.0" is technically invalid (although I suspect many browsers would handle it fine, since HTML5 loosened the restriction).
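
The quoted HTML4 rule can be written down as a regular expression. This is a small sketch for illustration only; the function name is_valid_html4_id is hypothetical and not part of Sphinx or Docutils.

```python
import re

# HTML4 ID/NAME token: a letter, then letters, digits, "-", "_", ":" or ".".
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9_:.\-]*")


def is_valid_html4_id(value: str) -> bool:
    """Return True if value satisfies the HTML4 ID token rule quoted above."""
    return HTML4_ID.fullmatch(value) is not None


for candidate in ("hi", "1.2.0", "v1.2.0", "id-1_2_0"):
    print(candidate, is_valid_html4_id(candidate))
```

Only the leading character is the problem for 1.2.0; the periods themselves are allowed by HTML4.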

The behaviour still feels quite unintuitive to me, however. I would expect Sphinx to generate something more stable, such as id="id-1_2_0" instead.

With that said, you can always specify an explicit reference yourself:

Hi
==

.. _v1_2_0:

1.2.0
-----

.. _v1_1_0:

1.1.0
-----

.. _v1_0_0:

1.0.0
-----

This would always work regardless of the section title.
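
For a long changelog, the explicit targets could be added by a script. A minimal sketch, assuming headings are bare version numbers followed by an underline and that the v1_2_0 label style above is wanted; add_version_targets is a hypothetical helper, not a Sphinx feature.

```python
import re

VERSION = re.compile(r"\d+(?:\.\d+)+")


def add_version_targets(rst: str) -> str:
    """Insert ``.. _vX_Y_Z:`` before every heading that is a bare version."""
    lines = rst.splitlines()
    out = []
    for i, line in enumerate(lines):
        underline = lines[i + 1] if i + 1 < len(lines) else ""
        is_heading = (
            VERSION.fullmatch(line) is not None
            and len(underline) >= len(line)
            and len(set(underline)) == 1
            and underline[0] in "=-~^"
        )
        if is_heading:
            out.append(".. _v" + line.replace(".", "_") + ":")
            out.append("")
        out.append(line)
    return "\n".join(out) + "\n"


source = "Hi\n==\n\n1.2.0\n-----\n\n1.1.0\n-----\n"
print(add_version_targets(source))
```

The sketch is not idempotent: running it twice on the same file would insert each target twice.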

uranusjr avatar Jan 20 '21 08:01 uranusjr

Sphinx generates HTML5 by default since 2.0 https://github.com/sphinx-doc/sphinx/issues/4587

abitrolly avatar Jan 20 '21 09:01 abitrolly

It comes from the node ID generation rule of docutils; the core library of Sphinx. It was defined to support many kinds of formats. https://repo.or.cz/docutils.git/blob/HEAD:/docutils/docutils/nodes.py#l2220
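
A simplified sketch of that rule, to show why a version heading ends up without a name-based ID. It is an approximation, not the Docutils source: the character translation tables are left out, and simplified_make_id is a made-up name. The two regular expressions are the ones quoted later in this thread.

```python
import re
import unicodedata

_non_id_chars = re.compile("[^a-z0-9]+")
_non_id_at_ends = re.compile("^[-0-9]+|-+$")


def simplified_make_id(string: str) -> str:
    """Approximation of docutils.nodes.make_id (translation tables omitted)."""
    id = string.lower()
    # drop every character that has no ASCII decomposition
    id = unicodedata.normalize("NFKD", id).encode("ascii", "ignore").decode("ascii")
    # shrink runs of whitespace, replace runs of non-ID characters by a hyphen
    id = _non_id_chars.sub("-", " ".join(id.split()))
    # strip leading digits/hyphens and trailing hyphens
    return _non_id_at_ends.sub("", id)


for title in ("Hi", "1.2.0", "3 ways to contribute"):
    print(repr(title), "->", repr(simplified_make_id(title)))
```

A heading made only of digits and punctuation is reduced to the empty string, which is when the positional fallback (id1, id2, ...) is used.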

tk0miya avatar Jan 20 '21 13:01 tk0miya

What is the role of this function then?

https://github.com/sphinx-doc/sphinx/blob/3ed7590ed411bd93b26098faab4f23619cdb2267/sphinx/util/nodes.py#L435-L439

abitrolly avatar Jan 20 '21 13:01 abitrolly

It's a local ID generator for Sphinx domains. It does not relate to the section IDs.

tk0miya avatar Jan 20 '21 13:01 tk0miya

It looks like an override by import path. Although it doesn't make it any better.

In [2]: from docutils.nodes import make_id
In [3]: make_id('1.2.0')
Out[3]: ''
In [7]: from sphinx.util.nodes import _make_id
In [8]: _make_id('1.2.0')
Out[8]: ''

Is it possible to define a function html5_id(string: str) and delegate HTML id generation to it?
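
A rough idea of what such a function could look like. This is a hypothetical sketch, not an existing Sphinx or Docutils API, and it assumes the only normalisation wanted is replacing whitespace runs with a hyphen.

```python
import re

_ascii_whitespace = re.compile(r"[ \t\n\f\r]+")


def html5_id(string: str) -> str:
    """Hypothetical HTML5-only slug: keep everything except whitespace."""
    slug = _ascii_whitespace.sub("-", string.strip(" \t\n\f\r"))
    if not slug:
        raise ValueError("an HTML5 id must contain at least one character")
    return slug


for title in ("1.2.0", "3 ways to contribute"):
    print(repr(title), "->", repr(html5_id(title)))
```

Uniqueness within a document is not handled here; a real implementation would still need a fallback for duplicate titles.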

abitrolly avatar Jan 20 '21 16:01 abitrolly

The same behavior also goes with non-ASCII headers, producing idX. If a header consists of both ASCII and non-ASCII characters, all non-ASCII parts will be removed.

madjxatw avatar Jan 21 '21 06:01 madjxatw

@madjxatw understood. HTML5 removes nearly all restrictions from IDs (a value only has to be non-empty and free of whitespace), which makes even these valid.

<p id="#">Foo.
<p id="##">Bar.
<p id="♥">Baz.
<p id="©">Inga.
<p id="{}">Lorem.
<p id="“‘’”">Ipsum.
<p id="⌘⌥">Dolor.
<p id="{}">Sit.
<p id="[attr=value]">Amet.
<p id="++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.">Hello world!

https://mathiasbynens.be/notes/html5-id-class
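
To illustrate what this means for links, here is a small sketch that checks a value against the HTML5 rule (at least one character, no ASCII whitespace) and builds the percent-encoded URL fragment for it. The helper names are made up for this example.

```python
from urllib.parse import quote

ASCII_WHITESPACE = " \t\n\f\r"


def is_valid_html5_id(value: str) -> bool:
    """True if value is non-empty and contains no ASCII whitespace."""
    return bool(value) and not any(ch in ASCII_WHITESPACE for ch in value)


def fragment_for(value: str) -> str:
    """URL fragment pointing at an element with this id."""
    return "#" + quote(value, safe="")


for value in ("1.2.0", "♥", "[attr=value]"):
    print(value, is_valid_html5_id(value), fragment_for(value))
```

A version string such as 1.2.0 needs no escaping at all in the fragment, which is part of why it makes a convenient permalink.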

abitrolly avatar Jan 21 '21 09:01 abitrolly

@abitrolly, exactly, so Sphinx needs to implement an HTML5 version of make_id() to stay consistent with its default HTML5 output.

madjxatw avatar Jan 21 '21 15:01 madjxatw

@madjxatw not Sphinx, some human needs to sit down and write the code. While the code seems trivial, right now it is unclear where to place the code.

abitrolly avatar Jan 21 '21 16:01 abitrolly

@abitrolly, hopefully some unicode slugifier (e.g. https://github.com/mozilla/unicode-slugify) could be used as an extension or be integrated somehow into Sphinx.

madjxatw avatar Jan 21 '21 16:01 madjxatw

Sphinx still supports HTML4 output. The HTML Help builder also depends on HTML4. In addition, I can't say the change would not affect other builders. Sphinx is not only for building HTML5.

tk0miya avatar Jan 21 '21 16:01 tk0miya

@tk0miya, is it possible to have an option that lets users decide whether to enable unicode permalink?

madjxatw avatar Jan 21 '21 16:01 madjxatw

I can't promise the option works fine for "all" builders. If I added it to Sphinx, I'd describe it as "it might work, but no promises. Please don't report it to us even if something breaks" :-p I don't think I can provide such an option officially. Please hack at your own risk.

tk0miya avatar Jan 21 '21 16:01 tk0miya

That's all right, it wouldn't be a big problem to hack it ourselves. However, it is still a bit of a pity that unicode IDs are not officially supported, especially for non-English (e.g. East Asian) writers who really need IDs in their own language's characters. :-(

madjxatw avatar Jan 21 '21 16:01 madjxatw

I don't know about the details about how Sphinx works internally, but couldn't a custom unicode ID maker be invoked only when the HTML5 builder is in use?

idnsunset avatar Jan 21 '21 17:01 idnsunset

I don't know about the details about how Sphinx works internally, but couldn't a custom unicode ID maker be invoked only when the HTML5 builder is in use?

It's difficult. The node IDs are generated in the reading phase. The result of that phase is cached and used for incremental builds. This means introducing a new kind of ID would break the incremental build feature.

tk0miya avatar Jan 21 '21 17:01 tk0miya

It's difficult. The node IDs are generated in the reading phase. The result of that phase is cached and used for incremental builds. This means introducing a new kind of ID would break the incremental build feature.

But HTML IDs should not be node IDs. Is it possible during initial read to generate IDs in a structure that allows to output a proper HTML5 slug on writing? Like if ID is autogenerated from the title, store the title.

abitrolly avatar Jan 21 '21 19:01 abitrolly

But HTML IDs should not be node IDs. Is it possible during initial read to generate IDs in a structure that allows to output a proper HTML5 slug on writing? Like if ID is autogenerated from the title, store the title.

Of course, it's possible if you give a wonderful patch! (IMO, it's impossible to me as I commented "it's difficult" above).

tk0miya avatar Jan 23 '21 04:01 tk0miya

@tk0miya what do you mean by "it's difficult"? It would help if you could point to the locations where Sphinx reads and caches node IDs, and where to insert a write_html5_id call.

abitrolly avatar Jan 23 '21 04:01 abitrolly

The cross-referencing system of Sphinx is based on the node IDs, so I can't imagine how we could replace them with unicode IDs. I guess we would need to rewrite the whole of docutils and Sphinx, so I can't tell you where to do that.

tk0miya avatar Jan 23 '21 05:01 tk0miya

@tk0miya the idea is not about replacing internal node IDs. It is about writing IDs in HTML5 format on output to HTML5. All IDs written this way will be consistent.
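
One way to picture this proposal, as a purely hypothetical sketch that does not use any Sphinx or Docutils API: internal node IDs stay untouched, and a writer-side table maps each auto-numbered one to the ID that is actually emitted. Section, output_id_map, and the slug rule are all invented for illustration.

```python
import re
from dataclasses import dataclass

AUTO_ID = re.compile(r"id\d+")


@dataclass
class Section:
    node_id: str  # internal ID produced at read time, e.g. "id1"
    title: str    # title stored alongside the node ID


def output_id_map(sections):
    """Map internal node IDs to the IDs an HTML5 writer would emit."""
    mapping = {}
    # name-based IDs are kept as they are and reserved up front
    used = {s.node_id for s in sections if not AUTO_ID.fullmatch(s.node_id)}
    for s in sections:
        out = s.node_id
        if AUTO_ID.fullmatch(s.node_id):
            candidate = "-".join(s.title.split())
            # fall back to the positional ID on an empty or clashing slug
            if candidate and candidate not in used:
                out = candidate
        used.add(out)
        mapping[s.node_id] = out
    return mapping


sections = [Section("hi", "Hi"), Section("id1", "1.2.0"), Section("id2", "1.1.0")]
mapping = output_id_map(sections)
print(mapping)
```

Both the id attribute and every href pointing at it would have to go through the same table, so that cross-references stay consistent in the output.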

abitrolly avatar Jan 23 '21 09:01 abitrolly

I don't know how to do that. But all contributions are welcome!

tk0miya avatar Jan 23 '21 09:01 tk0miya

I am afraid I can go on only with funded contributions. Learning a codebase like this in my free time is not sustainable. A pity that this seemingly simple generator turned out to be that complex on the inside.

abitrolly avatar Jan 23 '21 09:01 abitrolly

It comes from the node ID generation rule of docutils; the core library of Sphinx. It was defined to support many kinds of formats.

The rationale and details of this design decision are explained in https://docutils.sourceforge.io/docs/ref/rst/directives.html#identifier-normalization

There is an open feature request for less restrictive IDs.

For the original question: setting an id-prefix will keep permanent IDs on section headings starting with a number since Docutils 0.18, so this can provide a workaround in future Sphinx versions.

gmilde avatar Nov 09 '21 22:11 gmilde

Still stumbling over this in 2025. Especially bothersome with changelogs.

Just as a demonstration: the latest release of black will always be

https://black.readthedocs.io/en/stable/change_log.html#id1

there is no way to hard link to older versions.

This should not be an issue anymore. Especially since it is valid HTML5.

There should at least be a way to overwrite this behavior.

Kamik423 avatar Jul 29 '25 08:07 Kamik423

The workaround with an id-prefix can be used with current Docutils and Sphinx:

Example

In docutils.conf, set:

[parsers]
id_prefix: black:

Then, the "self-link" for the heading 25.1.0 becomes <a class="headerlink" href="#black:25-1-0" title="Link to this heading">¶</a>.

Side-effect: all ids in the project are prefixed with black: (reStructuredText "reference names" are not affected).
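
A rough illustration of why the prefix helps. This is a sketch under an assumption, not the Docutils code path: it assumes that, when a prefix is set, the heading is normalised with a placeholder letter in front so the leading digits are not stripped. simplified_make_id and prefixed_id are made-up names.

```python
import re

_non_id_chars = re.compile("[^a-z0-9]+")
_non_id_at_ends = re.compile("^[-0-9]+|-+$")


def simplified_make_id(string: str) -> str:
    """ASCII-only approximation of the Docutils ID normalisation."""
    id = _non_id_chars.sub("-", " ".join(string.lower().split()))
    return _non_id_at_ends.sub("", id)


def prefixed_id(title: str, id_prefix: str) -> str:
    # Assumption: normalise with a placeholder letter in front so that
    # the leading digits survive, then drop the placeholder again.
    base = simplified_make_id("x" + title)[1:]
    return id_prefix + base


print(repr(simplified_make_id("25.1.0")))   # no prefix: nothing is left
print(prefixed_id("25.1.0", "black:"))      # with the prefix from the example
```

With the prefix in front, the resulting ID no longer starts with a digit, so there is nothing that needs to be stripped.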

A simpler workaround would be to use headings starting with an ASCII letter (release 25.1.0 or similar) or to provide explicit targets like

.. _release 25.1.0:

25.1.0
======

A configurable ID generation is still on the TODO list. Doing this right will break a lot of existing behaviour, so this needs to be done very carefully.

gmilde avatar Jul 30 '25 20:07 gmilde

A configurable ID generation is still on the TODO list. Doing this right will break a lot of existing behaviour, so this needs to be done very carefully.

Fixing permalinks for numbered headers will be sufficient to close this issue.

abitrolly avatar Jul 31 '25 22:07 abitrolly

I've looked into this, and this may be a bug in both docutils and sphinx. I'm not entirely sure if sphinx.util.nodes._make_id overrides every instance of the fragment/anchor definition, but I can see this behaviour is present in docutils as well as sphinx.

https://github.com/sphinx-doc/sphinx/blob/9eb3d7940fa587f5706268f19a1d6a977ff24fad/sphinx/util/nodes.py#L538-L562

https://github.com/docutils/docutils/blob/a5b983b73263445510d032845a70082eec7e2ca9/docutils/docutils/nodes.py#L2942-L2987 (not sure why this isn't embedding)

Changing the end of this definition in docutils to use the following seems to work:

    # shrink runs of whitespace and replace by hyphen
    id = _non_id_chars.sub('-', ' '.join(id.split()))
    id = id.rstrip('-+')
    if not (clean_id := id.lstrip('0123456789-')):
        clean_id = "id-" + id
    return str(clean_id)

This could probably be optimized further, but it converts _non_id_at_ends: re.Pattern[str] = re.compile('^[-0-9]+|-+$') to be applied without regex, and continues to trim the end of the ID. If the ID is zeroed from the regex, then we take the original ID and apply id- to the front of it.
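
To try the snippet in isolation, it can be wrapped in a function. This is a sketch only: the lowercase step stands in for the earlier part of make_id, which is not shown here, and make_id_keep_numeric is a made-up name.

```python
import re

_non_id_chars = re.compile("[^a-z0-9]+")


def make_id_keep_numeric(string: str) -> str:
    """First proposal: fall back to "id-" + id when stripping leaves nothing."""
    id = string.lower()  # stand-in for the earlier normalisation steps
    # shrink runs of whitespace and replace by hyphen
    id = _non_id_chars.sub("-", " ".join(id.split()))
    id = id.rstrip("-+")
    if not (clean_id := id.lstrip("0123456789-")):
        clean_id = "id-" + id
    return str(clean_id)


for title in ("1.2.0", "3 ways to contribute", "Hi"):
    print(repr(title), "->", repr(make_id_keep_numeric(title)))
```

One edge case of this variant: a title without any ASCII letters or digits yields the bare string id-, so such titles would still need the positional fallback.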

This may break existing fragments on existing projects, so proper thought and consideration is necessary for this bugfix.

Alternative implementation

If the id starts with a disallowed character, prepend id- to it, after using lstrip('-')

This would keep the entire string, so existing names such as 3 ways to contribute now become id-3-ways-to-contribute rather than ways-to-contribute (which might be a downside in this particular example).

    # shrink runs of whitespace and replace by hyphen
    id = _non_id_chars.sub('-', ' '.join(id.split()))
    id = id.lstrip('-').rstrip('-+')
    if _non_id_at_ends.match(id):
        id = "id-" + id.lstrip()
    return str(id)


_non_id_chars: re.Pattern[str] = re.compile('[^a-z0-9]+')
_non_id_at_ends: re.Pattern[str] = re.compile('^[0-9]')
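
The alternative can be exercised the same way. Again a sketch: the lowercase step replaces the earlier part of make_id, and make_id_prefix_numeric is a made-up name.

```python
import re

_non_id_chars: re.Pattern[str] = re.compile("[^a-z0-9]+")
_non_id_at_ends: re.Pattern[str] = re.compile("^[0-9]")


def make_id_prefix_numeric(string: str) -> str:
    """Alternative proposal: prepend "id-" whenever the id starts with a digit."""
    id = string.lower()  # stand-in for the earlier normalisation steps
    # shrink runs of whitespace and replace by hyphen
    id = _non_id_chars.sub("-", " ".join(id.split()))
    id = id.lstrip("-").rstrip("-+")
    if _non_id_at_ends.match(id):
        id = "id-" + id.lstrip()
    return str(id)


for title in ("1.2.0", "3 ways to contribute", "Hi"):
    print(repr(title), "->", repr(make_id_prefix_numeric(title)))
```

Compared with the first variant, version headings get the same ID, while headings that merely start with a number change from their current value.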

I'm hoping to fix this, as I was interested in fixing it for the likes of black, pytest-cov, and more, but would love to fix it upstream so the entire ecosystem can benefit.

onerandomusername avatar Nov 05 '25 21:11 onerandomusername

I've done some checking of the ecosystem to see what's affected on other projects, as this would impact a swath of notable projects.

Known affected projects (incomplete)

  • requests: https://requests.readthedocs.io/en/latest/community/updates/#id2
  • urllib3: https://urllib3.readthedocs.io/en/stable/changelog.html#id1
  • pypa: https://www.pypa.io/en/latest/history/#id1
  • flake8: https://flake8.pycqa.org/en/latest/release-notes/7.3.0.html#id1
  • PyNaCl: https://pynacl.readthedocs.io/en/latest/changelog/#id1
  • django: https://docs.djangoproject.com/en/5.2/releases/#id1
  • black: https://black.readthedocs.io/en/stable/change_log.html#id1
  • cython: https://cython.readthedocs.io/en/latest/src/changes.html#id1

Existing projects with workarounds

  • sqlalchemy: https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-2.0.42
  • alembic: https://alembic.sqlalchemy.org/en/latest/changelog.html#change-1.17.2
  • cryptography*: https://cryptography.io/en/latest/changelog/#v46-0-3 (seems to have a workaround but would otherwise be impacted)
  • furo*: https://pradyunsg.me/furo/changelog/#energetic-eminence (less of an impact because each version has a code name)
  • pip*: https://pip.pypa.io/en/stable/news/#v25-3 (patched with a local plugin)

Potential negative impacts

https://bugzilla.readthedocs.io/en/latest/using/creating-an-account.html#creating-an-account (at least one example but more exist)

Source of these projects: https://www.sphinx-doc.org/en/master/examples.html, combined with some moderately random selection, and manual checking of changelog pages.
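
The manual check could be partly automated. A small sketch that flags links whose fragment is a positional ID; has_positional_anchor is a hypothetical helper, and the URLs are taken from the lists above.

```python
import re
from urllib.parse import urlsplit

POSITIONAL = re.compile(r"id\d+")


def has_positional_anchor(url: str) -> bool:
    """True if the URL fragment looks like an auto-numbered ID (id1, id2, ...)."""
    return POSITIONAL.fullmatch(urlsplit(url).fragment) is not None


urls = [
    "https://requests.readthedocs.io/en/latest/community/updates/#id2",
    "https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-2.0.42",
    "https://pip.pypa.io/en/stable/news/#v25-3",
]
for url in urls:
    print(has_positional_anchor(url), url)
```

This only inspects the link itself; finding the links still means opening each changelog page and copying a heading permalink.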

onerandomusername avatar Nov 05 '25 22:11 onerandomusername