Tests: change-in-output with docutils r10151 (0.22-rc)
Describe the bug
Revision r10151 of docutils (aka commit b25db649a5d073d76424e6434e12cdba53ed57e8 in the project's git repository) introduces a change that alters the label identifiers produced in the output of some Sphinx builds.
My understanding is that this should not negatively impact users -- the output documentation is still valid, I believe -- but it does cause a failure in one of our LaTeX test cases, test_latex_labels.
In particular, the expected label identifiers for a figure element and a table element differ; previously they were id1 and id2 respectively - after the docutils revision r10151, they become id2 and id3.
How to Reproduce
$ git clone https://github.com/sphinx-doc/sphinx.git && cd sphinx
$ git checkout 546170754f3f2f96c0d12176b2d2fb5688ca75ab
$ python3 -m venv .venv && source .venv/bin/activate
$ pip install '.[test]'
$ pip install git+https://repo.or.cz/docutils.git@b25db649a5d073d76424e6434e12cdba53ed57e8#subdirectory=docutils
$ pytest tests/test_builders/test_build_latex.py -k test_latex_labels
Environment Information
Platform: linux; (Linux-6.12.27-rt-amd64-x86_64-with-glibc2.41)
Python version: 3.13.3 (main, Apr 10 2025, 21:38:51) [GCC 14.2.0])
Python implementation: CPython
Sphinx version: 8.3.0+/546170754
Docutils version: 0.22rc3.dev
Jinja2 version: 3.1.6
Pygments version: 2.19.1
Sphinx extensions
N/A
Additional context
- Previously/recently discussed in #13606 (an thread addressing a separate, current unit test failure).
An additional, related question: the id1 hyperlink target previously emitted for the figure is now placed into a LaTeX \phantomsection; is that something we should be concerned about?
(both @jfbu and I had noticed that in the #13606 discussion thread, but I neglected to document it during the initial writeup of this issue)
I've merged #13610 to fix master, but will keep this open until we're sure everything is OK.
Thanks @AA-Turner - agreed, makes sense; I'm not completely confident about the approach taken in #13610.
An additional, related question: the
id1hyperlink target previously emitted for thefigureis now placed into a LaTeX\phantomsection; is that something we should be concerned about?
I think we should a bit, as the HTML output has somehow gotten rid of this extra target. Admittedly \phantomsection\label{\detokenize{index:id1}} following a (cross-) reference implemented by \hyperref mark-up does not cause changes in the output PDF layout. But if there are hundreds of them it will make the PDF bigger as well has auxiliary files and thus cause some slight slow-down in the build times. Perhaps the HTML builder has a transform getting rid of creating such targets if not needed?
Sorry now for utterly naive question but what would be the reST/Sphinx mark-up to actually create a reference (not artificially, i.e. not knowing in advance which of index:id1, index:id2, index.id3 is going to be used for a source file named index.rst) to these targets so that I could test (either in HTML or LaTeX) that they actually are correctly referred-to. As for now, the used reST source in this test-latex-labels does not generate usage of hyperlinks targeting the id1 associated with the "embedded hyperlink reference" (this was edited to not mention here id2 and id3 associated in this example with some figure and table directives). It is time to admit I have never been able to understand Docutils to start with in these matters, finding the vocabulary already confusing. For example this is the reason some improvements I would like to achieve with footnotes is indefinitely delayed, because I don't understand the Docutils code to start with (the Sphinx LaTeX footnote transform is nicely documented but to make a variation achieving other goals, it is needed to actually understand the Docutils layer, which I have tried from time to time and never could achieve).
An additional, related question: the
id1hyperlink target previously emitted for thefigureis now placed into a LaTeX\phantomsection; is that something we should be concerned about?
This is probably due to another recent commit:
r10152 changes the LaTeX writer to no longer ignore "ids" in inline elements. Inline elements with non-empty "ids" attribute will now generate a label and (unless the parent is a title or caption) a \phantomsection to allow cross references to this point.
Where exactly is the id1 placed? On the <figure>, <image>, or <caption> Doctree element or some other element inside these?
Off-topic: basic.css makes it so that ¶ headerlinks are made visible on hover for table captions, but not so for figure labels.
Where exactly is the
id1placed? On the<figure>,<image>, or<caption>Doctree element or some other element inside these?
Both HTML and LaTeX have similar mark-up such as
<table class="docutils align-default" id="id2">
<caption><span class="caption-text">table caption</span><a class="headerlink" href="#id2" title="Link to this table">¶</a></caption>
But now regarding the source
Embedded standalone hyperlink reference: `subsection <section1_>`_.
it renders in LaTeX as
Embedded standalone hyperlink reference: {\hyperref[\detokenize{index:section1}]{\sphinxcrossref{subsection}}}\phantomsection\label{\detokenize{index:id1}}.
(and other idN's are shifted) whereas in HTML it still renders as
<p>Embedded standalone hyperlink reference: <a class="reference internal" href="#section1">subsection</a>.</p>
and the other idN's are not modified.
This is probably due to another recent commit: r10152 c
We saw the modification in LaTeX output already at revision 10151.
I think we should a bit, as the HTML output has somehow gotten rid of this extra target. Admittedly
\phantomsection\label{\detokenize{index:id1}}following a (cross-) reference implemented by\hyperrefmark-up does not cause changes in the output PDF layout. But if there are hundreds of them it will make the PDF bigger as well has auxiliary files and thus cause some slight slow-down in the build times. Perhaps the HTML builder has a transform getting rid of creating such targets if not needed?
In HTML, all elements can act as target, if they have an ID. In LaTeX the cost for potential targets is higher: e.g. <p id="id1">Paragraph 1.</p> corresponds to
\phantomsection\label{id1}
Paragraph 1.
Has HTML actually dropped the identifier id1?
This is probably due to another recent commit: r10152 changes the LaTeX writer to no longer ignore "ids" in inline elements. Inline elements with non-empty "ids" attribute will now generate a label and (unless the parent is a title or caption) a
\phantomsectionto allow cross references to this point.
Are you referring to the Docutils LaTeX writer? Sphinx has its own so the \phantomsection is generated there due to the doctree having some label.
Has HTML actually dropped the identifier
id1?
Yes it has.
But now regarding the source ...
This makes the issue a lot clearer: Let's assume there is a "section1" somewhere, then
`subsection <section1_>`_.
is translated to the Doctree (in "pseudoxml" representation):
<reference name="subsection" refid="section1">
subsection
<target ids="subsection" names="subsection" refid="section1">
The target is a an indirect target, it helps resolving references by mapping refname="subsection" to refid="section1"
(see docutils.transforms.references.InternalTargets).
The Docutils writers skip indirect targets (i.e. <target> elements with:
def visit_target(self, node) -> None:
# Skip indirect targets:
if ('refuri' in node # external hyperlink
or 'refid' in node # resolved internal link
or 'refname' in node): # unresolved internal link
return
...
It seems the Sphinx LaTeX writer has a different test in its
visit_target() method that fails to correctly filter indirect targets
and the addition of an ids attribute triggered the additional
{\sphinxcrossref{subsection}}}\phantomsection\label{\detokenize{index:id1}}.
It is, however, unclear why it uses an auto-id ("id1") instead of the existing ID ("subsection"). The label is not required (as links to "subsection" are resolved by a transfrom) and with the auto-id it is useless and confusing.
Aside:
There is a misnomer in the test sample:
> Embedded standalone hyperlink reference: `subsection <section1_>`_.
A standalone hyperlink is always external and must start with a recognized URI scheme. E.g., https:example.org becomes
<reference refuri="https:example.org">https:example.org</reference>
A hyperlink reference may embed a URI reference (external target) or an alias reference (indirect internal target).
The above hyperlink syntax is actually a "named hyperlink reference with embedded alias reference" equivalent to the named hyperlink reference plus named indirect internal target
subsection_
.. _subsection: section1_
@gmilde We will fix the wording of the test later, thanks for explanations.
As per the visit_target() from Sphinx LaTeX writer it has indeed a more complex logic. Here is how it looked like before merge of #11333. I am omitting this changeset because it makes it look more complicated, and even with it reverted I observe the new id1 decoration when testing with Docutils current HEAD https://github.com/sphinx-doc/sphinx/blob/5a6b2b16b6b588cdd4dab1b43b7873153da10baa/sphinx/writers/latex.py#L1459-L1503
The test failure uncovers a Sphinx LaTeX writer bug and a possible improvement:
Bug:
The Sphinx LaTeX writer writes a label for an indirect hyperlink target.
Indirect targets are handled by the references.IndirectHyperlinks transform and should be ignored by writers.
Using an auto-id instead of the target's ID adds to the confusion.
Enhancement proposal:
Use tagnames instead of "id" as prefix for auto-generated IDs. This results in more meaningfull auto-IDs helps to limit the changes if rST-syntax leading to auto-ID generation is added or removed. See Docutils commits r8403 and r8404 for a similar change in Docutils and its effect on test samples. (cf. also r8771).
It seems the source of our problems is in
if node.get('refid'):
prev_node = get_prev_node(node)
if isinstance(prev_node, nodes.reference) and node['refid'] == prev_node['refid']:
# a target for a hyperlink reference having alias
pass
else:
add_target(node['refid'])
In our case the source `subsection <section1_>`_ will generate as you explained a node <target ids="['id1']" names="['subsection']" refid="section1"/> and prev_node is <reference name="subsection" refid="section1">subsection</reference> so the "pass" branch is executed. On exit we still have
for id in node['ids']:
add_target(id)
so I wonder if the pass branch should have been a return rather, but I have a difficulty understanding "hyperlink reference having alias" terminology.
It seems the source of our problems is in
if node.get('refid'):
Indeed. (Aside, it should be changed to if 'refid' in node:.)
prev_node = get_prev_node(node) if isinstance(prev_node, nodes.reference) and node['refid'] == prev_node['refid']: # a target for a hyperlink reference having alias pass else: add_target(node['refid'])In our case the source `` `subsection <section1_>`_ `` will generate as you explained a node `<target ids="['id1']" names="['subsection']" refid="section1"/>` and `prev_node` is `<reference name="subsection" refid="section1">subsection</reference>` so the "pass" branch is executed.
This is strange code in several ways:
- why test and pass instead of testing for the negation?
- A "refid" is a reference to some other element's ID. Adding a label at the place of the indirect target referencing an element somewhere else seem simply wrong.
On exit we still have
for id in node['ids']: add_target(id)
This explains, why a label is written since Docutils equips a <target> element with non-empty "ids". It does not explain, why the id is replaced by an auto-generated id!
so I wonder if the
passbranch should have been areturnrather,
AFAIK, the correct way would be to ignore all indirect targets, i.e.
if 'refid' in node:
return
However, I may miss something due to Sphinx != Docutils. What would be the test results with this simple change?
For what it is worth if I replace pass by return in the visit_target() of LaTeX writer at the location mentioned in previous comment, all our tests pass [^1] and with Docutils 0.21.2 the exact same sphinx.tex is produced of our own docs. With Docutils at r10153, there is a small diff, showing that the \phantomsection\label{..} occurs only twice:
--- sphinxreturn.tex 2025-06-03 16:53:51.266507000 +0200
+++ sphinxcurrent.tex 2025-06-03 16:52:13.476215793 +0200
@@ -57945,7 +57945,7 @@
\sphinxnolinkurl{https://docutils.sourceforge.io/docs/ref/rst/directives.html\#raw}
%
\end{footnote} directive, or also from a LaTeX environment associated to a
-{\hyperref[\detokenize{latex:latexcontainer}]{\sphinxcrossref{container class}}} and using such
+{\hyperref[\detokenize{latex:latexcontainer}]{\sphinxcrossref{container class}}}\phantomsection\label{\detokenize{latex:container-class}} and using such
\sphinxcode{\sphinxupquote{%
\PYG{k}{\PYGZbs{}sphinxsetup}\PYG{n+nb}{\PYGZob{}}...\PYG{n+nb}{\PYGZcb{}}%
}}.
@@ -59185,7 +59185,7 @@
\sphinxAtStartPar
Sphinx LaTeX support code is split across multiple smaller\sphinxhyphen{}sized files.
Rather than adding code to the preamble via
-{\hyperref[\detokenize{latex:latex-elements-confval}]{\sphinxcrossref{latex\_elements}}}{[}\sphinxcode{\sphinxupquote{\textquotesingle{}preamble\textquotesingle{}}}{]} it is
+{\hyperref[\detokenize{latex:latex-elements-confval}]{\sphinxcrossref{latex\_elements}}}\phantomsection\label{\detokenize{latex:latex-elements}}{[}\sphinxcode{\sphinxupquote{\textquotesingle{}preamble\textquotesingle{}}}{]} it is
also possible to replace entirely one of the component files of Sphinx
LaTeX code with a custom version, simply by including a modified copy in
the project source and adding the filename to the
The first chunk has indeed a latex:container-class label in our current sphinx.tex which is never referred-to elsewhere. The second chunk has similarly latex:latex-elements label after an internal cross reference, and this should not exist and did not exist with using Docutils 0.21.2 because for some reason the node did not have the corrected refidids.
So a quick resolve would be s/pass/return but I would prefer confirmation by some competent maintainer.
[^1]: I am using pytest and for some reason I get 1 FAILED tests/test_extensions/test_ext_autodoc_importer.py::test_import_native_module_stubs - ModuleNotFoundError: No module named 'fish_licence.halibut' I suppose out actual C/I tests mark this as xfail?
but I have a difficulty understanding "hyperlink reference having alias" terminology.
Hyperlinks with embedded URI or alias are the most complex part. If you want to comprehend rST hyperlink syntax, start with "normal" links and targets like
_`The parrot` is a bird with a strong curved beak ...
We like `parrots`_.
.. _parrots: `the parrot`_
Actually, you don't need to bother about rST syntax as the writer processes Doctree elements.¹
In our example:
<document source="/tmp/foo.rst">
<paragraph>
<target ids="the-parrot" names="the\ parrot">
The parrot
is a bird with a strong curved beak …
<paragraph>
We like
<reference name="parrots" refid="the-parrot">
parrots
.
<target ids="parrots" names="parrots" refid="the-parrot">
Clearly, a \label{parrots} for the second (indirect) target would be useless and
\phantomsection\label{the-parrot} at the place of the second target would be wrong.
¹The bad news is, that the section <target> in the Doctree documentation is "to be completed".
@gmilde I posted my comment prior to having seen yours, sorry.
What would be the test results with this simple change?
All our tests pass. With Docutils 0.21.2 a complex file such as our own sphinx.tex (the docs at https://www.sphinx-doc.org/ in latex format) remains identical. With Docutils HEAD, it avoids two non-sensical additions of labels to places where cross references occur to other locations in the docs.
Adding a label at the place of the indirect target referencing an element somewhere else seem simply wrong.
Yes. No wonder now why I have never been able to understand Sphinx handling of Docutils reference nodes as my primary reading is via the latex writer... I never could really understand what refid means.
This explains, why a label is written since Docutils a
element with non-empty "ids". It does not explain, why the id is replaced by an auto-generated id!
I don't know. By the way in testing I have noticed the switch to Docutils HEAD causes a shift in the numbering of footnotes on Sphinx side for the LaTeX output, but this is probably benign (not checked PDF, no time). However I now don't have time to check closer. Sphinx has a rather evolved footnote transform for the latex builder, well-documented, (in particular those arising from our configutation value latex_show_urls = 'footnote').
[...] The second chunk has similarly
latex:latex-elementslabel after an internal cross reference, and this should not exist and did not exist with using Docutils 0.21.2 because for some reason the node did not have therefid.
- `refid`
+ `ids`
??
So a quick resolve would be
s/pass/returnbut I would prefer confirmation by some competent maintainer.
I wrote that prior to having seen @gmilde suggestion to do that :flushed: :sweat:.
So a quick resolve would be
s/pass/returnbut I would prefer confirmation by some competent maintainer.I wrote that prior to having seen @gmilde suggestion to do that 😳 😓.
I actually suggest replacing the complete chunk
if isinstance(prev_node, nodes.reference) and node['refid'] == prev_node['refid']:
# a target for a hyperlink reference having alias
pass
else:
add_target(node['refid'])
with return.
However, I am not a "competent Sphinx maintainer" but a Docutils developer with only rudimentary knowledge of Sphinx.
[...] The second chunk has similarly
latex:latex-elementslabel after an internal cross reference, and this should not exist and did not exist with using Docutils 0.21.2 because for some reason the node did not have therefid.
refid
ids??
yes your are absolutely right sorry about that:
- with 0.21.2:
<target names="['subsection']" refid="section1"/> - with r10151:
<target ids="['id1']" names="['subsection']" refid="section1"/>
AFAIK, the correct way would be to ignore all indirect targets, i.e.
if 'refid' in node: return
@gmilde If I do that, and using 0.21.2 I get 1 LaTeX failed test:
FAILED tests/test_builders/test_build_latex.py::test_duplicated_labels_before_module - AssertionError: duplicated label: 'index:label-1a'
The failure is not due to a duplicated label but to an absent one, but the way the test assert is set-up it produces above output.
The duplicate label issue was handled at #11333. It fixed #11093.
The failure of the test (at tests/roots/test-latex-labels-before-module) is good as it reveals on further examination that in this test no labels whatsoever are produced in situation such as this one involving the Sphinx module directive.
.. _label_1a:
.. module:: module1
text
Refer label_1a_
I added the last line, so that here an attempt is made to refer label_1a but the latter never made it to the output .tex file.
To recap simply doing s/pass/return/ passes all tests, but going more radical and ignoring all nodes having a refid is not a possibility for us.
Ping @picnixz
AFAIK, the correct way would be to ignore all indirect targets [...]
[...] If I do that, and using 0.21.2 I get 1 LaTeX failed test: [...]
AssertionError: duplicated label: 'index:label-1a'[...]. The failure is not due to a duplicated label but to an absent one, but the way the testassertis set-up it produces above output.
To find out if this is a positive change or a regression, we would need to know which label is missing, where it is defined in the source, how it is represented in the Doctree.
The duplicate label issue was handled at #11333. It fixed #11093.
Looking at #11093, I see a Doctree starting with
<document source="/home/picnix/any/sphinx/doc/testfile.rst">
<target refid="somelabel">
<target ids="module-testmodule somelabel" ismod="True" names="somelabel">
<index entries="('pair',\ 'module;\ testmodule',\ 'module-testmodule',\ '',\ None)">
The first <target> is indirect and should be ignored by the writer as it is already "resolved" (i.e. its ids and names attributes were moved to the next node by a transform).
The second <target> has two IDs. These should become "labels" in LaTeX, because there is no refid or refuri attribute. This means references with refid="somelabel" as well as refernces with refid="module-testmodule" should point to the place of this target.
It seems like the attempted fix for the duplicate targets issue got it wrong and kept the label-writing on the first target but removed it from the second...
The failure of the test (at tests/roots/test-latex-labels-before-module) is good as it reveals on further examination that in this test no labels whatsoever are produced in situation such as this one involving the Sphinx
moduledirective... _label_1a: .. module:: module1 text Refer label_1a_I added the last line, so that here an attempt is made to refer
label_1abut the latter never made it to the output.texfile.
Yes.
The reason is, that in rare cases the references.PropagateTargets transform fails to
transfer the ids from the target to the next element — the <target> element keeps its ids attribute and does not gain a refid attribute.
In HTML, these empty targets are converted to empty anchors: <div id="label-1a"></div>. In LaTeX, they should generate a \label{label-1a}.
To recap simply doing
s/pass/return/passes all tests, but going more radical and ignoring all nodes having arefidis not a possibility for us.
I disagree: given the Doctree elements
<target refid="label-1a">
<target ids="module1 label-1a" ismod="True" names="label_1a">
the writer should produce no label for the first target, but two labels for the second target.
Does, by any chance, the Sphinx LaTeX writer still remove some ids from specific <target> nodes?
@jfbu Looking at https://github.com/sphinx-doc/sphinx/blob/master/sphinx/writers/latex.py line 1852 and following:
if node.get('ismod', False):
# Detect if the previous nodes are label targets. If so, remove
# the refid thereof from node['ids'] to avoid duplicated ids.
...
ids.remove(prev['refid']) # type: ignore[index]
...
I see that the Sphinx LaTeX writer indeed does it the wrong way round:
given the Doctree elements
<target refid="label-1a">
<target ids="module1 label-1a" ismod="True" names="label_1a">
the writer does produce one label for the first target and one label for the second target.
This means that in order to fix the issue, the patch should not only be "more radical and ignoring all nodes having a refid" but also remove the complete "temporary fix" in the if node.get('ismod', False): ... block.
Thanks @gmilde.
For background I am pasting here some successive versions of (only the last lines of) visit_target().
16 years ago at https://github.com/sphinx-doc/sphinx/commit/6a17e7d1a69cb90545eba66c707810d161aad724
if 'refuri' in node:
return
if node.get('refid'):
add_target(node['refid'])
for id in node['ids']:
add_target(id)
6 years ago at https://github.com/sphinx-doc/sphinx/commit/b7c679626a652a4abe35633415c8665d1886950c
if 'refuri' in node or 'refid' in node or 'refname' in node:
# skip indirect targets (external hyperlink and internal links)
return
for id in node['ids']:
add_target(id)
6 years ago at https://github.com/sphinx-doc/sphinx/commit/9ecf45e6baa86dddfe38491ba5003b7ee9747e52
if 'refuri' in node:
return
if node.get('refid'):
prev_node = get_prev_node(node)
if isinstance(prev_node, nodes.reference) and node['refid'] == prev_node['refid']:
# a target for a hyperlink reference having alias
pass
else:
add_target(node['refid'])
for id in node['ids']:
add_target(id)
2 years ago at 61576516d4ed6c4e10d38f959e8625201abcc7d3
if 'refuri' in node:
return
if 'anonymous' in node:
return
if node.get('refid'):
prev_node = get_prev_node(node)
if isinstance(prev_node, nodes.reference) and node['refid'] == prev_node['refid']:
# a target for a hyperlink reference having alias
pass
else:
add_target(node['refid'])
# Temporary fix for https://github.com/sphinx-doc/sphinx/issues/11093
# TODO: investigate if a more elegant solution exists (see comments of #11093)
if node.get('ismod', False):
# Detect if the previous nodes are label targets. If so, remove
# the refid thereof from node['ids'] to avoid duplicated ids.
def has_dup_label(sib: Element | None) -> bool:
return isinstance(sib, nodes.target) and sib.get('refid') in node['ids']
prev: Element | None = get_prev_node(node)
if has_dup_label(prev):
ids = node['ids'][:] # copy to avoid side-effects
while has_dup_label(prev):
ids.remove(prev['refid'])
prev = get_prev_node(prev)
else:
ids = iter(node['ids']) # read-only iterator
else:
ids = iter(node['ids']) # read-only iterator
for id in ids:
add_target(id)
Now, trying to interpret your advice about fixing the issue and removing the "temporary fix", I tried with [^1]
if 'refuri' in node:
return
if 'anonymous' in node:
return
if node.get('refid'):
prev_node = get_prev_node(node)
if (
isinstance(prev_node, nodes.reference)
and node['refid'] == prev_node['refid']
):
# a target for a hyperlink reference having alias
return
else:
add_target(node['refid'])
for id in node['ids']:
add_target(id)
and all our latex tests pass (actually test_build_latex.py but surely also the mark-up ones [^3]) both with 0.21.2 and r10153 [^2] :muscle:. The sphinx.tex obtained from building our own docs is unchanged when using 0.21.2; with Docutils at r10153 it is improved compared to using our current master and building with Docutils r10153 into not having the two non-sensical label insertions. I think this goes into the direction of your suggestions of "a more radical" trimming of our visit_target() removing all of the "temporary fix". I tried trimming more with no success.
[^1]: there is more at beginning of visit_target(), skipped in all the above reproduced lines.
[^2]: the latter with 48 warnings though, where there were none with 0.21.2.
[^3]: I now have tried the full test suite using Docutils 0.21.2 and Python 3.13 at my locale and got only FAILED tests/test_extensionstest_ext_autodoc_importer.py::test_import_native_module_stubs - ModuleNotFoundError: No module named 'fish_licence.halibut' ===== 1 failed, 2332 passed, 10 skipped, 11 warnings in 184.00s (0:03:03) ======
off-topic but when I see this:
\index{module@\spxentry{module}!module1@\spxentry{module1}}\index{module1@\spxentry{module1}!module@\spxentry{module}}\phantomsection\label{\detokenize{index:label-1a}}\phantomsection\label{\detokenize{index:label-1b}}\phantomsection\label{\detokenize{index:module-module1}}
we should be more economical with the \phantomsection's if possible. Only one suffices, and multiple \label's can follow it.