sphinx
sphinx copied to clipboard
[intersphinx] support for arbitrary title names
Fix #11711.
Since the logic is different, this required a complete update on how intersphinx inventories are represented. Before, each line in an inventory was of the form
name [SPACE] domain [COLON] objtype [SPACE] priority [SPACE] URI [SPACE] display name
However, if name contains a whitespace (which is the case in the linked issue), then it becomes extremely hard to have a "good" regex. Now, we represent records as follows:
priority ([COLON] name_length) name [SPACE] domain [COLON] objtype [SPACE] URI [SPACE] display name
where the name's length (and the colon) is only included if the name contains a space. In particular, the inventory's size is not changed if the name has no space inside. Otherwise, we increase a bit the inventory size with a 64-bit integer for those entries.
Quick update after discussing with someone:
- If a project builds its documentation with the new format, then any other project using that project needs to be built using the latest sphinx version otherwise they won't be able to read the new inventory format.
- I think this bug already exists for inventories built using v1 vs v2 so it's not really a problem I'd say. However, we could possibly have a configuration value for the format of the inventory we are actually expecting.
technically, this one is compatible with the current version but... I'm not really sure it should land in 7.3 (when would 7.3 even release?). So I'll convert it into a draft and we'll merge it in 8.x (after dropping the intersphinx format, though it's unrelated).
I don't think this change to the inventory semantics is necessary.
The most recent regex I implemented for sphobjinv
, which incorporated a couple of tweaks from https://github.com/bskinn/sphobjinv/issues/181 and https://github.com/bskinn/sphobjinv/issues/256, has been working in the wild for two years now.
It successfully imports all objects from the minimal example provided by OP at #11711. After a pip install sphobjinv
in a virtualenv:
>>> import sphobjinv as soi
>>> inv = soi.Inventory(url='https://thosse.github.io/Playground_GH_Pages/objects.inv')
>>> len(inv.objects)
18
>>> from pathlib import Path
>>> len(Path("objects.txt").read_text().splitlines()) - 4 # Minus 4 b/c the header lines aren't objects
18
That decoded inventory, for reference:
# Sphinx inventory version 2
# Project: Pages Playground
# Version:
# The remainder of this file is compressed using zlib.
genindex std:label -1 genindex.html Index
index std:doc -1 index.html Intersphinx Minimal Example
index:1 2 3 fails std:label -1 index.html#fails 1 2 3 Fails
index:1 fails 1 std:label -1 index.html#fails-1 1 Fails 1
index:1 works std:label -1 index.html#works 1 Works
index:123 works std:label -1 index.html#id1 123 Works
index:fails 1 2 3 std:label -1 index.html#fails-1-2-3 Fails 1 2 3
index:fails 1 2 fails std:label -1 index.html#fails-1-2-fails Fails 1 2 Fails
index:fails 1 fails 2 std:label -1 index.html#fails-1-fails-2 Fails 1 Fails 2
index:fails fails 1 std:label -1 index.html#fails-fails-1 Fails Fails 1
index:fails fails 2 fails fails std:label -1 index.html#fails-fails-2-fails-fails Fails Fails 2 Fails Fails
index:headers without digits work std:label -1 index.html#headers-without-digits-work Headers without Digits work
index:intersphinx minimal example std:label -1 index.html#intersphinx-minimal-example Intersphinx Minimal Example
index:works 1 std:label -1 index.html#works-1 Works 1
index:works 1 works std:label -1 index.html#works-1-works Works 1 Works
modindex std:label -1 py-modindex.html Module Index
py-modindex std:label -1 py-modindex.html Python Module Index
search std:label -1 search.html Search Page
It's of course possible that my regex is still missing some key features. But I've spent a lot of time working on this reverse-engineering, and I think it's pretty solid based on the lack of tickets in the sphobjinv
tracker, which is lately seeing ~40K downloads per month.
There are a few more issues in sphobjinv
related to the semantics of the object-reference lines in the v2 objects.inv
. I'd be happy to collect those together if it would be helpful.
@bskinn Thank you for your help but there are some titles such as word word1:word2 integer
that would fail your regex since word1:word2
would be incorrectly matched as domain:reftype
. While I don't have a specific example of such title, I cannot exclude that possibility (that's the reason why I gave up on using regexes). However, would you be willing to check that this regex could also extend your regex (and handle my weird case)?
pattern = re.compile(r'''^
(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+?)
\s+
(?P<domain>[^\s:]+):(?P<reftype>\S+)
\s+
(?P<prio>-?\d+)
\s+?
(?P<uri>\S*)
\s+
(?P<dispname>.+?)
\r?
$''', re.M | re.X)
I would have wanted to use negative lookbehinds but only .NET regex parser supports variable-width patterns.
There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.
I'd be glad to have it. It could help a lot for #12204.
but there are some titles such as
word word1:word2 integer
that would fail your regex sinceword1:word2
would be incorrectly matched asdomain:reftype
.
Indeed, that is a point of failure.
I can take a look at your regex sometime soon, sure, but---I had an idea. What if we changed the paradigm of the format a bit further?
The key hangup as far as I can tell is that the single-line object format doesn't allow for an unambiguous separator character sequence, given the possible domain (not domain
) of name
s and dispname
s.
What if an uncompressed v3 inventory were something like this:
# Sphinx inventory version 3
# Project: MyProject
# Version: 0.1.0
# Object count: 4
# Uncompressed inventory MD5: abcdef123789654
# The remainder of this file is compressed using zlib.
word1:word2 1
std
label
1
foo/bar.html
-
word1 word2:word3 2
std
label
1
foo/bar.html
-
word1 1 word2:word3 2
std
label
1
foo/bar.html
-
foobar
rst
directive:option
1
baz/quux.html
-
This format would provide <NEWLINE><SPACE><SPACE>
as a delimiter between fields, and <NEWLINE><NEWLINE>
as a delimiter between objects. (The ability to use <NEWLINE>
as part of the delimiter is the key feature here, as none of the fields for a given object reference permit newlines, as best I know.
This format would take a lot more vertical space when decompressed, but ... so what? In nearly all uses, humans won't be reading it, and it'll be zlib-compressed anyways. Even in the few cases where someone is looking directly at it, it still seems quite readable to me.
And, even though there are extra characters in these delimiters, the zlib compression should work quite effectively on them.
(I also tossed an object count and and MD5 hash into the header to improve the options for validity checking.)
What do you think?
What if we changed the paradigm of the format a bit further?
Ideally yes: it's better to change the format so that we don't have possible ambiguities. An alternative for a simpler visual (without new lines) is to encode the length of the title (and there will be no need for a regex, or a fancy separator that could in the future exist (though I doubt newlines will ever be a valid title)).
I should check (but I don't have time) the exact specs of DEFLATE in stream mode but I'm pretty sure that the size shouldn't blow up (could you check the bytes size of the compressed inventories with and w/o newlines as separators and with titles with whitespaces?)
Definitely true, including the name length as you propose is probably about the most minimal change achievable, and wouldn't dramatically change the flow of line parsing.
Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:
>>> import sphobjinv as soi
>>> inv = soi.Inventory(url="https://docs.python.org/3/objects.inv")
>>> import zlib
>>> ( len_v2 := len(zlib.compress(inv.data_file(), 9)) ) # Compress the whole thing for simplicity
136182
>>> def newline_fmt(do):
... return f"{do.name}\n {do.domain}\n {do.role}\n {do.priority}\n {do.uri}\n {do.dispname}"
...
>>> file_newlines = b'\n'.join(inv.data_file().splitlines()[:4]) + b'\n' + ('\n\n'.join(newline_fmt(o) for o in inv.objects)).encode()
>>> print(file_newlines.decode()[:245])
# Sphinx inventory version 2
# Project: Python
# Version: 3.12
# The remainder of this file is compressed using zlib.
CO_FUTURE_DIVISION
c
member
1
c-api/veryhigh.html#c.$
-
METH_CLASS
c
macro
1
c-api/structures.html#c.$
-
>>> ( len_nls := len(zlib.compress(file_newlines, 9)) )
138880
>>> len_nls - len_v2
2698
>>> len_nls/len_v2
1.019811722547767
Impact is likely greater for smaller inventories, due to the smaller number of repeats of the \n
and \n\n
motifs. But, for smaller inventories, the size is already not a problem.
Re your regex, I think you have a typo in the first line:
(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+)?
Is that trailing ?
supposed to be inside that final )
? Where it's located now, I read it as making the entire title
group optional.
Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.
Ah, yes thank you, it's a typo (I did my regex this morning and I already forgot about it)
Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:
I see. If we include the title length, I think we'll probably have less (in the end, it's as if you had a longer title with at most 3 characters (because, who in their right mind would have a title of more than 1000 characters)).
If we include the title length, I think we'll probably have less
Yes, agreed, yours is much more concise.
because, who in their right mind would have a title of more than 1000 characters
Yessss, goodness.
I like your proposal far better than continuing work to find a more robust regex. (I rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.)
rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.
This is clearly something that I'd agree with. As such I will close this PR and make a new one. For a more precise parsing, I think we can even encode the length of each field instead and completely get rid of any regex (this will save us a lot of work!). Note that we can even drop the spaces in the format ! (although we could want to keep them for testing purposes). So, I have two suggestions:
- encode the length of each field on a separate line, e.g.,
(name_len, domain_len, reftype_len, prio (this is an integer so it can directly be included!), uri_len)
. And the line following after would be the current line that we have. - encode the length of each field at the beginning of each line (same as above, but leave the priority integer where it is currently).
I'm not sure which one would be the most compact but I like the first one because tests will be easier to write (and debug). I'm not sure if doubling the number of lines would have a direct impact.
However, we need to synchronize whatever decision we make with #12204 to avoid creating an unusable format. Jakob's suggested that each domain is responsible for dumping and loading their own intersphinx fragment and I suggested using a similar construction as in ELF where each domain would have a dedicated section where they would write whatever they want and they would know what to do. I think, we can introduce this line-encoded format as a "default" encoding if a domain does not have a specific way to encode its information or if it supports the default logic.
I'd be glad to have it. It could help a lot for #12204
Ok, here's what I turned up, in no particular order:
- Possible colon in role:
- https://github.com/bskinn/sphobjinv/issues/256
- https://github.com/bskinn/sphobjinv/pull/258
- Empty dispname is NOT possible
- https://github.com/bskinn/sphobjinv/issues/190
- https://github.com/bskinn/sphobjinv/pull/189
- This PR also closes the following three issues, 181/182/183
- Priority must be an integer
- https://github.com/bskinn/sphobjinv/issues/182
- URI can be absent
- https://github.com/bskinn/sphobjinv/issues/183
- Names can have spaces (this is not news 😅)
- https://github.com/bskinn/sphobjinv/issues/181
- Encodings
- https://github.com/bskinn/sphobjinv/issues/104
- https://github.com/bskinn/sphobjinv/issues/103#issuecomment-628690714
- In the header, it's possible for each of
project
andversion
to be blank- https://github.com/bskinn/sphobjinv/issues/85
- Why
sphobjinv
doesn't support v1objects.inv
files- https://github.com/bskinn/sphobjinv/issues/69
- Discussion of automatic directive -> role transformation in
suggest
output- https://github.com/bskinn/sphobjinv/issues/234
I also have an entire test module in sphobjinv
dedicated to testing values that do/don't survive passage through sphinx.util.inventory.InventoryFile
. It's very likely that there are errors here, but they're thorough/correct enough that they handle much of what's out in the wild.