sphinx [intersphinx] support for arbitrary title names

Fix #11711.

Since the logic is different, this required a complete update on how intersphinx inventories are represented. Before, each line in an inventory was of the form

name [SPACE] domain [COLON] objtype [SPACE] priority [SPACE] URI [SPACE] display name

However, if name contains a whitespace (which is the case in the linked issue), then it becomes extremely hard to have a "good" regex. Now, we represent records as follows:

priority ([COLON] name_length) name [SPACE] domain [COLON] objtype [SPACE] URI [SPACE] display name

where the name's length (and the colon) is only included if the name contains a space. In particular, the inventory's size is not changed if the name has no space inside. Otherwise, we increase a bit the inventory size with a 64-bit integer for those entries.

Feb 03 '24 15:02 picnixz

Quick update after discussing with someone:

If a project builds its documentation with the new format, then any other project using that project needs to be built using the latest sphinx version otherwise they won't be able to read the new inventory format.
I think this bug already exists for inventories built using v1 vs v2 so it's not really a problem I'd say. However, we could possibly have a configuration value for the format of the inventory we are actually expecting.

Feb 20 '24 09:02 picnixz

technically, this one is compatible with the current version but... I'm not really sure it should land in 7.3 (when would 7.3 even release?). So I'll convert it into a draft and we'll merge it in 8.x (after dropping the intersphinx format, though it's unrelated).

Mar 14 '24 11:03 picnixz

I don't think this change to the inventory semantics is necessary.

The most recent regex I implemented for sphobjinv, which incorporated a couple of tweaks from https://github.com/bskinn/sphobjinv/issues/181 and https://github.com/bskinn/sphobjinv/issues/256, has been working in the wild for two years now.

It successfully imports all objects from the minimal example provided by OP at #11711. After a pip install sphobjinv in a virtualenv:

>>> import sphobjinv as soi
>>> inv = soi.Inventory(url='https://thosse.github.io/Playground_GH_Pages/objects.inv')
>>> len(inv.objects)
18
>>> from pathlib import Path
>>> len(Path("objects.txt").read_text().splitlines()) - 4  # Minus 4 b/c the header lines aren't objects
18

That decoded inventory, for reference:

# Sphinx inventory version 2
# Project: Pages Playground
# Version: 
# The remainder of this file is compressed using zlib.
genindex std:label -1 genindex.html Index
index std:doc -1 index.html Intersphinx Minimal Example
index:1 2 3 fails std:label -1 index.html#fails 1 2 3 Fails
index:1 fails 1 std:label -1 index.html#fails-1 1 Fails 1
index:1 works std:label -1 index.html#works 1 Works
index:123 works std:label -1 index.html#id1 123 Works
index:fails 1 2 3 std:label -1 index.html#fails-1-2-3 Fails 1 2 3
index:fails 1 2 fails std:label -1 index.html#fails-1-2-fails Fails 1 2 Fails
index:fails 1 fails 2 std:label -1 index.html#fails-1-fails-2 Fails 1 Fails 2
index:fails fails 1 std:label -1 index.html#fails-fails-1 Fails Fails 1
index:fails fails 2 fails fails std:label -1 index.html#fails-fails-2-fails-fails Fails Fails 2 Fails Fails
index:headers without digits work std:label -1 index.html#headers-without-digits-work Headers without Digits work
index:intersphinx minimal example std:label -1 index.html#intersphinx-minimal-example Intersphinx Minimal Example
index:works 1 std:label -1 index.html#works-1 Works 1
index:works 1 works std:label -1 index.html#works-1-works Works 1 Works
modindex std:label -1 py-modindex.html Module Index
py-modindex std:label -1 py-modindex.html Python Module Index
search std:label -1 search.html Search Page

It's of course possible that my regex is still missing some key features. But I've spent a lot of time working on this reverse-engineering, and I think it's pretty solid based on the lack of tickets in the sphobjinv tracker, which is lately seeing ~40K downloads per month.

There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.

Mar 25 '24 16:03 bskinn

@bskinn Thank you for your help but there are some titles such as word word1:word2 integer that would fail your regex since word1:word2 would be incorrectly matched as domain:reftype. While I don't have a specific example of such title, I cannot exclude that possibility (that's the reason why I gave up on using regexes). However, would you be willing to check that this regex could also extend your regex (and handle my weird case)?

pattern = re.compile(r'''^
	(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+?)
	\s+
	(?P<domain>[^\s:]+):(?P<reftype>\S+)
	\s+
	(?P<prio>-?\d+)
	\s+?
	(?P<uri>\S*)
	\s+
	(?P<dispname>.+?)
	\r?
$''', re.M | re.X)

I would have wanted to use negative lookbehinds but only .NET regex parser supports variable-width patterns.

There are a few more issues in sphobjinv related to the semantics of the object-reference lines in the v2 objects.inv. I'd be happy to collect those together if it would be helpful.

I'd be glad to have it. It could help a lot for #12204.

Mar 26 '24 08:03 picnixz

but there are some titles such as word word1:word2 integer that would fail your regex since word1:word2 would be incorrectly matched as domain:reftype.

Indeed, that is a point of failure.

I can take a look at your regex sometime soon, sure, but---I had an idea. What if we changed the paradigm of the format a bit further?

The key hangup as far as I can tell is that the single-line object format doesn't allow for an unambiguous separator character sequence, given the possible domain (not domain) of names and dispnames.

What if an uncompressed v3 inventory were something like this:

# Sphinx inventory version 3
# Project: MyProject
# Version: 0.1.0
# Object count: 4
# Uncompressed inventory MD5: abcdef123789654
# The remainder of this file is compressed using zlib.
word1:word2 1
  std
  label
  1
  foo/bar.html
  -

word1 word2:word3 2
  std
  label
  1
  foo/bar.html
  -

word1 1 word2:word3 2
  std
  label
  1
  foo/bar.html
  -

foobar
  rst
  directive:option
  1
  baz/quux.html
  -

This format would provide <NEWLINE><SPACE><SPACE> as a delimiter between fields, and <NEWLINE><NEWLINE> as a delimiter between objects. (The ability to use <NEWLINE> as part of the delimiter is the key feature here, as none of the fields for a given object reference permit newlines, as best I know.

This format would take a lot more vertical space when decompressed, but ... so what? In nearly all uses, humans won't be reading it, and it'll be zlib-compressed anyways. Even in the few cases where someone is looking directly at it, it still seems quite readable to me.

And, even though there are extra characters in these delimiters, the zlib compression should work quite effectively on them.

(I also tossed an object count and and MD5 hash into the header to improve the options for validity checking.)

What do you think?

Mar 26 '24 13:03 bskinn

What if we changed the paradigm of the format a bit further?

Ideally yes: it's better to change the format so that we don't have possible ambiguities. An alternative for a simpler visual (without new lines) is to encode the length of the title (and there will be no need for a regex, or a fancy separator that could in the future exist (though I doubt newlines will ever be a valid title)).

I should check (but I don't have time) the exact specs of DEFLATE in stream mode but I'm pretty sure that the size shouldn't blow up (could you check the bytes size of the compressed inventories with and w/o newlines as separators and with titles with whitespaces?)

Mar 26 '24 14:03 picnixz

Definitely true, including the name length as you propose is probably about the most minimal change achievable, and wouldn't dramatically change the flow of line parsing.

Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:

>>> import sphobjinv as soi
>>> inv = soi.Inventory(url="https://docs.python.org/3/objects.inv")
>>> import zlib
>>> ( len_v2 := len(zlib.compress(inv.data_file(), 9)) )  # Compress the whole thing for simplicity
136182
>>> def newline_fmt(do):
...   return f"{do.name}\n  {do.domain}\n  {do.role}\n  {do.priority}\n  {do.uri}\n  {do.dispname}"
...
>>> file_newlines = b'\n'.join(inv.data_file().splitlines()[:4]) + b'\n' + ('\n\n'.join(newline_fmt(o) for o in inv.objects)).encode()
>>> print(file_newlines.decode()[:245])
# Sphinx inventory version 2
# Project: Python
# Version: 3.12
# The remainder of this file is compressed using zlib.
CO_FUTURE_DIVISION
  c
  member
  1
  c-api/veryhigh.html#c.$
  -

METH_CLASS
  c
  macro
  1
  c-api/structures.html#c.$
  -

>>> ( len_nls := len(zlib.compress(file_newlines, 9)) )
138880
>>> len_nls - len_v2
2698
>>> len_nls/len_v2
1.019811722547767

Impact is likely greater for smaller inventories, due to the smaller number of repeats of the \n and \n\n motifs. But, for smaller inventories, the size is already not a problem.

Mar 26 '24 15:03 bskinn

Re your regex, I think you have a typo in the first line:

(?P<title>(?:.|[^\s:]+:\S+\s+-?\d+\s*)+)?

Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.

Mar 26 '24 15:03 bskinn

Is that trailing ? supposed to be inside that final )? Where it's located now, I read it as making the entire title group optional.

Ah, yes thank you, it's a typo (I did my regex this morning and I already forgot about it)

Mar 26 '24 15:03 picnixz

Quick size analysis, switching to my newline-delimited proposal would increase the current Python 3.12 compressed inventory by about 2%:

I see. If we include the title length, I think we'll probably have less (in the end, it's as if you had a longer title with at most 3 characters (because, who in their right mind would have a title of more than 1000 characters)).

Mar 26 '24 15:03 picnixz

If we include the title length, I think we'll probably have less

Yes, agreed, yours is much more concise.

because, who in their right mind would have a title of more than 1000 characters

Yessss, goodness.

I like your proposal far better than continuing work to find a more robust regex. (I rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.)

Mar 26 '24 15:03 bskinn

rather suspect that it might always be possible to find a pathological example for any regex we might devise, and any time spent attempting to refine the regex is likely better spent on something else.

This is clearly something that I'd agree with. As such I will close this PR and make a new one. For a more precise parsing, I think we can even encode the length of each field instead and completely get rid of any regex (this will save us a lot of work!). Note that we can even drop the spaces in the format ! (although we could want to keep them for testing purposes). So, I have two suggestions:

encode the length of each field on a separate line, e.g., (name_len, domain_len, reftype_len, prio (this is an integer so it can directly be included!), uri_len). And the line following after would be the current line that we have.
encode the length of each field at the beginning of each line (same as above, but leave the priority integer where it is currently).

I'm not sure which one would be the most compact but I like the first one because tests will be easier to write (and debug). I'm not sure if doubling the number of lines would have a direct impact.

However, we need to synchronize whatever decision we make with #12204 to avoid creating an unusable format. Jakob's suggested that each domain is responsible for dumping and loading their own intersphinx fragment and I suggested using a similar construction as in ELF where each domain would have a dedicated section where they would write whatever they want and they would know what to do. I think, we can introduce this line-encoded format as a "default" encoding if a domain does not have a specific way to encode its information or if it supports the default logic.

Mar 26 '24 15:03 picnixz

I'd be glad to have it. It could help a lot for #12204

Ok, here's what I turned up, in no particular order:

Possible colon in role:
- https://github.com/bskinn/sphobjinv/issues/256
- https://github.com/bskinn/sphobjinv/pull/258
Empty dispname is NOT possible
- https://github.com/bskinn/sphobjinv/issues/190
- https://github.com/bskinn/sphobjinv/pull/189
  - This PR also closes the following three issues, 181/182/183
Priority must be an integer
- https://github.com/bskinn/sphobjinv/issues/182
URI can be absent
- https://github.com/bskinn/sphobjinv/issues/183
Names can have spaces (this is not news 😅)
- https://github.com/bskinn/sphobjinv/issues/181
Encodings
- https://github.com/bskinn/sphobjinv/issues/104
- https://github.com/bskinn/sphobjinv/issues/103#issuecomment-628690714
In the header, it's possible for each of project and version to be blank
- https://github.com/bskinn/sphobjinv/issues/85
Why sphobjinv doesn't support v1 objects.inv files
- https://github.com/bskinn/sphobjinv/issues/69
Discussion of automatic directive -> role transformation in suggest output
- https://github.com/bskinn/sphobjinv/issues/234

I also have an entire test module in sphobjinv dedicated to testing values that do/don't survive passage through sphinx.util.inventory.InventoryFile. It's very likely that there are errors here, but they're thorough/correct enough that they handle much of what's out in the wild.

Mar 29 '24 16:03 bskinn

sphinx sphinx copied to clipboard

[intersphinx] support for arbitrary title names

sphinx
sphinx copied to clipboard