Add a new ``references`` builder
This PR adds a new builder, `references`, to build a single `references.json`, which provides a mapping for almost* all targets available to reference in the project, including:
- Internal domain objects, generated within the current project
- External domain objects, loaded from the `objects.inv` configured via `intersphinx_mapping` (when using the `sphinx.ext.intersphinx` extension)
* I say almost, because this assumes the objects returned from `domain.get_objects` account for the majority of referenceable items in a project, but there are currently some notable exceptions, like the math domain not returning any (that is for another PR to fix)
This partially addresses #12152, to allow for a clear way for users to understand:
- What targets are available for them to reference
- How to reference these targets
Crucially, the `references.json` includes the mapping of object type to role names (this can be one-to-many),
since a role name, not the object type, is required for the reference syntax.
I would also envisage other tools (like a VS Code extension) utilising this, to provide things like auto-completion and "jump to target/references"
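As a sketch, one possible shape for a `references.json` entry is below; the field names here are illustrative assumptions, not necessarily the schema in this PR. The key point is the one-to-many mapping from an object type to the role names that can reference it:

```python
# Hypothetical sketch of a references.json entry; the actual schema in
# the PR may differ. Field names are illustrative assumptions.
entry = {
    "domain": "py",
    "object_type": "class",
    "roles": ["class", "exc", "obj"],  # one object type, several roles
    "name": "re.Match",
    "docname": "library/re",           # where the target is defined
    "project": "python",               # external project via intersphinx
}

# A role name (not the object type) is what the reference syntax needs:
ref = f":{entry['domain']}:{entry['roles'][0]}:`{entry['name']}`"
```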
Some considerations:
- I feel a builder is really the only way to do this comprehensively; having a standalone CLI (like the current `python -m sphinx.ext.intersphinx`) can only get you so far, before you have to start re-implementing features of a normal Sphinx build (like reading configuration, etc.)
- Perhaps in a follow-up PR I could introduce a complementary CLI that reads the `references.json` and allows users to quickly generate references. Something like `sphinx-ref find 're.Match'` returning `` :class:`~re.Match` `` (i.e. https://github.com/orgs/sphinx-doc/discussions/12152#discussioncomment-8862652)
- There are cases where an object type has no matching role names; this PR is not addressing that (although I want to eventually)
- As I mention in #12152, it would be ideal for this to include not just the document path where a local target is defined, but also the line number (if available). But this is not within the scope of this PR
- Creating a singular `references.json` is probably the simplest way to do this. But it could get rather large for a big project, or one with lots of intersphinx mappings. Is this OK, or do we think another format would be better, like one JSON file per domain / object type, or even something like an SQLite database file?
- The other thing not included in this PR is any additions to the documentation; I could do this here or in a follow-up PR
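For illustration, the proposed `sphinx-ref find` lookup could work roughly like the sketch below. The data shape and the helper function are hypothetical assumptions, not the actual implementation:

```python
# Hypothetical sketch of the proposed `sphinx-ref find` lookup.
# The data shape and helper below are illustrative assumptions.
references = {
    "re.Match": {"domain": "py", "roles": ["class", "exc", "obj"]},
    "re.compile": {"domain": "py", "roles": ["func", "obj"]},
}

def find_reference(name: str) -> str:
    """Return a ready-to-paste cross-reference for *name*."""
    entry = references[name]
    role = entry["roles"][0]  # pick the most specific role
    # The ~ prefix shortens the rendered link text to the last component
    return f":{entry['domain']}:{role}:`~{name}`"

print(find_reference("re.Match"))  # -> :py:class:`~re.Match`
```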
(cc also @webknjaz, as I can't add you as a reviewer)
(test failure is likely because of a side effect)
yeh hmm, works locally (when calling the singular test), but perhaps I can't "piggy-back" on the existing test-basic folder
anyway, whilst I fix that, interested to hear your thoughts
If you are worried about that, use `srcdir=os.urandom(16).hex()` in the sphinx marker. It's a way to isolate your test so that you don't have weird surprises (well, you could still have surprises, but you would have to be VERY unlucky (or lucky, if you were an adversary targeting AES-128)).
A few comments (I'll be less available from now on)
Thanks for the review @picnixz, but perhaps I could nudge you for some quick general feedback on the concept 😅
Do you agree that this is a "good" thing to introduce? Any thoughts on the `references.json` format?
Read through the new references.py builder. I'm weak on some of the technical details and Sphinx internals, so I can't speak strongly there.
But, here are some other thoughts.
Reaction to the 'generate a complete local & remote references list' idea --- +0.25.
It might be helpful having all targets, local or intersphinx, in one artifact? But after thinking about it, I don't think it's very important to me, personally---and, it seems to me the bigger problem is the object-type lossiness of the current v2 objects.inv format. (Or, at least the way in which Sphinx currently builds to that format.)
I think I would rather have better/more accurate information about the targets in my intersphinx-referenced docsets---which would require a new inventory format, as best I figure---than a list of all local and remote references, where the info I get on the remote references in that all-in-one artifact requires as much work to transform into a working cross-reference as the info I can get out of sphobjinv does.
If I'm trying to reference something in another project, I know which project it is, and I don't mind pointing a single-docset tool at that project's docs. (And, there's a good chance I might prefer that single-docset tool if I don't have to mess with an intermediate data file as part of the process.) The 'all in one place' aspect of this may have a broad appeal, but it's less important to me, personally.
Reaction to the layout of references.json --- overall +0.5 or so, with thoughts/caveats.
For automated ingestion of reference data, this schema seems great. :+1:
Coming from a sphobjinv-biased perspective, my primary use case is, "I have this thing X that I want to cross-reference; how do I do that?"
So, from a data mapping perspective, what I want to be able to do with the output of this is to walk from [object name] -> [object reference].
I like the sound of the sphinx-ref find ... tool you proposed, but what happens if it doesn't do the search I want?
The current semantics of references.json are exactly backward for manual REPL exploration: it'll take a beefy, nested list comprehension to search through it for target names.
That said, using the right tool -- jsonpath-ng, say -- probably would make that search relatively straightforward. (Though, it would be more complex if the JSON gets broken up into multiple files.)
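For example, that nested-comprehension search might look like this; the nested structure here is an assumed stand-in for the real `references.json` layout:

```python
# Assumed stand-in for the real references.json layout: domains map to
# object types, which map to roles plus a list of target names.
data = {
    "py": {
        "class": {"roles": ["class", "exc"],
                  "targets": ["re.Match", "re.Pattern"]},
        "function": {"roles": ["func"], "targets": ["re.compile"]},
    },
}

# Walking from a target name back to its roles takes a nested search:
hits = [
    (domain, objtype, info["roles"])
    for domain, objtypes in data.items()
    for objtype, info in objtypes.items()
    if "re.Match" in info["targets"]
]
print(hits)  # -> [('py', 'class', ['class', 'exc'])]
```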
Choice of references.<ext> Format
If there's eventually a sphinx-ref find, I don't think the format of the output matters too much. As long as it's a standard, open format, anybody who wants to can interface with it. Format thoughts:
- JSON would probably be the simplest format
  - Likely the easiest for manual exploration
  - Though the filesize question is real for large docsets, especially given that `references.json` would include all transitive references to `intersphinx` projects
    - All targets from the entire Python docs would be included in every `references.json` built...
- SQLite does seem like a good option, giving a more compact file, and `sqlite3` is in the stdlib
  - Manual exploration would be considerably more cumbersome, though
  - The schema would take some figuring out -- performance isn't a huge issue
    - One giant table, with `domain` and `object_type` columns?
    - One table per `domain`, with `object_type` columns? (Probably best?)
    - One table per `domain`/`object_type` combo? (Probably way too many tables)
- Maybe `tinydb`? SQLite-like, but a document database
  - Not in the stdlib, so it'd be a dependency both for Sphinx and for anyone trying to read it independently
  - But it fits the hierarchical data shape better, and it would be easier for manual exploration
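A minimal sketch of the "one giant table, with `domain` and `object_type` columns" option, using stdlib `sqlite3`; the column names and the comma-separated roles encoding are illustrative, not a proposed final schema:

```python
import sqlite3

# Sketch of the "one giant table" schema option; column names and the
# roles encoding are illustrative, not a proposed final schema.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE refs (
        project TEXT,      -- '' for local, else intersphinx project name
        domain TEXT,
        object_type TEXT,
        name TEXT,
        docname TEXT,
        roles TEXT         -- e.g. comma-separated role names
    )
""")
con.execute("INSERT INTO refs VALUES ('python', 'py', 'class', "
            "'re.Match', 'library/re', 'class,exc')")

# The "which roles can reference X?" question stays a one-line query:
row = con.execute(
    "SELECT domain, roles FROM refs WHERE name = ?", ("re.Match",)
).fetchone()
print(row)  # -> ('py', 'class,exc')
```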
I'll comment tomorrow (for this one, I need a bit of sleep)
A core problem is the use of `domain.get_objects()`. As alluded to in https://github.com/orgs/sphinx-doc/discussions/12152#discussioncomment-8877586, there is an inherent problem in intersphinx in that it assumes it knows how to write and read declared entities from each domain. The reading was mostly delegated to the domains, but the writing has not been yet.
Essentially I think we should figure out this delegation, including a new inventory/references format, before building more on top of the old problematic formats.
Currently get_objects() is used for only two purposes, as far as I can see: creating the index and creating inventories. The former is fine, as the fullname and dispname are only used for display purposes.
For inventories the fullname needs to encode all information about the entity in a string, so it can be loaded in again. This is not convenient for languages like C++ where the scoping information can be rather complex.
If I'm not mistaken, this references builder is very similar to the inventory generation in its use of `get_objects()`.
Since we are talking about a new Intersphinx format, I would like you to also think about how to serialize the entries in the inventories, especially concerning #11932. After reading Jakob's argument, I also think that domains should be responsible for serializing their intersphinx part however they see fit. It could also solve multiple issues that I could not necessarily find when implementing #11932 but if each domain knows how to properly represent their references in intersphinx, it would be better.
In addition, we could change the format of a specific domain (e.g., if there are bugs) without affecting the format of other domains. I suggest using the same approach as for ELF where there is a header section containing the location of each program section. Then each domain would serialize its own intersphinx inventory and intersphinx would only be responsible for merging the parts together. Then, each domain would deserialize its dedicated section and recover its references mapping.
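A toy sketch of the ELF-style layout described above, where a header records each domain section's offset and size, and each domain serializes its own payload however it likes. All format details here (4-byte length prefix, JSON header) are invented for illustration:

```python
import json
import struct

# Toy sketch of the ELF-style layout: a header maps each domain to the
# (offset, size) of its section; domains own their payload encoding.
# The length prefix and JSON header are invented for illustration.
def pack_inventory(sections: dict[str, bytes]) -> bytes:
    header = {}
    body = b""
    for domain, payload in sections.items():
        header[domain] = (len(body), len(payload))  # offset, size
        body += payload
    header_bytes = json.dumps(header).encode()
    # 4-byte header length, then the header, then the concatenated body
    return struct.pack(">I", len(header_bytes)) + header_bytes + body

def read_section(blob: bytes, domain: str) -> bytes:
    """Recover one domain's section without touching the others."""
    (hlen,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + hlen])
    offset, size = header[domain]
    start = 4 + hlen + offset
    return blob[start:start + size]

blob = pack_inventory({"py": b"py-entries", "cpp": b"cpp-entries"})
print(read_section(blob, "cpp"))  # -> b'cpp-entries'
```

Changing one domain's payload format then leaves every other section untouched, which is the property the comment above is after.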
The references builder you are suggesting would be responsible for normalizing each domain's output into a more human-readable format. In the JSON output, you would include "human-readable" entries plus an offset and buffer size pointing to the serialized data in the objects.inv binary file. That way, you can use it to recover a single referenceable entity, and use it in a standalone manner as well.
> Essentially I think we should figure out this delegation, including a new inventory/references format, before building more on top of the old problematic formats.

> Since we are talking about a new Intersphinx format
See https://github.com/orgs/sphinx-doc/discussions/12204