rfcs icon indicating copy to clipboard operation
rfcs copied to clipboard

Mergeable rustdoc cross-crate info

Open EtomicBomb opened this issue 1 year ago • 26 comments

This RFC is to enable merging cross-crate information (like the search index, index.html, source files index) from separate output directories, so that rustdoc can run in parallel in non-cargo build systems.

Currently, rustdoc updates these cross-crate files with the information from a new crate by reading the old version, adding the information for the current crate, and writing the same file back. This causes issues in build systems that disallow multiple build actions targeting the same output files. Cargo supports cross-crate information because it doesn't have this restriction, Buck2 and Bazel do not support cross-crate information, and Fuchsia has a doc step that runs rustdoc serially on 2700 crates.

Supporting parallel rustdoc with CCI as a native step in Buck2, Bazel, and GN (Fuchsia + Chrome) would require mergable cross-crate information, which is what this proposal is about. It adds new flags, --include-info-json, --write-info-json, --merge=none|read-write|write-only and --include-rendered-docs, to opt-in to these new features.

Rendered

EtomicBomb avatar Jun 20 '24 20:06 EtomicBomb

cc @rust-lang/rustdoc

GuillaumeGomez avatar Jun 20 '24 21:06 GuillaumeGomez

I'm very in favor of this! In addition to the parallel building use case, I think it will make it much more feasible to support documentation of namespaced crates as a group. We discussed this a bit on Zulip last month. And merging the parts after doc generation is generally a nicer design.

For implementation details: I'd like for the cross crate info to be written in JSON, not JS, and for the merge step to generate the JS. The JS as it stands is just a very thin callback wrapper around JSON. We do this so docs can be loaded from file:// URLs (where fetching JSON files is not supported). But we don't need the merge step to understand the JS.

Some bikeshedding about the flags: I don't like the names --write-rendered-cci and --read-rendered-cci because they're long, because they contain an acronym that will be opaque to all but rustdoc implementers (cci), and because it's not clear to me that all four combinations of (true, false) values are valid or useful.

It seems to me that the merge step is effectively a subcommand, except that since rustdoc doesn't have subcommands it's triggered by a combination of --write-rendered-cci and --fetch-parts. Is that right?

Could you add a section to the RFC about compatibility? For compatibility on disk, I'm assuming we won't support merging cross crate info that was generated by disparate versions of rustdoc.

As an alternate design, what about this:

  • unconditionally start generating a crate-info.json in the root of any crate's doc build.
  • crate-info contains all the different kinds of potentially cross crate info (src files, implementers, etc)
  • merge step reads crate-info.json from N different crates and generates .js in the appropriate places
  • New flag: --merge=[auto|none|<list of paths>]
  • auto is the default value, and causes the equivalent of the current behavior. At the end of each run, rustdoc takes a lock and reads/writes from the out dir to do merge.
  • none is the future default. Individual rustdoc invocations do no merging; it is up to the build system to do a final merge step. - <list of paths> causes the merge step to be invoked and no other work to be done. Causes most other rustdoc flags to be ignored.

jsha avatar Jun 22 '24 15:06 jsha

I would prefer not to put large files in the target/doc folder that aren't actually used by the frontend. People should be able to upload that folder to their webhost without having to filter out subfolders to avoid wasting space.

Instead, I'd design the CLI with two parameters:

  • --write-info-json=PATH: When supplied, the crate info json (search index, search desc shards, search type param names, foreign trait impls, type aliases, and source code file tree) gets written to a JSON file supplied by PATH. Cargo will use a path that's somewhere in target/, but not in the target/doc folder, so publishers don't accidentally upload files to their web host that don't actually need to be there. Other build systems can use a location that happens to be convenient.
  • --include-info-json=PATH1,PATH2,...: When --write-info-json isn't supplied, the info files are written the same way they are now. If --include-info-json is supplied, rustdoc will additionally include the data from the supplied paths.

notriddle avatar Jun 22 '24 16:06 notriddle

@jsha, I love that there are more use cases than we initially anticipated!

I agree that the intermediate cross-crate info should be written in JSON, and for the merge step to generate JS (and HTML where applicable)! The current proposal / WIP writes intermediate files as JSON (src-files-js, search-index-js, search-desc, all-crates, crates-index, type-impl, trait-impl) with --parts-out-dir=PATH. It renders the JS + HTML versions upon --write-rendered-cci (default). It reads to and appends to CCI in the doc root upon --read-rendered-cci (default), to match the current behavior of rustdoc. I'll put the JSON files into a single crates-info.json and change the names of the options so they make more sense.

Here are what the combinations of --read-rendered-cci and --write-rendered-cci meant. Yes, only some of them are useful, so I agree that a --merge=[] will make more sense.

  • --read-rendered-cci=false --write-rendered-cci=false --parts-out-dir=target/doc.parts: write the pre-rendered cross-crate information parts files. Merge these with another crates version with another invocation of rustdoc --fetch-parts <...>
  • --read-rendered-cci=false --write-rendered-cci=false (leaving out --parts-out-dir): Does not read or write any cross-crate information. The user is only interested in item docs.
  • --read-rendered-cci=false --write-rendered-cci=true: write the cross-crate information for the current crate to src-files.js, search-index.js, type.impl/**/*.js, ignoring the files that are already there. This user may have a hermetic build system that monitors reads.
  • --read-rendered-cci=true --write-rendered-cci=false: not useful - we only read CCI to append to existing CCI
  • --read-rendered-cci=true --write-rendered-cci=true: rustdoc's current behavior

I initially conceptualized merging as a subcommand, but you could also see it as an incremental enhancement: if you're calling rustdoc on one of your binaries or libraries, you still might want to link-in the crate-info.json for other crates, without an additional rustdoc invocation. For pure merges though, we could make something that looks more like a merge subcommand: maybe a third party tool that effectively merges into a dummy crate with an empty lib.rs?

It seems like --include-info-json=PATH1,PATH2,... is similar to what this proposal means by --fetch-parts. And --write-info-json=PATH is like --parts-out-dir, and that crate-info.json is like the concatenation of {src-files-js, search-index-js, search-desc, all-crates, crates-index, type-impl, trait-impl}.

Based off of @notriddle and @jsha's feedback, I will update the RFC + WIP with these items:

  • Replace {src-files-js, search-index-js, search-desc, all-crates, crates-index, type-impl, trait-impl} into a single file crate-info.json
  • Rename --fetch-parts to the suggested --include-info-json
  • Rename --parts-out-dir to the suggested --write-info-json
  • Create a flag --merge=auto to replace --read-rendered-cci=true --write-rendered-cci=true
  • Create a flag --merge=none to replace --read-rendered-cci=false --write-rendered-cci=false
  • Create a flag --merge=no-read to replace --read-rendered-cci=false --write-rendered-cci=true
  • Describe the compatibility promises: crate-info.json is only stable within a version of rustdoc. The CCI after this proposal is not byte-for-byte identical with those produced by previous versions of rustdoc,but should have the same effect when executed by a browser (modulo sorting order).

I'll have to work on these suggestions on Monday, but definitely give me more feedback. Thank you for engaging with this proposal!

EtomicBomb avatar Jun 22 '24 19:06 EtomicBomb

Updated the RFC with the items mentioned in my previous comment. I chose the name --merge=write-only instead of --merge=no-read

EtomicBomb avatar Jun 25 '24 00:06 EtomicBomb

I updated the RFC + WIP implementation so that --include-info-json and --write-info-json take a path directly to a crate-info.json file.

The existing tests/rustdoc* suite passes my WIP implementation. That hopefully means that no changes will need to be made to tools that use rustdoc.

I hacked compiletest to pass --merge=none + --write-info-json to the auxiliary crates and --merge=write-only + --include-info-json to the main crate for every test. The tests/rustdoc* (except rustdoc-gui - have not run these) tests pass under this mode too. These tests were for my confidence, as I'm not looking to make any changes to compiletest.

EtomicBomb avatar Jun 28 '24 00:06 EtomicBomb

If there are no further blockers for this, can we put this RFC into FCP?

djkoloski avatar Jul 15 '24 20:07 djkoloski

Sorry it took so long to come back to this. So overall, the idea behind the RFC is to maximize parallel work (which is great). Just wanted to note that in such a case:

A
|
B
|
C

Here A depends on B which depends on C. If you update B, you just regenerate its content and you're done. However if you update A afterward, you need to recompile B to generate its .rmeta files. So the gain is not really that big in this case, if any.

Anyway, being able to build documentation in case there are a lot of crates will still be a big improvement. The blocker will then become the same as when compiling: building the crate and all its dependencies.

GuillaumeGomez avatar Jul 16 '24 09:07 GuillaumeGomez

I added an --include-rendered-docs=<path/to/target/doc/extern-crate-name> flag that will copy the docs from an externally documented crate to the current --out-dir.

EtomicBomb avatar Jul 16 '24 18:07 EtomicBomb

Updated the RFC and my implementation with --include-rendered-docs, a version key in crate-info.json, and --merge=read-write.

EtomicBomb avatar Jul 17 '24 17:07 EtomicBomb

Sounds all good for me. I think we can start the FCP then.

@rfcbot fcp merge

GuillaumeGomez avatar Jul 19 '24 16:07 GuillaumeGomez

Team member @GuillaumeGomez has proposed to merge this. The next step is review by the rest of the tagged team members:

  • [x] @GuillaumeGomez
  • [x] @Manishearth
  • [ ] @Nemo157
  • [x] @aDotInTheVoid
  • [x] @camelid
  • [ ] @fmease
  • [x] @jsha
  • [x] @notriddle

Concerns:

  • ~~complexity~~ resolved by https://github.com/rust-lang/rfcs/pull/3662#issuecomment--1992227884

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

rfcbot avatar Jul 19 '24 16:07 rfcbot

@rfcbot concern complexity

I'm very much in favor of the broad brushstrokes of this idea, especially given the real-world improvements it would have on the UX for non-Cargo users. However, I'm concerned about the complexity of the interface we would expose to control rustdoc's CCI behavior. This makes things more involved for users of rustdoc's CLI (e.g., maintainers of build systems), and potentially restricts future enhancements to the implementation and features of CCI. I would much prefer if we could stabilize a simpler, more declarative interface that directly solves the problem of parallel builds with independent output directories that support a lightweight final merge operation.

Specifically, it seems like with this proposal there are three modes of invoking rustdoc:

  1. Document any crate using a shared directory with global mutable CCI that is continuously updated. The CCI file(s) contain all crates' data in a flat structure. This is exactly how rustdoc currently works.
  2. Document a regular (i.e., non-root/non-index) crate using a dedicated HTML output directory (target/doc/my-crate[^cargo], like currently) and a dedicated "parts" output directory (target/doc.parts/my-crate). No CCI data nor rendered HTML output is included from other crates.
  3. Document a root/index crate that serves merely to merge the docs from each crate in the workspace. Only rendered HTML and finalized CCI is generated (into a target/doc/my-root-crate folder). No incremental/parts CCI is generated (i.e., no target/doc.parts/my-root-crate). CCI parts are included into the finalized CCI from the appropriate dependency crates. (Note that my description here is a bit different compared to the RFC text because here the dependencies' rendered HTML would not be copied. It was unclear to me why that would be needed.)

It seems like these modes could be accomplished using a reduced set of flags (names here are just examples):

  1. --merge=shared, --out-dir=target/doc/my-crate
  2. --merge=none, --out-dir=target/doc/my-crate, --parts-out-dir=target/doc.parts/my-crate
  3. --merge=finalize, --out-dir=target/doc, --include-parts-dir=target/doc.parts/my-crate1 --include-parts-dir=target/doc.parts/my-crate2 ...

This would potentially also be better if we didn't require an explicit index crate, though it could also cause further complexity. Either way, we shouldn't preclude removing the requirement for an explicit index crate in the future.

[^cargo]: I'm using the naming convention that Cargo would use, but of course these would all be based on what the user passed in CLI flags.

camelid avatar Jul 31 '24 03:07 camelid

About the complexity concern, I am on board with most of the suggestions! As far as I can tell, the diff from the current RFC is relatively minimal?

  • Rename --merge=none|read-write|write-only to --merge=none|shared|finalize
  • --write-info-json=target/doc.parts/my-crate/crate-info.json -> --parts-out-dir=target/doc.parts/my-crate
  • --include-info-json=target/doc.parts/my-crate/crate-info.json -> --include-parts-dir=target/doc.parts/my-crate
  • Clarify the three modes / workflows in the text of the RFC
  • Remove the copying functionality

Note that my description here is a bit different compared to the RFC text because here the dependencies' rendered HTML would not be copied. It was unclear to me why that would be needed

Users would need to copy and merge rendered HTML if they document their crates in separate --out-dirs, and are trying to generate an index that includes docs from a crate and its dependencies. Typically, you use the same target/doc directory for all crates, but I have been writing the rfc to support a scenario where users are potentially documenting their crates in separate directories on the same filesystem, or even separate machines. As a final step, to generate the static site, the rendered HTML must be copied and merged.

Guillaume suggested that rustdoc should be batteries-included, and have a minimal impl of copying rendered docs to the --out-dir (--include-rendered-docs).

I think we should keep the --include-rendered-docs flag. Users that want to do more advanced merging (only copying when a stamp file is out of date, copying across networks, hard linking, reflink, etc.) will not provide --include-rendered-docs and just do it themselves. The flag serves as a demonstration of the kind of distributed workflows that this RFC is intended to support. It also will be useful in situations where users do not want to / cannot use the external tools.

I will make the other changes in the morning!

EtomicBomb avatar Jul 31 '24 05:07 EtomicBomb

I changed the following to simplify the interface as suggested by @camelid:

  • Rename --merge=none|read-write|write-only to --merge=none|shared|finalize
  • --write-info-json=target/doc.parts/my-crate/crate-info.json -> --parts-out-dir=target/doc.parts/my-crate
  • --include-info-json=target/doc.parts/my-crate/crate-info.json -> --include-parts-dir=target/doc.parts/my-crate
  • Clarify the section that talks about the three modes of operation in the text of the RFC

b0dc37d moves the section about --extern-html-root-url from unresolved questions to the reference-level explanation, since we do not intend to change its behavior or provide new options for generating relative URLs to externally documented crates. --extern-html-root-url seems to serve all use cases considered by the RFC.

The commit also removes the section about no_emit_shared since all flag names that have been discussed are better than that.

EtomicBomb avatar Jul 31 '24 17:07 EtomicBomb

Thank you for the review Manish! I made some small changes in response!

  • Summarize current behavior of rustdoc in Reference-level explanation
  • Flag summary in Guide-level explanation
  • Say that crate-info.json is stable because build systems need to rely on it
  • In example: t,s,i -> trait-crate, struct-crate, index-crate
  • In example: T, S -> Trait, Struct

EtomicBomb avatar Aug 02 '24 06:08 EtomicBomb

Thank you for the review @camelid! I addressed the new feedback:

  • Remove the --include-rendered-docs flag
  • --parts-out-dir restricted to --merge=none
  • --include-parts-dir restricted to --merge=finalize
  • crate-info.json -> crate-info to emphasize opaqueness
  • Explain that doc.parts is just the suggested name
  • doc.parts contents edition stability promises are replaced with a statement saying that rustdoc will make a reasonable effort to continue just having crate-info files

EtomicBomb avatar Aug 03 '24 05:08 EtomicBomb

First off, thanks for this well written and thorough doc. I found myself nodding along as I read it.

I realized I left an unstated assumption in my initial comment: I think this style of docs generation should definitely be adopted by Cargo. I think doing so may help with come of @camelid's complexity concerns, which I share. I think if we integrate this into Cargo and deprecate the old mode, we will actually reduce rustdoc complexity overall.

Specifically, as @camelid says:

  1. Document any crate using a shared directory with global mutable CCI that is continuously updated. The CCI file(s) contain all crates' data in a flat structure. This is exactly how rustdoc currently works.

Our goal should be to support this mode temporarily, while we work on updating Cargo to support a "merge" step for cargo doc. Once that's in, we can deprecate this mode and eventually remove it.

One implication of Cargo adopting this mode is that we need rustdoc to be able to merge without creating an index crate, so that cargo doc in a workspace can work correctly. This is mentioned under Unresolved questions: Index crate?. I think it should be resolve in favor of "no index crate."

Note that there is prior art for a rustdoc mode that isn't crate specific: --emit=unversioned-shared-resources, which is used by docs.rs to avoid having a copy of main.js, rustdoc.css, and friends for each crate. Incidentally, I think we should modify Cargo to use this mode as well, so we have more consistency across different use cases. So Cargo would build each crate separately, then run two finishing steps, one to merge the cross-crate info and one to emit the shared resources. Am I right in assuming Buck2 and Bazel use --emit=unversioned-shared-resources already?

I proposed that rustdoc should unconditionally emit the cross-crate info into each crate's documentation directory. @notriddle replied:

I would prefer not to put large files in the target/doc folder that aren't actually used by the frontend. People should be able to upload that folder to their webhost without having to filter out subfolders to avoid wasting space.

I think the cross crate info file will always be small relative to the overall size of a crate's docs, and it's okay if it's stored alongside the rest of the docs. The benefit we get is that we don't need a new flag --parts-out-dir. And build systems have correspondingly less complexity integrating with rustdoc.

jsha avatar Aug 12 '24 23:08 jsha

I think the cross crate info file will always be small relative to the overall size of a crate's docs, and it's okay if it's stored alongside the rest of the docs. The benefit we get is that we don't need a new flag --parts-out-dir. And build systems have correspondingly less complexity integrating with rustdoc.

If you don't expect the size overhead to be a problem, then this makes sense, and I support making the CLI simpler.

notriddle avatar Aug 13 '24 04:08 notriddle

@jsha: I proposed that rustdoc should unconditionally emit the cross-crate info into each crate's documentation directory.

We should expect the size of crate-info to grow in proportion with the size of the crate's docs, since it includes all cross-crate trait implementations and the entire search index. For doc indexes with many crates, we should expect this file to be very large. @dtolnay describes how a search index for all of the crates in the doc index maintained by his team would be several gigabytes in size.

I think that build systems would strongly prefer for intermediate files to be separate from the output files. In Bazel, the rustdoc --out-dir is currently represented with ctx.actions.declare_directory. If we force doc.parts into the --out-dir, we can't tell Bazel that the --out-dir is actually a mix of intermediates and outputs. The names of the children of the --out-dir aren't known prior to rustdoc being run, because of #![crate_name = "..."]. The --out-dir has to be entirely opaque to the build system.

I think --parts-out-dir and --include-parts-dir are required in order to separate the intermediate state and the outputs.

EtomicBomb avatar Aug 13 '24 04:08 EtomicBomb

Thank you for the thoughtful feedback, @jsha!

I changed the following in response:

  • New directory: doc.parts/<crate-name>/crate-info -> New directory: doc.parts.
  • Number of times --parts-out-dir, --merge, and --include-parts-dir can be provided are mentioned.
  • Describe why doc.parts is a directory.
  • Description of multiple parallel invocations of rustdoc.
  • Describe why to run rustc before rustdoc.
  • Fix typos.

Tomorrow, I will:

  • Resolve in favor of removing the requirement for an index crate. The biggest obstacle here is actually --enable-index-page, because the scaffold + markdown rendering aspect of the index page depends on the html::render::Context, the edition, and some other info gathered from the compiler and having a target crate. There are some implementation details to work out but I think it should be fine.
  • I will also describe that this RFC is intended to enable the future deprecation of --mode=shared.

EtomicBomb avatar Aug 13 '24 05:08 EtomicBomb

I updated the RFC to resolve in favor of not needing an index crate to merge docs in a workspace, as explained above.

EtomicBomb avatar Aug 16 '24 02:08 EtomicBomb

We should expect the size of crate-info to grow in proportion with the size of the crate's docs, since it includes all cross-crate trait implementations and the entire search index. For doc indexes with many crates, we should expect this file to be very large. @dtolnay describes how a search index for all of the crates in the doc index maintained by his team would be several gigabytes in size.

That's the merged search index, though - not each crate's search index.

I think that build systems would strongly prefer for intermediate files to be separate from the output files.

This is a good point. It would perhaps make sense to consider all of the regular rustdoc outputs to be intermediates, which are merged in the merge step. But maybe that introduces an excessive number of filesystem operations? At any rate, in the interests of unblocking the work I'm willing to concede this point and say the extra flag is okay.

jsha avatar Aug 20 '24 14:08 jsha

@rfcbot reviewed @rfcbot resolve complexity

camelid avatar Aug 21 '24 18:08 camelid

:bell: This is now entering its final comment period, as per the review above. :bell:

rfcbot avatar Aug 21 '24 18:08 rfcbot

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

rfcbot avatar Aug 31 '24 18:08 rfcbot

We should discuss this before stabilization. In the meantime, this RFC needs to get actually merged, so I'll do that now. :)

camelid avatar Sep 21 '24 21:09 camelid

Huzzah! The @rust-lang/rustdoc team has decided to accept this RFC.

To track further discussion, subscribe to the tracking issue here: rust-lang/rust#130676

camelid avatar Sep 21 '24 21:09 camelid