Crystal docs generation for large API shards creates many huge files
Currently with Crystal docs, if you have a large shard it will generate huge HTML files.
If we take the mjblack/win32cr shard and its development/docs branch as an example, it will create a lot of HTML files that are 10 MB each. I have not been able to finish generating the docs, as the size of the docs directory grows to hundreds of gigabytes (not an exaggeration) and fills up my Windows development VM.
Discussing with @straight-shoota on Discord, he mentioned that each nav entry is about 250 bytes and the shard has 38k documented types.
So the problem is that there are 38k types, each of them generates a separate file and each file contains a nav menu that links to all 38k files. The quadratic effect leads to about 350 GB of data just for the nav menu. 🤯 (in that context, the size of the actual content is absolutely negligible).
This is already quite noticeable at smaller scale, though.
stdlib has ~700 documented types, and the nav menu totals about 110 kB. That's a total of 77 MB duplicate content in the stdlib API docs. For comparison, the total size of stdlib API docs is 127 MB (including 11 MB of index.json).
If we could avoid this duplication, we'd cut the size of stdlib API docs to less than half.
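To make the quadratic effect concrete, here is a rough back-of-the-envelope calculation using only the figures quoted above (38k types at about 250 bytes per nav entry, and about 700 stdlib types with a nav menu of about 110 kB). The function name is made up for illustration, and the numbers are estimates, not measurements.

```python
# Back-of-the-envelope estimate of sidebar duplication.
# All input figures are the approximate ones quoted in this thread.

def nav_bytes_total(num_types: int, bytes_per_entry: int) -> int:
    """Each of num_types pages embeds one nav entry per type."""
    return num_types * num_types * bytes_per_entry

# win32cr: ~38k documented types, ~250 bytes per nav entry
win32cr = nav_bytes_total(38_000, 250)
print(f"win32cr nav total: {win32cr / 1e9:.0f} GB")  # 361 GB

# stdlib: ~700 pages, each carrying a ~110 kB nav menu
stdlib_nav = 700 * 110_000
stdlib_total = 127_000_000
print(f"stdlib nav total: {stdlib_nav / 1e6:.0f} MB")  # 77 MB
print(f"stdlib minus duplicated nav: {(stdlib_total - stdlib_nav) / 1e6:.0f} MB")  # 50 MB
```

The win32cr result is in the same ballpark as the "about 350 GB" estimate, and the stdlib remainder of roughly 50 MB is indeed less than half of 127 MB.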
Related: #11427
Ideally, I'd like to use something like https://turbo.hotwired.dev/ to keep the sidebar as a separate, reusable HTML file which can be embedded into every other file client side. Such a change would be almost entirely behind-the-scenes and shouldn't be perceivable to the user. Instead of turbo, we could also use a simple custom JavaScript for embedding, of course.
Unfortunately, this mechanism requires an HTTP server and won't work when files are opened locally.
Perhaps we could use an <iframe> solution for local files? Links in the sidebar would need target="_parent" for that.
It would be nice if that could work as an implicit fallback, so the same build can be used either locally or via HTTP.
Another solution could be to reduce the content of the sidebar menu to only the links that are directly accessible from the current page. In order to navigate into a different tree, you'd need to first navigate to the root pages in that path. That means the little triangle would no longer open the submenu on the current page, but load the associated page where the submenu would be expanded.
This would obviously degrade UX. So it's probably not a good idea.
An iframe would just work for every scenario.
There could be a simple JS helper to avoid a page reload (do nothing if the location scheme is file:), but as much as I like Turbo, it sounds like overkill for the job.
I submitted a PoC for an iframe sidebar: #15734
There might still be some minor issues, but I have managed to get everything working across frames (keyboard navigation was a bit of a challenge, but seems to be good now).
Coincidentally, I stumbled across this technique, which could work very well for this: https://www.filamentgroup.com/lab/html-includes/#another-demo%3A-including-another-html-file
<iframe src="sidebar.html" onload="this.before((this.contentDocument.body).children[0]);this.remove()"></iframe>
Compared to #15734, this would be much more similar to the current setup and would keep the diff smaller (no changes to CSS and JS necessary).
This is such a smart trick 😈
Still, maybe we should add just enough CSS so that, with JS disabled or unavailable, the sidebar still renders properly with some styling inside the iframe.
Yeah, we should make sure it works without JS.
Hm, the <iframe> idea is smart, but there's a problem: relative path resolution.
We use relative paths, so internal links always work, regardless of which path the docs directory is mounted at. URLs inside sidebar.html are relative to ./, URLs in Some/Nested/Type.html are relative to ./Some/Nested/. When we copy links from the sidebar.html iframe into the document of Some/Nested/Type.html they'll be broken because they refer to different bases.
We need to harmonize the base paths.
When we copy the iframe content, we could modify the href attributes in all its links to match the base path of the parent document. This should work, but it requires extra effort on every page load. When considering the OP use case with thousands of types (and thus thousands of links in the sidebar), it might become a performance issue.
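As a sketch of what that per-link rewrite amounts to, the snippet below rebases root-relative sidebar hrefs onto the directory of the current page. It is written in Python purely to illustrate the path logic; the real thing would run as JavaScript in the browser. The function name and the sample paths are hypothetical, and fragments or query strings are not handled.

```python
import posixpath

def rebase_href(href: str, page_path: str) -> str:
    """Rewrite a root-relative href so that it resolves from page_path.

    Both arguments are paths relative to the docs root, e.g.
    href="Other/Thing.html" and page_path="Some/Nested/Type.html".
    """
    page_dir = posixpath.dirname(page_path) or "."
    return posixpath.relpath(href, start=page_dir)

# Hypothetical sidebar entries, all relative to the docs root.
sidebar_links = ["index.html", "Some.html", "Some/Nested/Type.html", "Other/Thing.html"]

for href in sidebar_links:
    print(href, "->", rebase_href(href, "Some/Nested/Type.html"))
```

This work has to be repeated for every link on every page load, which is where the concern about tens of thousands of sidebar entries comes from.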
Certainly, it should be easier if we need no extra effort on the client side. So going the other way, we could make all links (including those in Some/Nested/Type.html) relative to ./ and set the base path for the entire document to ./ (<base href="../../">). That could even simplify the code because currently for each link we calculate the relative path between origin and target page. With this change, all links would always be relative to ./.
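A minimal sketch of that idea, assuming pages are identified by their path relative to the docs root: the only per-page value is the base href, and every link target can then be written as the same root-relative string on every page. The helper name and the example.org host are made up for illustration; `urljoin` only stands in for the browser's URL resolution.

```python
from urllib.parse import urljoin

def base_href(page_path: str) -> str:
    """Return the <base href> value that points page_path back to the docs root."""
    depth = page_path.count("/")
    return "../" * depth or "./"

# Hypothetical deployment URL of one page.
page_url = "https://example.org/docs/Some/Nested/Type.html"
root_url = urljoin(page_url, base_href("Some/Nested/Type.html"))

# The same root-relative href works from any page once <base> is set.
target = urljoin(root_url, "Other/Thing.html")
print(base_href("Some/Nested/Type.html"))  # ../../
print(target)  # https://example.org/docs/Other/Thing.html
```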
And this almost works as expected, but the base href also affects fragment-only links such as <a href="#foo">, which would resolve to ../../#foo instead of the current page.
To fix this, we must make truly all URLs relative to the base href ../../, i.e. include the path to the current page even in fragment links. That's easy for URLs created by the doc generator (we generate a couple of fragment links). But we must also take care of URLs in user content, i.e. the rendered doc comments. This includes both explicit links in the markdown source and the auto-generated links to headlines.
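The fragment problem and its fix can be reproduced with plain URL resolution. In this sketch `urljoin` stands in for what the browser does with a base href of ../../; the example.org host and the `qualify_fragment` helper are hypothetical and only illustrate the rewrite, not how the doc generator would implement it.

```python
from urllib.parse import urljoin

page_path = "Some/Nested/Type.html"
page_url = "https://example.org/docs/" + page_path  # hypothetical host
base_url = urljoin(page_url, "../../")              # effect of <base href="../../">

# A fragment-only link resolves against the base, not the current page.
broken = urljoin(base_url, "#foo")
print(broken)  # https://example.org/docs/#foo

def qualify_fragment(href: str, page_path: str) -> str:
    """Prefix fragment-only hrefs with the current page's root-relative path."""
    return page_path + href if href.startswith("#") else href

fixed = urljoin(base_url, qualify_fragment("#foo", page_path))
print(fixed)  # https://example.org/docs/Some/Nested/Type.html#foo
```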
We need to process and relativize them. This task is related to #13753, because it would be another reason for using https://github.com/straight-shoota/sanitize: it includes a URISanitizer which can relativize all URIs to a base path (including fragment-only URLs).
Btw. considering the "works without JS" aspect: The current sidebar is very limited without JS support. Expanding entries requires JS. So it's basically just a list of the top-level types and there's no easy path to navigate to nested items in the namespace.
Can't we flatten the file structure? Instead of having /Some/Nested/Type.html, make it /Some.Nested.File.html, using any character that's not valid in a type identifier?
I suppose that could be possible. But it might break some things 🤔
I would prefer a solution with less drastic changes, though. And using the root directory as base URL on all files should totally be doable. We just need to figure out how to properly adapt URLs in user-generated content, while avoiding dependency hell.
We could even use Some::Nested::File.html as file name. That would be identical to the actual path name in Crystal. Maybe that's not a bad idea after all... 🤔
I don't know if there was a particular reason for building a hierarchy in the file system.
Some/Nested/File.html better aligns with the structure of source files. However, it's only a convention to put that type into some/nested/file.cr.
Some issues: it would break every external link to the docs, we'd need more URL rewrites, a version switch would need to know both schemes, ...
Yeah, I'd definitely prefer to stay with the current file structure if we can get <base> working.
Considering the challenge with normalizing links, this might not be a big problem at all.
✅ We can easily transform URLs generated by the doc generator itself.
# Implicit link to `Foo` class.
✅ We can easily transform URLs in markdown links by hooking into the markdown renderer.
# [bar](#foo)
❌ The only trouble is HTML links in user-generated content. Processing those would require parsing HTML.
# <a href="#foo">bar</a>
But this seems like a very rare edge case. Not sure if anyone ever uses this. Usually, markdown syntax should be preferred. The only reason to use HTML syntax in a markdown document would be to implement something that markdown does not support. For API docs content, that seems rather rare. If there is some odd case where this is needed, it should be acceptable to add another workaround for the URL scope on top of using HTML as a workaround.