Allow `singlehtml` to embed all assets into a single html file

Open choldgraf opened this issue 3 years ago • 3 comments

Context

The singlehtml builder combines all pages of the HTML documentation into a single HTML file. This is really useful if you want to quickly share your documentation with other people in a form they can open directly on their computer.

However, one challenge is that singlehtml builds still come with many external assets on the local filesystem. For example, CSS files, JS files, etc. are packaged alongside the single HTML file. This undercuts the benefit of "a single HTML file", because you now have a collection of files that must be shipped together. That makes it much harder to, for example, attach the documentation to an email.

Suggestion

Make it possible (or document how if it is already possible) to make singlehtml builds also include the asset content inside of the HTML file, so that it is a single, self-contained file that can be easily sent to somebody else, and can open anywhere without any external files (at least, local ones).

References

Some folks have also requested this in the Jupyter Book project here:

  • https://github.com/executablebooks/jupyter-book/issues/1046

choldgraf avatar Jul 21 '22 04:07 choldgraf

I'm not sure how to implement that, but it would be very nice if we could do it.

tk0miya avatar Jul 23 '22 18:07 tk0miya

I've been thinking about this for some time as well.

Side note

On a side note, I believe that this is what many people actually want when they distribute PDFs. We like PDFs for the following features:

  1. Contained in a single file
  2. Everybody can view them
  3. Looks the same on every device, even when printed
  4. Sometimes: Can't easily be modified

However, I argue that number 3 is actually not what most people want, because in my experience PDFs are rarely printed these days. On a dynamic medium like a screen, you get fixed page breaks and line breaks instead of text that re-flows. That is particularly noticeable when viewing PDFs on a small screen like a phone. Even in a desktop browser, they either open in a separate application or in the browser's inferior PDF renderer. CTRL-F and other things behave slightly differently, copying text is not guaranteed to work as expected, etc. Or try copying a large table that spans several pages. It's just a mess. PDFs often fail to look consistent across different media anyway, in particular when printing.

So if we're willing to loosen number 3, we might as well use HTML. Everybody has a browser, and browsers have perfected flowing the text according to the screen size. They can even be interactive. Copying a large table could be a single click, you could even choose between CSV, XLSX or whatever.

I suppose ebooks were kind of meant to solve some of these issues, but almost nobody has an ebook viewer and few are willing or able to install software just to view a document. Also, ebooks are not interactive by design.

Unfortunately, HTML was never meant to be self-contained. In fact, it obviously makes a lot of sense to only have references to external resources, because you can save space and bandwidth. I tried some experiments anyway.

Idea

First, I don't like the SingleHTML builder, because larger documents feel sluggish in the browser. Even on my system, the Sphinx docs built as a single HTML file are sluggish. The DOM tree is too big.

So my general idea is this:

  1. Embed everything in HTML files. Images as data URIs, CSS inside style tags and JS inside script tags. Note that CSS files can include other resources like fonts and other CSS files.
  2. Fix all links that don't point to external resources. This will require some extra JS.
  3. Now that we are left with HTML files only, put them in a dictionary structure. Encode this as JSON and write it in a new HTML file. I call this the virtual file tree.
  4. The files are now huge because everything is embedded multiple times, so include pako.js and unzip the structure.
  5. Dynamically build the document inside a single page application using JavaScript.
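
Steps 1, 3 and 4 of this list can be sketched in a few lines of Node.js. This is only an illustration under stated assumptions, not the code from the fork: the function names (`toDataUri`, `packTree`, `unpackTree`) and the sample pages are hypothetical, and Node's built-in `zlib` stands in for what a library such as pako would do in the browser.

```javascript
const zlib = require("zlib");

// Step 1: turn a binary asset into a data URI that can replace a src/href value.
// (Hypothetical helper; the MIME type has to be supplied by the caller.)
function toDataUri(buffer, mimeType) {
  return `data:${mimeType};base64,${buffer.toString("base64")}`;
}

// Step 4 (build side): JSON-encode the virtual file tree, compress it, and
// base64-encode the result so that it can be stored as text in an HTML file.
function packTree(tree) {
  const json = Buffer.from(JSON.stringify(tree), "utf8");
  return zlib.deflateSync(json).toString("base64");
}

// Step 4 (viewer side): the inverse of packTree.
function unpackTree(packed) {
  const json = zlib.inflateSync(Buffer.from(packed, "base64"));
  return JSON.parse(json.toString("utf8"));
}

// Step 3: the "virtual file tree" is a plain object keyed by page path.
// The pages here are made-up placeholders with an embedded SVG image.
const logo = toDataUri(
  Buffer.from("<svg xmlns='http://www.w3.org/2000/svg'/>"),
  "image/svg+xml"
);
const tree = {
  "index.html": `<html><body><img src="${logo}"><a href="usage.html">Usage</a></body></html>`,
  "usage.html": "<html><body><h1>Usage</h1></body></html>",
};

const packed = packTree(tree);
const restored = unpackTree(packed);
console.log(Object.keys(restored));
```

Keeping the tree as a flat object keyed by path makes lookups in the viewer trivial; a nested structure would work as well.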

However, I ran into issues with Sphinx. It's not clear to me when in the build process we could embed everything. Some static resources are built from templates at the very end (pygments.css is rendered from a jinja template, for example), so we can't embed them with the Translator class or before that. I noticed that some files like graphviz.js appear to be copied even after finish().

And the templates rely on the context being available, so they can't be copied to the build directory earlier. Some templates also include assets directly (example) which would need to be adjusted if we wanted to do this.

Since I had experimented with this earlier using mostly JavaScript, I applied the same approach here. Besides working with all templates, it has the advantage of including each asset in the virtual file tree only once (except CSS files).

You can try it yourself using my fork. Build with `make html O='-D html_embed_assets=True'`. Two examples (the Sphinx docs and the MyST Parser docs) are attached here (zipped, as GitHub does not allow HTML attachments).

Issues

  • You can see the font not being immediately loaded, which causes jumps.
  • It interferes with other JavaScript files that expect to be loaded only once. Here they can be loaded more than once and then attempt to re-declare consts, for example. The pydata-sphinx-theme.js also appears to interfere with scrolling.
  • I decided to use two anchors (#) in URLs to make it clear that the pages are "virtual", but using onclick events it might not actually be needed. We could perhaps subtract the base URL differently.
  • Probably tons of other issues that arise from the assumption that HTML assets are distributed.
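
To make the "two anchors" idea concrete, here is a minimal sketch of how internal links could be rewritten to virtual URLs and parsed again. The scheme `#<page path>#<anchor>` and all function names are assumptions made for illustration; the fork may handle this differently.

```javascript
// Links with a scheme (https:, mailto:, ...) or a protocol-relative prefix
// point outside the bundle and are left untouched.
function isExternal(href) {
  return /^([a-z][a-z0-9+.-]*:|\/\/)/i.test(href);
}

// Rewrite an internal link found on `currentPage` into a virtual URL of the
// form "#<page path>#<anchor>". The base URL is a dummy that is only used to
// resolve relative paths such as "../index.html".
function rewriteHref(href, currentPage) {
  if (isExternal(href)) return href;
  const [target, anchor = ""] = href.split("#");
  const page = new URL(target, `http://virtual/${currentPage}`).pathname.slice(1);
  return anchor ? `#${page}#${anchor}` : `#${page}`;
}

// Split a virtual URL fragment back into the page path and the in-page anchor.
function parseVirtualHash(hash) {
  const body = hash.startsWith("#") ? hash.slice(1) : hash;
  const sep = body.indexOf("#");
  if (sep === -1) return { page: body, anchor: "" };
  return { page: body.slice(0, sep), anchor: body.slice(sep + 1) };
}

console.log(rewriteHref("../index.html#intro", "usage/install.html"));
console.log(rewriteHref("#options", "usage/install.html"));
console.log(parseVirtualHash("#usage/install.html#options"));
```

In a viewer, `parseVirtualHash` would typically run on the page's current fragment to decide which entry of the virtual file tree to render and where to scroll.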

So do you think this approach is worthwhile to explore further?

AdrianVollmer avatar Sep 04 '22 16:09 AdrianVollmer

I fixed a few minor issues and forked the v4.5.0 branch:

Also, I created a few more proofs of concept (file sizes are the HTML file compared to the output of the html builder):

The search and equations don't work at all yet, which I'd really like to fix. I expect other scripts to break as well, but it might work well enough for some folks. I'm thinking about releasing it as an extension.
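
One way scripts can break in this setup is the re-declaration problem noted earlier in the thread: a script with a top-level `const` cannot be evaluated twice in the same global scope. The sketch below reproduces that with Node's `vm` module and shows one possible mitigation, wrapping the script in a function scope. The script content and the `runTwice` helper are hypothetical, and this is not necessarily how the extension deals with the problem.

```javascript
const vm = require("vm");

// Hypothetical theme script with a top-level const, as many real scripts have.
// `loads` is a global counter supplied by the sandbox below.
const themeScript = "const settings = { theme: 'dark' }; loads += 1;";

// Evaluate the same code twice in one shared global scope and report how
// often it ran to completion and the last error, if any.
function runTwice(code) {
  const sandbox = { loads: 0 };
  const context = vm.createContext(sandbox);
  let error = null;
  for (let i = 0; i < 2; i++) {
    try {
      vm.runInContext(code, context);
    } catch (e) {
      error = e;
    }
  }
  return { loads: sandbox.loads, error };
}

// Loaded as-is, the second evaluation fails because `settings` already exists.
const plain = runTwice(themeScript);

// Wrapped in a function, each evaluation gets its own scope for `settings`.
const wrapped = runTwice(`(function () { ${themeScript} })();`);

console.log(plain.loads, plain.error && plain.error.name);
console.log(wrapped.loads, wrapped.error);
```

Wrapping changes semantics for scripts that rely on their top-level declarations being visible to other scripts, so it is not a general fix.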

AdrianVollmer avatar Sep 11 '22 17:09 AdrianVollmer

I solved the issues. Since the code uses almost no Sphinx internals, it is much more natural to release it as an extension: https://github.com/AdrianVollmer/Zundler

AdrianVollmer avatar Oct 12 '22 16:10 AdrianVollmer

@AdrianVollmer Thanks a lot for creating this extension! I was excited to see this happening and I wanted to give it a try on Read the Docs. My idea was to expose it as another output format that people could download for offline usage. I created a small demo at https://test-builds.readthedocs.io/en/sphinx-docs-zundler/ that builds the Sphinx documentation and exposes it. However, for some reason when opening the documentation file generated with zundler, it opens a lot of pop-ups and tries to download the files. Am I doing something wrong?

(I'm happy to open an issue in the Zundler repository for this if you prefer.)

humitos avatar Oct 18 '22 10:10 humitos

Thanks for checking it out!

I built the Sphinx docs here and it works flawlessly from what I can tell: https://adrianvollmer.github.io/Zundler/output/sphinx.html Not sure what is creating these blob: links, because I'm not seeing those in the document.

I believe we should have this discussion in an issue of the Zundler repo. ~~Please include the commit hash of the Sphinx repo from which you build the docs so I can try to reproduce the issue. Also, does grep -r 'blob:' Sphinx/doc yield any results?~~ Never mind, I see that you included everything on that page. I'll take a look.

AdrianVollmer avatar Oct 18 '22 11:10 AdrianVollmer

Since none of the previous suggestions worked for me (@AdrianVollmer: Zundler crashed with a segmentation fault), I created an inlined version of the singlehtml output with the following Makefile:

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = .
BUILDDIR      = public

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

singlehtml:
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
	node inline.js
	mv $(BUILDDIR)/singlehtml/inline.html $(BUILDDIR)/singlehtml/index.html
	

serve: html
	python -m http.server -d $(BUILDDIR)/html
 

The following inline.js:

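// Inline local CSS, JS, and images into the singlehtml output so that the
// result is one self-contained HTML file.
const fs = require("fs");
const path = require("path");
const posthtml = require("posthtml");
const posthtmlInlineAssets = require("posthtml-inline-assets");

// Must match BUILDDIR in the Makefile ("public").
const root = path.join(__dirname, "public", "singlehtml");
const htmlFile = path.join(root, "index.html");
const outFile = path.join(root, "inline.html");

const html = fs.readFileSync(htmlFile, "utf8");

// Asset URLs may carry a query string (e.g. "?v=..."); drop it to get the
// path on disk.
const stripQuery = (url) => url.split("?")[0];

posthtml([
  posthtmlInlineAssets({
    cwd: root,
    root: root,
    errors: "throw",
    transforms: {
      // Each resolve() returns the asset path, or a falsy value to skip the node.
      style: {
        resolve(node) {
          return (
            node.tag === "link" &&
            node.attrs &&
            node.attrs.rel === "stylesheet" &&
            node.attrs.href &&
            stripQuery(node.attrs.href)
          );
        },
      },
      script: {
        resolve(node) {
          return (
            node.tag === "script" &&
            node.attrs &&
            node.attrs.src &&
            // Leave scripts loaded from a remote host untouched.
            !/^(https?:)?\/\//.test(node.attrs.src) &&
            stripQuery(node.attrs.src)
          );
        },
      },
      image: {
        resolve(node) {
          return (
            node.tag === "img" &&
            node.attrs &&
            // Check that src exists before calling startsWith() on it.
            node.attrs.src &&
            !node.attrs.src.startsWith("data:") &&
            node.attrs.src
          );
        },
      },
    },
  }),
])
  .process(html)
  .then((result) => fs.writeFileSync(outFile, result.html))
  .catch((err) => {
    // Exit non-zero so that the make target fails if an asset cannot be inlined.
    console.error(err);
    process.exit(1);
  });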

And the following package.json:

{
  "devDependencies": {
    "posthtml-inline-assets": "^3.1.0",
    "postcss": "^8.4.31",
    "posthtml": "^0.16.6",
    "posthtml-cli": "^0.10.0"
  },
  "dependencies": {}
}

This is based on @choldgraf's comments in the corresponding Jupyter Book issue (Executable Books project).
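
A quick way to check whether the resulting file is really self-contained is to scan it for `src`/`href` values that still point to local files. The following is a rough, regex-based sketch rather than a real HTML parser; the function name `findLocalReferences` and the sample markup are hypothetical.

```javascript
// List URLs of scripts, images, and stylesheets that are neither inlined
// (data: URIs) nor loaded from a remote host. Regex-based, so only a rough check.
function findLocalReferences(html) {
  const refs = [];
  for (const [tag, name] of html.matchAll(/<(script|img|link)\b[^>]*>/gi)) {
    // Only stylesheet links are assets; rel="index", rel="next", etc. are skipped.
    if (name.toLowerCase() === "link" && !/\brel\s*=\s*["']stylesheet["']/i.test(tag)) {
      continue;
    }
    const attr = /\b(?:src|href)\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (attr && !/^(data:|https?:|\/\/)/i.test(attr[1])) {
      refs.push(attr[1]);
    }
  }
  return refs;
}

// Made-up sample resembling a page before inlining.
const sample = [
  '<link rel="stylesheet" href="_static/basic.css?v=1">',
  '<link rel="index" href="genindex.html">',
  '<script src="_static/doctools.js"></script>',
  '<img src="data:image/png;base64,AAAA">',
  '<script src="https://cdn.example.org/x.js"></script>',
].join("\n");

console.log(findLocalReferences(sample));
```

To check a real build, the contents of `public/singlehtml/index.html` could be read with `fs.readFileSync` and passed to the function; an empty result suggests that no local assets remain.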

rdbisme avatar Oct 27 '23 12:10 rdbisme