Feature Request: Add support for chunked (multiple file) HTML and HTMLHelp.

Open dm413 opened this issue 4 years ago • 35 comments

It would be useful if Pandoc could produce multiple output files by splitting the output based on section (header) levels. The output files should maintain links across files, and the table of contents should link to all files.

  • Chunked HTML output would produce a set or folder of HTML files. This is useful for generating static websites (for example).

  • HTMLHelp output is a compressed version of chunked HTML specific to Windows. Other tools (such as doxygen) handle this by generating a folder of chunked HTML along with an HTMLHelp project file and a content file. There can also be an index file, but I don't think Pandoc has a built-in concept of index terms, so I would skip this for now. These files are then run through HTMLHelp Workshop, a Microsoft tool that generates the HTMLHelp file.

    HTMLHelp has its own pane for the TOC, generated from the content file. The content file should respect the pandoc toc-depth setting. Since there is a separate TOC pane, the normal TOC at the top of the file should be suppressed by default.

You could also consider adding these as input formats. For chunked HTML, the issues seem to be the order in which to read the files, and making sure the links are handled correctly. For HTMLHelp (on Windows), the HTMLHelp reader can split an HTMLHelp (chm) file into the original discrete files for further processing in the same way as chunked HTML.

Note that the already supported epub format is another version of a chunked html format.

This issue has been raised in the pandoc-discuss mailing list. Various ideas have been proposed, including:

  • Add "Next" and "Previous" links to each HTML output page. This probably needs to be an optional feature.

  • Extend the idea of chunked output to formats other than HTML. For example, individual chapters sent to separate ODT or DOCX files (or RST, markdown, etc.).

dm413 avatar Feb 06 '20 02:02 dm413

This is not really just HTML. You may want to chunk up a large Markdown file into smaller Markdown files too.

bpj avatar Feb 06 '20 08:02 bpj

Hm, maybe the first step would be writing a format-independent function

splitIntoChunks :: FilePath -> Int -> Pandoc -> [(FilePath, Pandoc)]

where the Int parameter is the heading level to split at, and the FilePath is a file path template to be used (e.g. chapter-{{ number }}.html, where {{ number }} will be replaced by the chunk number, or {{ heading }}.html, where {{ heading }} will be replaced by the full heading text (stringified), or {{ identifier }}.html, where {{ identifier }} will be replaced by the identifier on the heading).

This function would split up the document into sections and rewrite any internal links so that they point to the correct paths. Not a hard thing to write.
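To make the proposal concrete, here is a minimal sketch of the splitting and link-rewriting idea. It is written in Python rather than Haskell so it can run on its own, and it is not pandoc code: the `Block` type, `render_path`, and `split_into_chunks` are hypothetical stand-ins, and the rule "start a new chunk at every heading whose level is less than or equal to the split level" is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a pandoc block; not the real AST."""
    kind: str                                   # "Header" or "Para"
    text: str
    level: int = 0                              # heading level, 0 for non-headings
    identifier: str = ""                        # heading identifier, "" otherwise
    links: list = field(default_factory=list)   # link targets such as "#intro"


def render_path(template, number, heading, identifier):
    """Fill {{ number }}, {{ heading }} and {{ identifier }} in a path template."""
    values = {"number": str(number), "heading": heading, "identifier": identifier}
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], template)


def split_into_chunks(template, level, blocks):
    """Return [(path, blocks)], rewriting "#id" links that cross chunk borders.

    Note: the links are rewritten in place on the given blocks.
    """
    # Assumption: every heading at `level` or above it starts a new chunk.
    groups = []
    for block in blocks:
        if (block.kind == "Header" and block.level <= level) or not groups:
            groups.append([])
        groups[-1].append(block)

    # First pass: name each chunk and record which file holds each identifier.
    paths, home = [], {}
    for number, group in enumerate(groups, start=1):
        first = group[0]
        heading = first.text if first.kind == "Header" else ""
        paths.append(render_path(template, number, heading, first.identifier))
        for block in group:
            if block.identifier:
                home[block.identifier] = paths[-1]

    def retarget(target, current_path):
        if not target.startswith("#"):
            return target                       # external link, leave alone
        owner = home.get(target[1:], current_path)
        return target if owner == current_path else owner + target

    # Second pass: point cross-chunk links at "file#id".
    for path, group in zip(paths, groups):
        for block in group:
            block.links = [retarget(t, path) for t in block.links]
    return list(zip(paths, groups))


doc = [
    Block("Header", "Intro", 1, "intro"),
    Block("Para", "See the methods.", links=["#methods", "#intro"]),
    Block("Header", "Methods", 1, "methods"),
    Block("Header", "Details", 2, "details"),
    Block("Para", "Back to intro.", links=["#intro", "https://pandoc.org"]),
]
chunks = split_into_chunks("chapter-{{ number }}.html", 1, doc)
```

The two-pass shape is the point of the sketch: identifiers have to be mapped to files before any link can be rewritten, because a link may point forward to a chunk that has not been named yet.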

Perhaps there should also be an option for adding "next," "previous," and "up" links to each chunk, as in the HTML output produced by texinfo? We could use arrows instead of the words "Next", "Previous", and "Top" to avoid English-centrism?

Just adding this to Shared would be helpful. Then we'd need to think about how to integrate it onto the command line. Perhaps the simplest approach would be this: if the output file is FILE.zip, then pandoc will create a zip file with chunked output in the specified output format (template FILE-#.FORMAT). So -t rst -o my.zip would produce a zip of chunked RST files, for example. A separate command line option could be provided to set the level for splitting, like the current --epub-chapter-level but more general. (Indeed, --epub-chapter-level could then be deprecated and replaced with this.)

jgm avatar Feb 06 '20 16:02 jgm

This would be very useful.

Outputting a zip file is simple, but the first thing any makefile or batch file is going to have to do is unzip it in order to further process it. How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder.

Or maybe just an option that means output a folder of files instead of one file (that is, chunked). This is more verbose, but it's clearer what you are doing.

An option for next, previous, and up links (using arrows) would be nice.

For HtmlHelp, we also need to create the project (.hhp), content (.hhc), and index (.hhk) files. Perhaps HtmlHelp is a separate issue, and if you want me to create a new issue specifically for it I can do so. But any HtmlHelp writer will need to make use of the chunked html output option, so it's good to think about how to integrate both of these into the command line. For that matter, epub output is related as well.

dm413 avatar Feb 07 '20 15:02 dm413

How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder.

That's a possibility. I like the idea of keeping the simple invariant that pandoc produces one file, but I can see this would be more convenient.

jgm avatar Feb 07 '20 16:02 jgm

An option for next, previous, and up links (using arrows) would be nice.

I think this is a job for the template, assuming each file would be run through the template separately. Pandoc could add metadata fields this-file: NAME, prev-file: NAME, next-file: NAME so that people can include and design those links if and as they want them in the template.
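A rough sketch of how those per-file fields could be computed, in Python and independent of pandoc. The field names this-file, prev-file, and next-file come from the comment above; the function name `navigation_metadata` is made up for illustration.

```python
def navigation_metadata(paths):
    """Return one metadata dict per chunk, in document order.

    The first chunk gets no "prev-file" and the last gets no "next-file",
    so a template conditional can simply skip the missing link.
    """
    result = []
    for i, path in enumerate(paths):
        meta = {"this-file": path}
        if i > 0:
            meta["prev-file"] = paths[i - 1]
        if i + 1 < len(paths):
            meta["next-file"] = paths[i + 1]
        result.append(meta)
    return result


nav = navigation_metadata(["intro.html", "methods.html", "results.html"])
```

Leaving a field out, rather than setting it to an empty string, is a design choice here: it lets the template decide what to show at the ends of the sequence.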

bpj avatar Feb 07 '20 18:02 bpj

I think this is a job for the template

That makes sense to me!

jgm avatar Feb 07 '20 20:02 jgm

In order to facilitate building static sites (or dumping to templates used in static sites such as jekyll or hugo), it would be useful to be able to specify a pattern for the output.

For example, I might want to run the command like:

pandoc -f markdown -t html5 \
  --chunks chapters --chunk-dest ~/projects/some-site/templates/my-book/ \
  {first,last,second}-chapter.md

And the output would be:

templates/
├── first-chapter.html
├── last-chapter.html
└── second-chapter.html

Or, there might be a way to specify other patterns so that someone could use config like

chunk-name: '{{ section[0]["name"] }}/{{ section[1]["name"] }}{{ ext }}'

To get output like first-chapter/first-section.html, first-chapter/second-section.html, etc.
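As an illustration of the nested pattern, the sketch below derives chapter/section paths from a flat list of headings. It is plain Python and only an approximation: `slugify` is a simplistic stand-in (not pandoc's identifier algorithm), and `section_paths` hard-codes the two-level pattern instead of evaluating a template string.

```python
import re


def slugify(title):
    """Lowercase and join runs of non-alphanumerics with hyphens (simplified)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def section_paths(headings, ext=".html"):
    """headings: [(level, title)] in document order.

    Returns one "chapter-slug/section-slug.ext" path per level-2 heading,
    using the most recent level-1 heading as the directory name.
    """
    paths, chapter = [], None
    for level, title in headings:
        if level == 1:
            chapter = slugify(title)
        elif level == 2 and chapter is not None:
            paths.append(f"{chapter}/{slugify(title)}{ext}")
    return paths


paths = section_paths([
    (1, "First Chapter"),
    (2, "First Section"),
    (2, "Second Section"),
    (1, "Last Chapter"),
    (2, "Only Section"),
])
```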

hakan-geijer avatar May 20 '20 19:05 hakan-geijer

It may make sense for Pandoc to work with trees of files instead of single streams:

  • A writer would produce a tree of files.
    • In the case of HTML that would be HTML files and (extracted) image files.
  • Users of Pandoc could then decide whether to write such a tree to the file system, into a ZIP file, etc.

rauschma avatar May 29 '20 20:05 rauschma

This functionality might also be useful in filters.

zspitz avatar Dec 08 '20 11:12 zspitz

I'm very excited for this possibility. Does being in "next release" mean that it is actually decided to implement it?

jtbayly avatar Apr 05 '21 18:04 jtbayly

I'm afraid the "next release" tag has been aspirational so far... I would like to implement this, but it's going to take some thought.

jgm avatar Apr 05 '21 18:04 jgm

Understood. That’s why I asked. Thanks so much for all your wonderful work.

jtbayly avatar Apr 05 '21 19:04 jtbayly

It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs

The HTML output is split into different files and crossreferences work.

I guess this tells me there's some way of doing this now ... any ideas how?

ricopicone avatar Apr 26 '21 00:04 ricopicone

It apparently happens here.

I don't know R and it's 1100 lines ... there's a lot going on here.

ricopicone avatar Apr 26 '21 00:04 ricopicone

Fwiw, somebody made a pretty comprehensive filter-based version of multiple-output html files that fixes crossreference urls ... I haven't tested: https://groups.google.com/g/pandoc-discuss/c/bKhBB_uFW4o/m/uuLV7hMYCwAJ

ricopicone avatar Apr 26 '21 01:04 ricopicone

It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs

The HTML output is split into different files and crossreferences work.

I guess this tells me there's some way of doing this now ... any ideas how?

I'm pretty sure this isn't the best path, but epub files are made of multiple chunks of .xhtml. For a while I've been doing this by generating .epub files with Pandoc and then using a task runner to automate the unzipping > extracting > parsing > processing > moving > fixing > renaming of the xhtml files as needed. It's an ugly hack I made a couple of years ago to solve this need for a very specific case; maybe something like that could work for you in the meantime.

barriteau avatar Apr 26 '21 01:04 barriteau

Thanks @barriteau -- do you have your code for that? As may be the case for others, I'm making large html docs and having performance issues. There's only so much improvement I can get out of lazy loading images and the like ... mostly it's MathJax. But there's no significant reason for it to be one-file other than Pandoc. A stop-gap solution until this feature is implemented would be most welcome :)

ricopicone avatar Apr 26 '21 01:04 ricopicone

Yup, but I'm afraid that in its current state it's of no use to you; it's an old Grunt task with a lot of extra, specific routines for other unrelated stuff. I'll take a look at it to see if it's worth cleaning up for sharing and reuse, and I'll let you know :)

barriteau avatar Apr 26 '21 01:04 barriteau

I've looked at how Bookdown does it before. Part of the reason it is so complicated is because it supports a fair number of Pandoc options, which changes the output that it then has to process. In fact, I use Bookdown currently. One of the things that makes me hopeful about Pandoc making this change is that it might fix a couple of problems I've got with Bookdown related to its splitting process.

jtbayly avatar Apr 26 '21 12:04 jtbayly

I still think my Feb. ~~20~~ 6 comment above gives a good route forward on this. Most of the technical issues have already been solved, since we already have to chunk things for EPUB. It would be good to have code that could simply be reused by the EPUB writer. I think the issue about "Next/Previous/Up" links could be solved simply by populating template variables; using a custom template, you could get whatever kind of navigation links you like.

So, rough plan would be

  • [ ] Implement the splitting function, something like splitIntoChunks :: FilePathTemplate -> Level -> Pandoc -> [(FilePath, Pandoc)]. Look at the EPUB writer's splitting code in implementing this.
  • [ ] Refactor the EPUB writer to use the general purpose splitting function.
  • [ ] Support output to a .zip container for any output format, as follows:
    • use splitIntoChunks to split the document into chunks,
    • then use the writer for the selected format to write each chunk (setting template variables for next/previous/up/top that could be used in a template)
    • then use zip-archive to create a zip that combines the chunks, as well as any media contained in the MediaBag. (EDIT: or just allow writing a directory directly; this may be more useful than creating a zip.)
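The third step of the plan can be sketched as follows, using Python's standard zipfile module in place of Haskell's zip-archive. The function name `write_chunks_to_zip`, the `render` callback (standing in for "the writer for the selected format"), and the plain dict standing in for the MediaBag are all assumptions made for the example.

```python
import io
import zipfile


def write_chunks_to_zip(chunks, media, render):
    """Render each chunk and pack the results, plus media, into a zip.

    chunks: [(path, doc)]   output path and document for each chunk
    media:  {path: bytes}   stand-in for the contents of the MediaBag
    render: doc -> str      stand-in for the selected writer
    Returns the bytes of the finished archive.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, doc in chunks:
            archive.writestr(path, render(doc))   # str is stored as UTF-8
        for path, data in media.items():
            archive.writestr(path, data)
    return buffer.getvalue()


archive_bytes = write_chunks_to_zip(
    [("chapter-1.html", "one"), ("chapter-2.html", "two")],
    {"media/figure.png": b"\x89PNG"},
    lambda doc: f"<p>{doc}</p>",
)
```

Writing a directory instead of a zip would only change the last step: the same (path, content) pairs would be written to files under a root folder.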

jgm avatar Apr 27 '21 15:04 jgm

@jgm Still a somewhat vague idea of mine – do you think it’s possible to make your ideas more general? For example:

  • Input is [(FilePath, Chunk)].
    • A Chunk is either:
      • Pandoc
      • FileData. Not sure what exactly that type would look like. Sometimes data in RAM, sometimes a reference to a file on a hard drive?
  • Examples of input that immediately produces a tree = a sequence of Chunks:
    • a directory with HTML files
    • a directory with LaTeX files
    • a directory with Markdown files
  • Every transformation/compilation step is a function transformationFunction :: [(FilePath, Chunk)] -> [(FilePath, Chunk)]
  • Writers would also be such transformation functions.
  • Some chunks may remain files, others may change from FileData to Pandoc and back.
  • At the end, there would be a “persister” that writes a Chunk tree to the file system or to a ZIP file.
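A small Python sketch of this tree model, just to pin down the shapes involved. Every name in it (`Doc`, `FileData`, `parse_markdown`, `write_html`, `persist`) is hypothetical, `Doc` is only a placeholder for a parsed Pandoc document, and the "reader" and "writer" are toys.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path, PurePosixPath
from typing import Callable, Union


@dataclass
class Doc:
    """Placeholder for a parsed Pandoc document."""
    text: str


@dataclass
class FileData:
    """Raw file contents held in memory."""
    data: bytes


Chunk = Union[Doc, FileData]
Tree = list[tuple[str, Chunk]]
Transformation = Callable[[Tree], Tree]


def parse_markdown(tree: Tree) -> Tree:
    """Toy reader: turn .md FileData into Doc, leave other chunks untouched."""
    return [
        (path, Doc(chunk.data.decode("utf-8")))
        if path.endswith(".md") and isinstance(chunk, FileData)
        else (path, chunk)
        for path, chunk in tree
    ]


def write_html(tree: Tree) -> Tree:
    """Toy writer: turn each Doc back into FileData with an .html name."""
    out: Tree = []
    for path, chunk in tree:
        if isinstance(chunk, Doc):
            html = f"<p>{chunk.text}</p>".encode("utf-8")
            out.append((str(PurePosixPath(path).with_suffix(".html")), FileData(html)))
        else:
            out.append((path, chunk))
    return out


def persist(tree: Tree, root: Path) -> None:
    """The "persister": write every FileData chunk below root."""
    for path, chunk in tree:
        if not isinstance(chunk, FileData):
            raise TypeError(f"{path} was never written to a concrete format")
        target = root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(chunk.data)


source: Tree = [
    ("book/ch1.md", FileData(b"Hello")),
    ("book/fig.png", FileData(b"\x89PNG")),
]
result = write_html(parse_markdown(source))
with tempfile.TemporaryDirectory() as tmp:
    persist(result, Path(tmp))
    written = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
```

Because every step has the same type, steps compose by plain function application, and a zip persister could replace `persist` without touching the rest.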

rauschma avatar Apr 27 '21 17:04 rauschma

Something else to consider:

In bookdown you can specify to split the HTML up by chapter, by section, or by file. I like that flexibility, fwiw, especially the split by file option. Split by chapter sometimes gives me way too long of webpages. Split by section sometimes leaves me with nothing but a chapter title on one webpage, and then you've got to go to the next webpage to get to the next section. Split by file lets me decide.

jtbayly avatar Apr 27 '21 17:04 jtbayly

@rauschma we already have a MediaBag to contain assets used by the document. These get passed through the plumbing in PandocMonad, so we shouldn't need to represent them explicitly. But I take the core of your idea to be that we might want to support "trees" (directories containing multiple documents) in both input and output (my proposal above is output only). This would require, at least, the change noted in https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ.

jgm avatar Apr 27 '21 17:04 jgm

@jtbayly - I don't know what "split by file" would really mean, when you're splitting up a Pandoc document. (It doesn't come chunked into files.)

jgm avatar Apr 27 '21 17:04 jgm

But it accepts multiple files as input, doesn't it?

jtbayly avatar Apr 27 '21 18:04 jtbayly

jgm, I think you are referring to your Feb 6 comment, not Feb 20. <rant>I detest github's "relative" dates. When I see "commented 22 days ago", I have no idea when that was without looking at a calendar. And "2 months ago" is meaningless.</rant>

In terms of planning, how would the TOC be done, and could that be templated as well? I'm thinking formats such as epub and htmlhelp need a TOC file in one form or another, and it would be nice if the output zip file (or directory) contained the TOC information in a form that could be turned into the required file. Even if you only intend to use the chunked html as a static web-site, you probably want to generate a TOC someplace in your site, perhaps a banner or column on every page. This file should respect the --toc-depth option.

Another question I have is how would I create an index. Here I am referring to an alphabetical index like you might see at the end of a book, not a TOC. Epub, HtmlHelp, and pdf all support such a concept. AFAIK Pandoc does not support an index natively. This may be a separate issue, and off-topic here, but I'd be interested in any thoughts you have about how to do this, even if it involves a filter and/or post-processing the output zip file/directory.

dm413 avatar Apr 27 '21 18:04 dm413

@jtbayly Yes, you can specify multiple files as input; however, everything is concatenated before parsing, and the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ); moreover, the AST doesn't contain slots to represent source positions. A 'Pandoc' is an abstract representation of a document; you can get the same 'Pandoc' from multiple files or from one.

jgm avatar Apr 27 '21 19:04 jgm

@dm413 Yes, we need to figure out how to deal with the TOC. I think the simplest option is to generate a TOC for the whole document (tree) and put it in one of the generated files. But this may not be the best approach if you want the TOC in a side banner.
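For the whole-tree TOC, a minimal Python sketch might look like this. It assumes the splitter already knows, for each heading, its level, identifier, title, and the file it ended up in; `build_toc` and the Markdown-style list output are illustrative choices, and `depth` plays the role of --toc-depth.

```python
def build_toc(headings, depth):
    """Build a nested list of links covering every generated file.

    headings: [(level, identifier, title, path)] in document order
    depth:    deepest heading level to include
    Returns Markdown-style list lines, indented two spaces per level.
    """
    lines = []
    for level, identifier, title, path in headings:
        if level > depth:
            continue                      # deeper than the requested depth
        indent = "  " * (level - 1)
        lines.append(f"{indent}- [{title}]({path}#{identifier})")
    return lines


toc = build_toc(
    [
        (1, "intro", "Intro", "chapter-1.html"),
        (2, "scope", "Scope", "chapter-1.html"),
        (3, "fine-print", "Fine print", "chapter-1.html"),
        (1, "methods", "Methods", "chapter-2.html"),
    ],
    depth=2,
)
```

The same list could be placed in one generated file or handed to every chunk as a template variable, which would cover the side-banner case.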

As for an index, that's a separate issue in a way, since you could want an index even with non-chunked output. Currently there's no built-in way to construct one, but it's certainly possible to use a filter to define an indexing system. One difficulty with building in a general index system is that the requirements tend to be format-dependent. If you want, you can create a separate issue for indexes on this tracker (if there isn't one already).

jgm avatar Apr 27 '21 19:04 jgm

I did a quick search; there is issue #6415, "Built-in support for indices?"

dm413 avatar Apr 27 '21 20:04 dm413

the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ);

Interesting proposal.

I took a look at the bookdown code, since I wondered how they did it, given what you said about how Pandoc works. Apparently they add an HTML comment everywhere a split needs to happen before sending it to Pandoc, then they parse it afterwards using those comments to figure out where to split.
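The marker trick described above can be sketched in a few lines of Python. This is not bookdown's code: the marker text, the function names, and the heading regex are all made up for illustration, and the regex ignores complications such as "#" lines inside fenced code blocks.

```python
import re

# Hypothetical marker; bookdown's actual comment text may differ.
MARKER = "<!--chunk-break-->"


def mark_split_points(markdown, level=1):
    """Insert the marker before every ATX heading of exactly `level`."""
    heading = re.compile(rf"^{'#' * level} ", flags=re.MULTILINE)
    return heading.sub(lambda m: f"{MARKER}\n\n{m.group(0)}", markdown)


def split_html(html):
    """Split converted HTML at the markers, dropping empty pieces."""
    parts = [part.strip() for part in html.split(MARKER)]
    return [part for part in parts if part]


source = "# One\n\ntext\n\n## Sub\n\n# Two\n\nmore\n"
marked = mark_split_points(source)

# Stand-in for the converted output, with the comments passed through intact.
converted = (
    "<!--chunk-break-->\n<h1>One</h1><p>text</p>\n"
    "<!--chunk-break-->\n<h1>Two</h1><p>more</p>\n"
)
pages = split_html(converted)
```

The approach works on text before and after conversion, so it never needs the AST to remember where the split points were.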

jtbayly avatar Apr 27 '21 20:04 jtbayly