Support ebooks and pdf export

Open mdinger opened this issue 9 years ago • 69 comments

Gitbook supports export to ebooks and pdfs via calibre. This might be easy to hook into.

See also https://github.com/rust-lang/rust-by-example/issues/684 for problems this implementation creates for rustbyexample.

mdinger avatar Dec 30 '15 16:12 mdinger

I would like to support pdf and ebook format. I think this could already be developed out of tree if you use the Renderer trait from mdBook.

I am not sure I want to depend on a full-blown GUI tool though. There must surely be a better alternative for that.

azerupi avatar Dec 30 '15 17:12 azerupi

Not familiar with many conversion tools like this. Pandoc also seems like a plausible option. Don't know of any others.

mdinger avatar Dec 30 '15 17:12 mdinger

Yeah pandoc seems a lot better!

azerupi avatar Dec 30 '15 18:12 azerupi

Did some exploration on this and seems doable. Here's the default epub version of the Rust book. Note the chapters out of order and links not working.

To get good output, I think we would need to:

  • parse the ToC to get the list of md files, in the right order
  • concat and transform the markdown files, replacing file links with internal links
  • match the themes with epub versions of the styles
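
The first step in that list, reading the ToC to get the chapter files in order, could be sketched as below. This is only an illustration, not mdBook's actual parser: it assumes the ToC is a SUMMARY.md-style file in which each chapter is a list item holding one `[Title](path.md)` link, and the function name `chapter_paths` is made up for the example.

```rust
/// Extract chapter paths from SUMMARY.md-style text, in document order.
/// Assumption: each chapter line contains one `[Title](path.md)` link.
fn chapter_paths(summary: &str) -> Vec<String> {
    let mut paths = Vec::new();
    for line in summary.lines() {
        // Locate the "](" that separates the link title from its target.
        if let Some(start) = line.find("](") {
            let rest = &line[start + 2..];
            // The target runs up to the next closing parenthesis.
            if let Some(end) = rest.find(')') {
                let target = &rest[..end];
                // Only local markdown files count as chapters.
                if target.ends_with(".md") {
                    paths.push(target.to_string());
                }
            }
        }
    }
    paths
}
```

Nesting depth is ignored here on purpose: for concatenation only the order of the files matters.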

I'm interested in working on this but will be a bit slow.

Useful info here: Pandoc commands and styling options

asolove avatar Jan 11 '16 14:01 asolove

  • parse the ToC to get the list of md files, in the right order
  • concat and transform the markdown files, replacing file links with internal links

@asolove, I have implemented this (among other transformations) in https://github.com/killercup/trpl-ebook, feel free to use my code.

killercup avatar Jan 11 '16 14:01 killercup

@killercup great, thanks!

asolove avatar Jan 11 '16 14:01 asolove

Great! Thanks for doing this :)

parse the ToC to get the list of md files, in the right order

This is already done in the Rust code, the MDBook struct can be iterated on. If you make a new Renderer you have access to that.

concat and transform the markdown files, replacing file links with internal links

Concatenating the markdown files is also not that hard, I do it for the print page.

Replacing the links could be a little trickier, what should internal links look like for pandoc? I know that pulldown-cmark gives you the ability to transform the parsed markdown events before rendering, but it's not well documented. Maybe link replacing is within its capabilities.

Static files, like images, will probably also need some special treatment to be included correctly?


I'm interested in working on this but will be a bit slow.

That is absolutely no problem, there is no rush. ~~I will assign this issue to you so that others can see you are working on it.~~ (can't assign you). If you need any help, feel free to ask here :)

I am also planning on doing a big refactor (#90) to clean up and create a better API. For example, I am thinking about adding a way to poll the MDBook struct for specific chapters, etc. This would make it a lot more flexible for Renderers, and it would also help if I end up doing something like #93. If you have suggestions or requests that might be relevant, post them in #90 so that I / we can brainstorm and come up with a good design :)

azerupi avatar Jan 11 '16 14:01 azerupi

Replacing the links could be a little trickier, what should internal links look like for pandoc?

FYI, I'm doing some regex work to transform links relative to the doc.rust-lang.org domain and make reference link names unique for the combined markdown file.

killercup avatar Jan 11 '16 15:01 killercup

FYI, I'm doing some regex work to transform links relative to the doc.rust-lang.org domain

let cross_section_link = Regex::new(r"]\((?P<file>[\w-_]+)\.html\)").unwrap();
output = cross_section_link.replace_all(&output, r"](#sec--$file)");

let cross_section_ref = Regex::new(r"(?m)^\[(?P<id>.+)\]:\s(?P<file>[^:^/]+)\.html$").unwrap();
output = cross_section_ref.replace_all(&output, r"[$id]: #sec--$file");

let cross_subsection_link = Regex::new(r"]\((?P<file>[\w-_]+)\.html#(?P<subsection>[\w-_]+)\)").unwrap();
output = cross_subsection_link.replace_all(&output, r"](#$subsection)");

let cross_subsection_ref = Regex::new(r"(?m)^\[(?P<id>.+)\]:\s(?P<file>[^:^/]+)\.html#(?P<subsection>[\w-_]+)$").unwrap();
output = cross_subsection_ref.replace_all(&output, r"[$id]: #$subsection");

Thanks! Does pandoc auto-generate anchors such as #sec--$file from the markdown files, or is that also handled by your code?
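
For readers who want to experiment with this kind of link rewriting without the regex crate, here is a rough standard-library-only sketch of the two inline-link cases. It is not the trpl-ebook code: the function names are invented for the example, reference-style links are not handled, and it accepts any relative target ending in `.html` rather than the narrower file-name pattern used in the regexes above.

```rust
/// Rewrite inline links to sibling `.html` pages into in-document anchors:
/// `](file.html)` becomes `](#sec--file)`, `](file.html#sub)` becomes `](#sub)`.
fn rewrite_links(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find("](") {
        // Copy everything up to and including the "](" marker.
        out.push_str(&rest[..pos + 2]);
        rest = &rest[pos + 2..];
        let end = match rest.find(')') {
            Some(end) => end,
            None => break, // unterminated link: the remainder is copied as-is below
        };
        out.push_str(&rewrite_target(&rest[..end]));
        rest = &rest[end..];
    }
    out.push_str(rest);
    out
}

/// Map one link target to its in-document form (hypothetical helper).
fn rewrite_target(target: &str) -> String {
    // Absolute URLs are left untouched.
    if target.contains("://") {
        return target.to_string();
    }
    let (page, fragment) = match target.split_once('#') {
        Some((page, fragment)) => (page, Some(fragment)),
        None => (target, None),
    };
    match (page.strip_suffix(".html"), fragment) {
        (Some(_), Some(fragment)) => format!("#{}", fragment),
        (Some(file), None) => format!("#sec--{}", file),
        _ => target.to_string(),
    }
}
```

Whether the `#sec--file` anchors actually exist in the combined document depends on how the headers are generated, which is exactly the open question here.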

azerupi avatar Jan 11 '16 15:01 azerupi

@azerupi I'm pretty sure pandoc generates those. I've had problems before because pandoc generates slugs in a different way than rustdoc.

It should be possible to add a specific id to each header, though. The syntax is # Header Name {#header-name} IIRC.
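
As a rough illustration of that idea, the sketch below appends an explicit `{#id}` to every ATX header that does not already end in an attribute block. The slug rule is a simplification chosen for the example (it matches neither pandoc's nor rustdoc's algorithm exactly), and the function names are hypothetical.

```rust
/// Build a simple slug: ASCII letters and digits are kept (lowercased),
/// any run of other characters collapses into a single hyphen.
fn slugify(title: &str) -> String {
    let mut slug = String::new();
    for c in title.chars() {
        if c.is_ascii_alphanumeric() {
            slug.push(c.to_ascii_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    slug.trim_end_matches('-').to_string()
}

/// Append `{#slug}` to ATX headers that lack an explicit attribute block.
/// Limitation: lines inside fenced code blocks are not skipped, so a code
/// line starting with "# " would be treated as a header too.
fn add_header_ids(markdown: &str) -> String {
    markdown
        .lines()
        .map(|line| {
            let title = line.trim_start_matches('#');
            // A header has at least one '#' followed by a space.
            let is_header = title.len() < line.len() && title.starts_with(' ');
            if is_header && !line.trim_end().ends_with('}') {
                format!("{} {{#{}}}", line, slugify(title))
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Fixing the ids up front means the link rewriting no longer depends on how the downstream tool would have slugged each title.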

You might also want to look at adjust_header_level.rs and adjust_reference_names.rs.

killercup avatar Jan 11 '16 15:01 killercup

Ok thanks for all the information, this will probably help @asolove a lot! :)

azerupi avatar Jan 11 '16 16:01 azerupi

Not sure if this will help you guys, but I've created a simple Rust tool which collates multiple markdown files into one, resolving internal links and turning them into anchor links.

We can use this in a pipeline on the way to converting to PDF:

mdcollate book-example/src/SUMMARY.md | pulldown-cmark > test.html && wkhtmltopdf test.html test.pdf

Code can be found here: https://github.com/cetra3/mdcollate

Happy to accept any PRs

cetra3 avatar Jan 12 '16 06:01 cetra3

@cetra3 That is really cool! The plan is to make a "renderer" that does everything so that it can be used with the mdbook build command. So using a command line tool adds some complications. Have you thought about exposing the functionality as a crate?

I am not sure I would add a dependency just for that functionality, because there is always the possibility that it will not be maintained actively. But it could be considered if it offers enough useful methods that we wouldn't have to reinvent here.

azerupi avatar Jan 12 '16 11:01 azerupi

I'm also sceptical about Calibre. We use it in the Russian translation of TRPL and we've come across several problems with EPUB (links are to descriptions in Russian, for reference):

mkpankov avatar Jan 12 '16 21:01 mkpankov

Thanks for sharing your experience :) We will see if pandoc has the same problems, but I think @killercup used it without too much / any problems?

I also vaguely remember we had to hack styles in order to get better PDF. Not sure if it's necessary or not with Pandoc

I am not sure how this is handled with Pandoc, but having a custom theme could be a good thing.

azerupi avatar Jan 12 '16 21:01 azerupi

It's probably possible to wrap up those command line tools into a combined tool or expose it as a Rust library. The last component (HTML to PDF) would need to use FFI, as wkhtmltopdf is written in C++ (it exposes a C API). Not sure whether this adds too much dependency on external tools though.

The complication arises in that markdown is a superset of HTML which means that you need something that can present HTML in a printable fashion. In my experience with this problem, Pandoc and Calibre will do a subset, but you won't get full parity.

cetra3 avatar Jan 13 '16 00:01 cetra3

There are a few things to be aware of, but in general pandoc is really amazing at converting Markdown to LaTeX. Which is what you want, I think—it has some very nice features that you currently can't get with HTML-to-PDF converters. For example, my PDF versions of the Rust Book include cross-references like "This is a mutable variable binding (section 5, page 163)".

If you're no LaTeX wizard (I'm not), you might want to look at this template I threw together.

If you have any issues with this, just mention me.

killercup avatar Jan 13 '16 08:01 killercup

Thanks for all your help Pascal! I will definitely look at what you have currently running and I am pretty sure we will end up stealing a lot of your code (if that is ok with you) :wink:

azerupi avatar Jan 13 '16 13:01 azerupi

+1 for the effort, I am looking forward to using mdbook to produce ebooks.

It seems to have stalled a bit, is anyone currently working on this?

gambhiro avatar Aug 08 '16 14:08 gambhiro

It seems to have stalled a bit, is anyone currently working on this?

Indeed, it has stalled a bit. In the last 6 months I have been overwhelmed with work at school :confused:

I am (very) slowly working on the refactoring / clean-up that I wanted to do. And that work is probably going to change the way this specific feature is going to be implemented. Hopefully I will have some time in September to make significant progress on the internal rewrite so that I can work on new features again.

azerupi avatar Aug 08 '16 15:08 azerupi

@azerupi How much space is there for discussing this feature? There are some specific things I would be looking for in a CLI ebook helper, but maybe you have already decided which way to go.

Some time ago I wrote prophecy, a Ruby gem to automate the tasks I needed when producing ebooks. This is an example of the output. It has been very useful for me, but I believe I am the only user :)

I have been wanting to rewrite it with the hindsight gained since its early days, but when I saw this I thought maybe mdbook would be able to produce the same results.

There is an asciinema recording to see the sort of things it does.

gambhiro avatar Aug 09 '16 11:08 gambhiro

I'm open to all ideas :)

azerupi avatar Aug 09 '16 12:08 azerupi

Thanks. I will gather my thoughts and post a longer comment on what ideas worked.

gambhiro avatar Aug 09 '16 19:08 gambhiro

When I was building prophecy, it was important to be able to:

  • A) quickly build a well-designed ebook from minimal input (i.e. just the manuscript in markdown and the TOC sequence)
  • B) but let the lib read settings in the book's folder to influence structural behaviour on a per-book basis,
  • C) have access to the attributes of the book and the chapters in the manuscript files (such as the ERB-style <%= chapter.title %> or inserting the ISBN number with <%= book.isbn_ebook %>)
  • D) add custom CSS with @font-face embedded fonts,
  • E) build the EPUB and MOBI with different settings,
  • F) generate valid ebooks that trigger no warnings in Sigil's validator
  • G) generate both the toc.ncx (for the TOC menu) and an HTML Contents page which the reader sees after the title page

I thought it was going to be an insane lot of config options, but after a while it started to be sufficient for any book, and this much was enough.

I should add that LaTeX is mentioned a lot in the sources, but I ended up avoiding generating LaTeX content. Now this might be a different experience if you are not so picky about every paragraph.

I produce books for printing, and the LaTeX files have the most specific hair-trigger accurate tweaks and custom macros, so usually I produce the LaTeX sources first, and convert to markdown from there for the ebooks.

The book's config files were in .yml format, and their option keys would get overwritten from the general towards the specific.

  • book.yml - general info (title, author, ISBN)
  • epub_mobi.yml - shared data for EPUB and MOBI (chapter list for TOC, language attr, etc)
  • epub.yml - only for epub (such as excluding assets which only go to the mobi)
  • mobi.yml - only for mobi
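
The general-to-specific override rule can be sketched in a few lines. This is a simplified stand-in, not prophecy's implementation: settings are flattened to string key/value pairs instead of parsed YAML, and the names `Settings`, `merge_layers` and `layer` are invented for the example.

```rust
use std::collections::BTreeMap;

/// Flat key/value settings; a real config would hold nested YAML values.
type Settings = BTreeMap<String, String>;

/// Merge layers ordered from most general to most specific.
/// A key set in a later layer overwrites the same key from an earlier one.
fn merge_layers(layers: &[Settings]) -> Settings {
    let mut merged = Settings::new();
    for layer in layers {
        for (key, value) in layer {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}

/// Build a layer from literal pairs (illustrative helper only).
fn layer(pairs: &[(&str, &str)]) -> Settings {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

For an EPUB build the layers would be passed in the order book.yml, epub_mobi.yml, epub.yml; for MOBI the last layer would be mobi.yml instead.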

For F), I wanted to be able to track content changes with a diff tool, but Sigil had a habit of renaming the folders. Check out the epub template that worked eventually. The lib also copies Fonts, Images and Styles from assets into the OEBPS folder. See the book Travessia for example.

(btw I lost the habit of committing the generated epub source files. It is useful to be able to inspect them when working on a book but committing it was overkill.)

Although the EPUB format is liberal about folder names for images, fonts, etc., I found that when other people made small corrections in the ebook with Sigil, it would rename the folders without asking. The above template avoids this.

For the design in A), three stylesheets were necessary, because you

  1. want to make it pretty in the EPUB for iPads and such, but
  2. for the Kindle screen (Paperwhite and newer) you need to tone it down or optimize for contrast and readability, and
  3. you want to support old Kindles which don't comprehend @font-face CSS.

You can see in page.xhtml.erb that it would select different stylesheets depending on whether it was building for EPUB or MOBI.

The new Kindles use the KF8 format, and you can support the legacy format with the media="amzn-mobi" media query, while the new Kindles will pick up media="amzn-kf8".
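
A minimal sketch of that per-target stylesheet selection, written in Rust rather than ERB. The stylesheet file names are placeholders and the `Target` enum and function names are hypothetical; only the two `media` values come from the description above.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Target {
    Epub,
    Mobi,
}

/// Return the `<link>` tags a page template would emit for one target.
fn stylesheet_links(target: Target) -> Vec<String> {
    match target {
        // EPUB readers get the single "pretty" stylesheet.
        Target::Epub => vec![link("epub.css", None)],
        // MOBI builds ship both sheets; the device picks one by media query.
        Target::Mobi => vec![
            link("kf8.css", Some("amzn-kf8")),
            link("mobi-legacy.css", Some("amzn-mobi")),
        ],
    }
}

/// Format one stylesheet link, with an optional `media` attribute.
fn link(href: &str, media: Option<&str>) -> String {
    match media {
        Some(m) => format!(
            r#"<link rel="stylesheet" type="text/css" href="{}" media="{}" />"#,
            href, m
        ),
        None => format!(r#"<link rel="stylesheet" type="text/css" href="{}" />"#, href),
    }
}
```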

To generate MOBI, the best practice seems to be to build it as an EPUB, then run Amazon's Kindlegen CLI tool on it, to produce the MOBI.

Now Kindlegen has the strange logic of including the source EPUB in the resulting mobi file, and so your output will be double in size.

The response to this was the kindlestrip.py script by Paul Durrant, which strips this out from the mobi, see more on what it does in the comments in its header.

Back to prophecy, the gem had the stylesheets in the user's gemdir, so the behaviour was:

  • initialize a new book without the CSS
  • copy from the gemdir when building the ebook files
  • allow customizing with a command prophecy assets, which made a copy of the assets to the book's folder
  • if local assets were found, the lib would compile and use those instead of those in the gemdir.
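
The "local assets win over the defaults" rule can be sketched as below. The directory name `assets` and the function name are assumptions for the example, and the existence check is passed in as a closure so the rule can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Pick the asset directory for a build: a book-local copy takes
/// precedence over the shared defaults shipped with the tool.
fn resolve_assets(
    book_dir: &Path,
    default_dir: &Path,
    exists: impl Fn(&Path) -> bool,
) -> PathBuf {
    // Assumed convention: local assets live in `<book>/assets`.
    let local = book_dir.join("assets");
    if exists(&local) {
        local
    } else {
        default_dir.to_path_buf()
    }
}
```

In a real build the closure would simply be `|p| p.is_dir()`.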

Some of this wasn't so good. It turns out that almost all ebooks needed at least a little tweak in either the CSS, the typefaces, etc., and I got into the habit of always including the assets with the book.

This is probably enough for now. Phew, I hope all the links work! :)

gambhiro avatar Aug 10 '16 13:08 gambhiro

The details look a bit scary, but much of the complexity is dealt with in the page templates.

Otherwise it is just this much:

  • read in settings
  • use local assets when provided
  • copy assets to the right folder
  • render the ebook data files
  • render chapters to HTML
  • zip to EPUB
  • plus kindlegen, kindlestrip when doing MOBI
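
Written down as data, the steps above might look like the following sketch. The enum and function names are invented for illustration; each step would still need a real implementation behind it.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Format {
    Epub,
    Mobi,
}

#[derive(Debug, PartialEq)]
enum Step {
    ReadSettings,
    ResolveAssets,
    CopyAssets,
    RenderDataFiles,
    RenderChapters,
    ZipEpub,
    Kindlegen,
    Kindlestrip,
}

/// List the build steps for one output format, in execution order.
fn build_plan(format: Format) -> Vec<Step> {
    let mut plan = vec![
        Step::ReadSettings,
        Step::ResolveAssets,
        Step::CopyAssets,
        Step::RenderDataFiles,
        Step::RenderChapters,
        Step::ZipEpub,
    ];
    // MOBI is produced from the finished EPUB, then stripped of the
    // embedded source copy.
    if format == Format::Mobi {
        plan.push(Step::Kindlegen);
        plan.push(Step::Kindlestrip);
    }
    plan
}
```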

gambhiro avatar Aug 10 '16 13:08 gambhiro

I would like to contribute code too, but I would need guidance. It has not been long since I read my first Rust tutorial :) I would much rather add to a well-designed outline where I can.

gambhiro avatar Aug 10 '16 13:08 gambhiro

@gambhiro Thanks for all the info, I glanced over it and will certainly take a deeper look when the time comes to write the pdf / epub renderer!

In the meantime, I updated the issue for the refactoring I want to do. It outlines at a high level what I think should happen. This should unblock a lot of the most wanted features.

I put the points in the order I think they should be implemented, of course this order doesn't need to be followed strictly.

At this point I think it's important to discuss and iterate over some designs. So if you want to help out I suggest you take a look and start a discussion in the area that interests you :smiley:

There are a couple of issues that can be implemented directly too, like:

  • Switch to the log crate
  • Switch to Serde
  • Replace JS and CSS dependencies with their equivalent from npm, to allow more regular updates

I will happily mentor and answer any questions you have, don't be afraid to ask.

azerupi avatar Aug 12 '16 14:08 azerupi

Thanks. I'll level up :)

gambhiro avatar Aug 12 '16 16:08 gambhiro

Crowbook just released an update. I wasn't aware of it at all as a tool for publishing to epub/pdf. Might be worth considering as a renderer for these formats.

mdinger avatar Jan 01 '17 04:01 mdinger

Yes, she is doing very diligent work on that project. I have been working on the ebooks for the past few days, and the first thing I did was to read what she wrote in crowbook for this.

It parses the book's files to a data structure, selects the output format from defaults or what the user said on the cli, and gives the data to a function that will do whatever it needs to produce the output files.

Pretty much the obvious thing huh? Basically the task of rendering some markdown files to different formats is a simple and well-defined problem in its nature, so there won't be a lot of different solutions that make sense. It's a lot like the static-site generator idea as well.

One pleasantly surprising thing which @lise-henry does there, though, is that she parses the documents into an "Abstract Syntax Tree" (basically a tree of tokens down to paragraph, link, emphasis) and the output functions iterate over that data. This is instead of each output function reading in the markdown file for its own purposes and doing something with that.
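
To make the idea concrete, here is a toy version of such a token tree with two output functions walking the same data. The types and names are invented for the example and are not crowbook's actual API; escaping of special characters is left out to keep it short.

```rust
/// A tiny document tree: block and inline nodes down to plain text.
enum Token {
    Paragraph(Vec<Token>),
    Emphasis(Vec<Token>),
    Link { url: String, children: Vec<Token> },
    Text(String),
}

/// Render the tree as HTML (no escaping, for brevity).
fn render_html(tokens: &[Token]) -> String {
    tokens
        .iter()
        .map(|token| match token {
            Token::Paragraph(c) => format!("<p>{}</p>", render_html(c)),
            Token::Emphasis(c) => format!("<em>{}</em>", render_html(c)),
            Token::Link { url, children } => {
                format!("<a href=\"{}\">{}</a>", url, render_html(children))
            }
            Token::Text(s) => s.clone(),
        })
        .collect()
}

/// Render the same tree as LaTeX (again without escaping).
fn render_latex(tokens: &[Token]) -> String {
    tokens
        .iter()
        .map(|token| match token {
            Token::Paragraph(c) => format!("{}\n\n", render_latex(c)),
            Token::Emphasis(c) => format!("\\emph{{{}}}", render_latex(c)),
            Token::Link { url, children } => {
                format!("\\href{{{}}}{{{}}}", url, render_latex(children))
            }
            Token::Text(s) => s.clone(),
        })
        .collect()
}
```

Because both renderers consume the same tree, a pass that checks or rewrites the text only has to be written once, before any output format is chosen.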

That token tree allows the content to be processed in terms of writing prose or typographical customs, and so she wrote a module that lets crowbook have opinions about the grammar in the text or correct punctuation. She's a novel writer so it makes sense she would be interested in an automated sanity check which the machine can do.

I remember thinking, "hm, why don't I just use crowbook?" and the only convincing point I could come up with is that I like writing the code :) Which is pretty much why she started writing crowbook as well. I think we like to be at that place, writing code to solve a problem we understand.

There is this TED talk by Uber's founder Travis Kalanick, where he is asked why he is doing this. He says, "... the way I like to describe it is it's kind of like a math professor. You know? If a math professor doesn't have hard problems to solve, that's a really sad math professor."

Have you tried to write a basic but efficient primality test? I think you would find it satisfying!

Best wishes for the new year!

gambhiro avatar Jan 01 '17 08:01 gambhiro