mupdf-rs

How to include image in `Page`'s `to_html` or `to_xhtml` method?

Open: LazyGeniusMan opened this issue 2 years ago • 1 comment
When I try converting a page that contains an image to HTML or XHTML, the image is not included. I used this code:

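use mupdf::{Document, Page};
use std::fs;

fn main() {
    // Open the EPUB and load the page by its zero-based index.
    let doc: Document = Document::open("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\test.epub").unwrap();
    let page: Page = doc.load_page(341).unwrap();
    // Convert the page to HTML; the image is missing from this output.
    let html: String = page.to_html().unwrap();

    // Write the result and fail loudly on an I/O error instead of ignoring it.
    fs::write("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\rs-test.html", html).unwrap();
}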

I got this result: image

There should be an image above the "Figure 10.3" text.

I tried to do the same thing in PyMuPDF with this code:

import fitz

doc = fitz.Document('C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\test.epub')
page = doc[331]  # for some reason, the same page has a different index here
html = page.get_text("html")

with open("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\py-test.html", "w", encoding="utf-8") as file:
    file.write(html)

I got this result: image

The image is included, embedded as base64.

I also tried the same thing with the mutool convert CLI and got the same result as PyMuPDF, but only after enabling an option. I can't find any way to set that option in this crate's to_html method. The options in mutool look like this:

Text output options:
        inhibit-spaces: don't add spaces between gaps in the text
        preserve-images: keep images in output
        preserve-ligatures: do not expand ligatures into constituent characters
        preserve-whitespace: do not convert all whitespace into space characters
        preserve-spans: do not merge spans on the same line
        dehyphenate: attempt to join up hyphenated words
        mediabox-clip=no: include characters outside mediabox
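
As a stopgap from Rust, one possibility is to shell out to mutool with that option enabled. This is only a rough sketch, not something the crate provides: it assumes mutool is installed and on PATH, and the helper names mutool_html_args and convert_with_images are made up for illustration. The page argument is passed through as a string because mutool takes page numbers or ranges, and the number that matches a given mupdf-rs index may differ (as it did with PyMuPDF above).

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Build the argument list for `mutool convert` with HTML output and the
/// `preserve-images` text option from the list above.
/// (Hypothetical helper, not part of mupdf-rs.)
fn mutool_html_args(input: &str, output: &str, pages: &str) -> Vec<String> {
    vec![
        "convert".to_string(),
        "-F".to_string(),
        "html".to_string(),
        "-O".to_string(),
        "preserve-images".to_string(),
        "-o".to_string(),
        output.to_string(),
        input.to_string(),
        pages.to_string(),
    ]
}

/// Run mutool with those arguments and return its exit status.
/// Assumes a `mutool` binary is available on PATH.
fn convert_with_images(input: &str, output: &str, pages: &str) -> io::Result<ExitStatus> {
    Command::new("mutool")
        .args(mutool_html_args(input, output, pages))
        .status()
}
```

Calling convert_with_images("test.epub", "out.html", "342") would run the equivalent of mutool convert -F html -O preserve-images -o out.html test.epub 342; the "342" here is just a placeholder for whichever page number mutool uses for the page in question.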

LazyGeniusMan, May 19 '23 01:05

Sorry, this project is not actively maintained at the moment, but I'm happy to accept pull requests to fix this if anyone is up for it.

messense, May 19 '23 02:05