[Feature Request]: EPUB to PDF conversion needed
Feature Description
:-)
Thank you
Why is this feature valuable?
No response
Suggested Implementation
No response
Additional Information
No response
No Duplicate of the Feature
- [x] I have verified that there are no existing feature requests similar to my request.
Can I work on this?
Absolutely!
Hi @0x-2FA,
I wanted to check in and see if you’ve made any progress on this. This feature has recently become personally relevant to me, and I’m also willing to work on it. Of course, I don’t want to step on your toes if you’re already making good progress or are close to opening a PR.
Let me know where things stand :)
Thanks!
@Balazs-Szucs Yes, of course. I want that feature myself, that's why I wanted to contribute. Unfortunately, I haven't found the time to dig in. In fact, I cloned the repo quite recently. You are more than welcome to take the task.
Thanks for the response @0x-2FA, no worries. Not urgent on my side. Thanks for your work :)
Will look into it this week, I finally have some free time.
I think I have it ready!
@Frooodle But before I do the finishing touches I need your help.
Do you think it is better to have a separate tool in the Convert to PDF section or should I simply add it to the existing Convert file to PDF?
I think adding it to Convert file to PDF and updating the Supported Files is the best option imho.
I was thinking separate 😂
In V2, coming out in a few months, we are combining everything. But for now I'd want to have it separate, to call it out as a new feature etc.
Sure, no problem at all. I’ll also add a link for it. Do you have any suggestions for the text label? I can’t think of a good name. I would assume Epub to PDF, but I’d like to hear your opinion too.
Some people prefer "book to pdf", but I think "epub" is more explanatory and good for this stage.
I'd go with the name: Epub to pdf
Description: Convert the book format epub to PDF
Great, on it!
Also for reference, I've tested all the epubs from here https://epubtest.org/test-books and all of them converted successfully to pdf. If you have any specific epub for testing, feel free to tell me.
I'll do some testing once PR is up but otherwise I trust your testing!
@0x-2FA How's it going?
@Frooodle I mostly finished it back then, but I noticed that there is an issue with calibre on Alpine.
The package doesn’t work on alpine:3.22.1. I can only get it to work on alpine:edge. It probably depends on some packages that 3.22.1 doesn’t have, which enable calibre to run.
One option might be adding glibc into Alpine so we can use the prebuilt binary, but I’m not a fan of that approach.
I’m also not keen on using alpine:edge, though I noticed we add the edge repos anyway. If I understand correctly, whatever package we pull comes from edge, which explains the inconsistency and why calibre can’t run.
We might need to wait for them to include it in an official Alpine release instead of just the edge branch.
What do you think?
Ahh, I don't want to use calibre. We previously used it and it had so many issues that I migrated away from it and dropped support.
I recommend just building the EPUB support out directly instead of relying on a 3rd party. Another great OSS PDF tool, https://github.com/iib0011/omni-tools (although their PDF editor is not OSS and is 3rd-party closed source), is a good example of this for PDF to EPUB, and it should be possible to go backwards as well due to how the format works (or at least EPUB to HTML, then re-use HTML to PDF).
Oh ok ok. Will look into it then. We can go with no 3rd party at all, or use something like this: https://github.com/psiegman/epublib.
Yeah that looks good too!
Hi,
Sorry to butt in to the conversation, but epublib is unmaintained and has severe limitations with more modern EPUB specs.
However, a slightly better maintained fork exists, albeit not a very popular one: https://github.com/documentnode/epub4j
Can you @0x-2FA please use that instead for now?
Honestly, I doubt we even need that lib (but let's see).
Thanks for the call out!
Will look into it this week. Since we already have the html to pdf I think we might not need the lib. But I think that it might help with Pdf to Epub conversion (like the creation of the epub after reading the pdf content).
Update on the issue. I feel that epub4j is too good to skip for both Epub to Pdf and Pdf to Epub.
Some things I noticed while working with Epubs this weekend (many of which I did not know before):
- Every epub should contain a .opf file (Open Package Format). This is just an xml file that shows the order to display the files in.
- Some epubs have .html files and some others have .xhtml files.
The library is very helpful for reading the .epub file and its resources (HTML, CSS, images, fonts, and so on). It is especially useful for parsing the .opf file and constructing the "spine" of the Epub which tells us the correct order of the content files.
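To illustrate what "constructing the spine" involves, here is a minimal sketch using only the Python standard library (not epub4j, and not code from this project). It assumes a well-formed EPUB: META-INF/container.xml points at the .opf file, the manifest maps item ids to hrefs, and the spine lists those ids in reading order. The function name read_spine is hypothetical, and percent-encoded hrefs are not decoded.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container file and the .opf package file.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def read_spine(epub_file):
    """Return the content file paths (inside the zip) in reading order."""
    with zipfile.ZipFile(epub_file) as zf:
        # META-INF/container.xml tells us where the .opf file lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).attrib["full-path"]
        opf_dir = posixpath.dirname(opf_path)

        package = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to hrefs relative to the .opf file.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        # The spine references manifest ids in the order they should be shown.
        return [
            posixpath.normpath(posixpath.join(opf_dir, manifest[ref.attrib["idref"]]))
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        ]
```

The returned list mixes .html and .xhtml entries freely, since the order comes from the spine rather than from file names or extensions.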
I think we have 2 options:
- We can do a pass with HtmlToPdf for each .html file and then Merge them in the order that the .opf file provides.
- Create a big .html file (aka merge all the html files first) and then do HtmlToPdf.
I prefer the first option. It seems safer because we do not need to modify any of the original HTML content.
> I prefer the first option. It seems safer because we do not need to modify any of the original HTML content.
I may stand corrected here, but hard disagree.
As for the trade-offs here:
1. Option 1 (HtmlToPdf for each file, then Merge):
The PDF tool doesn't see the whole book at once, so you miss out on big-picture stuff like:
- Spot-on page numbers for a table of contents or index
- Headers and footers that change based on the full layout
- Links that jump between chapters (like from page 5 to page 200)
- Fancy CSS tricks for pages, like forcing breaks, that need the entire flow to make sense
2. Option 2 (merge the HTML first, then HtmlToPdf):
- The tool gets the full picture, so things like page numbers, headers, footers, jump links, and overall styling rules should probably come out well.
- CSS relative links (like "grab the image from ../images/cover.jpg") would break when you combine everything.
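The relative-link problem can be worked around by rewriting each chapter's links against the EPUB root before merging. The following is a rough sketch of that idea under stated assumptions, not a tested part of any converter: the helper names rebase_url and rebase_links are hypothetical, only double-quoted src and href attributes are handled, and the regex is deliberately simplistic (a real implementation would use an HTML parser and would also need to turn cross-chapter links into in-document anchors).

```python
import posixpath
import re
from urllib.parse import urlsplit

# Simplistic matcher for double-quoted src/href attributes (assumption:
# good enough for a proof of concept, not for arbitrary HTML).
ATTR_RE = re.compile(r'\b(src|href)="([^"]*)"')


def rebase_url(chapter_path, url):
    """Resolve a URL found in chapter_path against the EPUB root."""
    parts = urlsplit(url)
    # Leave absolute URLs (http:, data:, ...) and pure #fragments untouched.
    if parts.scheme or parts.netloc or not parts.path:
        return url
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(chapter_path), parts.path)
    )
    query = "?" + parts.query if parts.query else ""
    fragment = "#" + parts.fragment if parts.fragment else ""
    return resolved + query + fragment


def rebase_links(chapter_path, html):
    """Rewrite src/href values so they are relative to the EPUB root."""
    return ATTR_RE.sub(
        lambda m: f'{m.group(1)}="{rebase_url(chapter_path, m.group(2))}"',
        html,
    )
```

With every chapter rewritten this way, the merged file could be rendered with the extracted EPUB directory as its base URL, so the original HTML files stay untouched on disk and only the merged copy is modified.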
WeasyPrint doesn't always handle relative CSS stuff perfectly anyway (it can be finicky and require extra setup like base_url flags), so that might get "ruined" either way. However, if I'm right (and I could be wrong on this, a POC would be good here), then we could at least preserve the features I listed in section 1, which in my opinion are more valuable overall anyway. Ebooks tend not to have that many images (I think), but they always have a table of contents, for example.
I did the EML-to-PDF conversion, and I can tell you for 100% that WeasyPrint does not play well with images, no matter what magic you try. However, indexes, links, headers, footers, and chapter jumps should work well if you go with the second route (but it might need some love in your Java code, e.g., some fixing here and there).
Obviously I am a 3rd party here, and I am assuming you/Frooodle will have the final say, but if it were up to me I would go with option 2, 100%.
I think a quick/messy proof of concept may not be that hard, so that is also there to settle the "argument".
(Sorry to dump this work on you like this, it does feel a bit dirty, since option 2 is probably harder, but it would yield much better results, I think at least.)
Sorry about the grammar, it should be fixed now.
Here is a quick example:
- HTML
- the PDF Weasyprint manages to create
I can tell you, I tried everything to fix this, and I don't think it is possible to make it work. I am assuming you'll most likely get a similar result.
This is, BTW, after I did some adjustments to the CSS. If I hadn't touched it, it would be even worse, with half of the picture out of the "frame".
But links, btw, work perfectly in the footer 😄 so at least that works reliably.
For HTML to PDF we really need a web-based renderer solution. Wkhtmltopdf used to be good, but it's unsupported and has too many security issues.
Chromium or similar could be interesting 🤔
Hey, thanks for the feedback. That’s the reason I shared an update in the first place, and you’re more than welcome to add suggestions or ideas anytime. I will try both options and provide another update. The second option might also be more efficient, I was just thinking it might not be worth it because we would need to dig in and do some manual work to merge the various HTML files.
I’ll look into it and update you again. Since I haven’t yet tested the actual conversion with WeasyPrint, I will also check whether any issues with images or CSS arise. Tbh I'm not really concerned about the CSS part since (in most cases) it is just a single file where they set the font family or text weight etc., nothing too advanced. The images part will be the most interesting, especially after reading your comment 😆
Also, regarding Frooodle’s comment, that’s the reason I initially went with Calibre, so we wouldn’t have to handle all of this ourselves. But, as you already know, it had some issues with Alpine 😅
Edit: Added info on the css/images part.
Hi,
I was researching something unrelated when I stumbled across this: https://flothesof.github.io/pdf-conversion-kindle3.html
Ghostscript can output PDFs that are optimized for Kindle and other book readers. I'd love this as an optimization in the conversion, if possible. I haven't done that much research into this, but I think it is a very worthwhile option for our book/comic formats. I plan to adjust the CBZ/CBR converter to use this as well (in the future), because from what I can see it is very good.
It does have a lot of parameters, which makes it a bit hard to use imho, so don't feel like you're forced to use it if you initially get "bad" results. (It can also be thread safe, since it can process each "page" separately, so people don't complain about performance 😆)
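As a starting point for experimenting, here is a minimal sketch of calling Ghostscript with its built-in ebook preset. The assumptions: a gs binary is on the PATH, and the generic -dPDFSETTINGS=/ebook preset is used, which may differ from the parameters in the linked article. The function names ghostscript_ebook_command and optimize_for_ebook are hypothetical.

```python
import subprocess


def ghostscript_ebook_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command line that rewrites src into dst."""
    return [
        gs_binary,
        "-sDEVICE=pdfwrite",        # write a PDF as output
        "-dPDFSETTINGS=/ebook",     # built-in preset aimed at e-readers
        "-dNOPAUSE",                # do not pause between pages
        "-dBATCH",                  # exit when processing is finished
        "-dQUIET",                  # suppress startup messages
        f"-sOutputFile={dst}",
        src,
    ]


def optimize_for_ebook(src, dst):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_ebook_command(src, dst), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to add or swap parameters while testing which combination gives acceptable results.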
We might also be able to use this as a pipeline step or something, e.g., "Optimize PDF for ebook reading" (but this is very much up for discussion, I don't want to force anything here). Since I personally read a lot of online stuff, I'll also be testing it privately, but I have high hopes. :)
Anyway, giving it a bit more thought, this might be better as a long-term enhancement, but still keep it in mind.
@0x-2FA any news?
Hey, sorry I haven’t looked at this in a while but I'll review it again this week. I remember I was really close.
Hi,
No worries, I don't think this is urgent by any means. I just have a library I want to migrate/archive 😄