email2pdf
email2epub or email2mobi
I am not sure whether this is beyond the scope of the project, but I would appreciate a tool to convert several emails into a single epub. There is a library (https://github.com/setanta/ebookmaker) for converting html files to epub.
Verlonis, it's an interesting idea. I would consider it beyond the scope, yes - although I've no doubt that email2epub or email2mobi utilities could be interesting. I don't have the time to work on those personally, so I'm afraid I'm going to close this bug. Would be happy to link to/publicise/support them if you wanted to write one, though :)
Actually, having looked at the library you linked to, I'll re-open this. Maybe it wouldn't be such a tricky extension after all. The main challenge is that most emails aren't that well structured (certainly not in the way the README for ebookmaker describes). But this could be a fun weekend project! I'll leave the issue open and might try it sometime. But no promises :)
Do you have any sample emails of the type you'd want converted? Might be helpful to attach them to this issue as examples...
I do not have any particular emails in mind. Most emails I get are simple text files, sometimes with attached pictures. The idea is to get the emails onto an e-book reader and read them e.g. on the bus. It would be nice if the conversion could be managed on the command line. Converting epub to mobi is not a problem, since Amazon provides a command-line tool for this (a closed-source one, however). PDF files are not convenient to use on e-book readers. When describing the issue I hoped that most of your code could be reused. I do not think a very faithful conversion is essential: if some elements (like bold-italic text) are not converted exactly, the result is still useful. Some other information about epub generation: http://stackoverflow.com/questions/3454894/how-to-programmatically-convert-html-to-epub
OK, thanks for the thoughts. If I get some time I'll look at this - but no promises, sorry!
Looking a little further at this, looks like the book metadata is described using a .json. Would assume in our case we want a very simple version of this, that points at a .html which is just the same content as normally gets extracted from an email.
Yes, I see it this way too. However, my idea was to point to several .html files at once and make one e-book of all the e-mails. Ideally this would be freely configurable: either one e-book built from all e-mails matching criteria you choose (e.g. one day, one hour, recipient, sender), or one book per e-mail with the subject as the title. I think both ideas are valuable in some contexts. I think we also need a script to clean up the original html files (simply discard the unknown/unnecessary markup).
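To make the clean-up idea a bit more concrete, here is a minimal sketch in Python that keeps a small whitelist of tags and drops everything else. The whitelist, the names `MarkupCleaner` and `clean_html`, and the choice of the standard-library `html.parser` are all assumptions for illustration; a real tool such as html-tidy would do far more.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: adjust to whatever the e-book toolchain accepts.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li",
                "h1", "h2", "h3", "blockquote", "pre", "a", "img"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
VOID_TAGS = {"br", "img"}            # tags that never get a closing tag
SKIP_CONTENT = {"script", "style"}   # drop these tags together with their content


class MarkupCleaner(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        allowed = ALLOWED_ATTRS.get(tag, set())
        attr_text = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in allowed and value is not None
        )
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.parts.append(escape(data, quote=False))


def clean_html(source):
    """Return `source` with unknown tags and attributes discarded."""
    cleaner = MarkupCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.parts)
```

Unknown tags such as `font` are dropped while their text is kept, which matches the "not very faithful but still useful" goal.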
And we need a very simple conversion tool: plain text ---> html.
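A plain text to html step could be as small as the following sketch. It is not taken from email2pdf; the function name `text_to_html` and the rule "blank line means new paragraph" are assumptions made here for illustration.

```python
from html import escape


def text_to_html(text, title="(no subject)"):
    """Wrap plain text in a minimal HTML page, one <p> per blank-line-separated block."""
    blocks = [b for b in text.replace("\r\n", "\n").split("\n\n") if b.strip()]
    body = "\n".join(
        # Escape first, then turn the remaining single newlines into <br>.
        "<p>" + escape(block.strip()).replace("\n", "<br>\n") + "</p>"
        for block in blocks
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>\n"
    )
```

Escaping before inserting the `<br>` tags matters: characters such as `<` in the email body would otherwise be read as markup.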
What should we do with attached PDFs? Shall we convert them or discard them? (https://github.com/iainb/pdf2epub, or ghostscript: $ gs -dNOPAUSE -sDEVICE=jpeg -r144 -sOutputFile=p%03d.jpg file.pdf). Converting them to images seems simpler and more universal; however, in many cases the generated files would be less readable.
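If the ghostscript route were chosen, the command above could be wrapped from Python roughly like this. The function names are hypothetical, and `-dBATCH` is an addition not present in the command above: it makes `gs` exit after the last page instead of waiting at its interactive prompt.

```python
import subprocess


def pdf_to_jpeg_command(pdf_path, output_pattern="p%03d.jpg", dpi=144):
    """Build the gs argument list for rendering each PDF page to a JPEG."""
    return [
        "gs",
        "-dNOPAUSE",
        "-dBATCH",  # assumption added here: quit when done rather than prompt
        "-sDEVICE=jpeg",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        pdf_path,
    ]


def convert_pdf_to_jpegs(pdf_path, **options):
    """Run ghostscript; raises CalledProcessError if gs reports a failure."""
    subprocess.run(pdf_to_jpeg_command(pdf_path, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with attachment names that contain spaces.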
Verlonis, OK but at the moment email2pdf mostly acts as an MDA - in other words, it acts on one email at a time. The multiple .html outputs would come from multiple emails, potentially.
Let me propose an alternative - how about:
1. Extend email2pdf to produce an email2html command - either a brand new command or via command-line options. This could be useful standalone. Note that email2pdf already contains code to convert plain text emails to html anyway, since that's required by wkhtmltopdf, so that shouldn't be tricky. Presumably attachments (including images) would be detached and saved by this too.
2. Provide a 'clean-up' script that further refines those .html files to remove surplus markup and create the .json file that ebookmaker needs.
(1) is definitely much easier than (2) to implement.
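For a feel of what the first item involves, here is a rough standalone sketch using only the Python standard library. It is not email2pdf's actual code; `email_to_html` and its parameters are hypothetical names, and inline images referenced by Content-ID are deliberately not handled.

```python
from email import message_from_bytes, policy
from html import escape
from pathlib import Path


def email_to_html(raw_bytes, out_dir, stem="message"):
    """Write one email's body as <stem>.html and save its attachments alongside."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    msg = message_from_bytes(raw_bytes, policy=policy.default)

    # Prefer the HTML part; fall back to plain text wrapped in <pre>.
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is None:
        html = "<html><body></body></html>"
    elif body.get_content_type() == "text/html":
        html = body.get_content()
    else:
        html = "<html><body><pre>" + escape(body.get_content()) + "</pre></body></html>"

    html_path = out_dir / f"{stem}.html"
    html_path.write_text(html, encoding="utf-8")

    saved = []
    for index, part in enumerate(msg.iter_attachments()):
        # Keep only the final path component so a crafted name cannot escape out_dir.
        name = Path(part.get_filename() or f"attachment-{index}").name
        target = out_dir / name
        target.write_bytes(part.get_payload(decode=True) or b"")
        saved.append(target)
    return html_path, saved
```

Because it handles one message at a time, a function like this would still fit the MDA-style usage described above.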
Yes, exactly. Best practice in Unix is to have small CLI programs that each do one restricted job well and then pass the result to the next specialised tool.
You already have getmail in your chain.
getmail ---> bunch of e-mails (.eml) in a directory
bunch of e-mails in a directory ---> bunch of .html files
At this point the workflow forks:
- clean up the .html files, create a .json file (one for each file, or one for several files at once), create an epub or epubs
OR
- create pdf file/files (? - consider this as an enhancement, but it is not really needed now, since at this point you can still merge your pdf files with gs or similar; only for consistency)
Regarding the clean-up script, maybe you would like to have a look at http://www.html-tidy.org/ or http://home.ccil.org/~cowan/XML/tagsoup/, and http://www.jedisaber.com/eBooks/formatsource.shtml (more general).
Hi Andrew, proposed workflow:
```
getmail
   |
   v
[.eml .eml .eml ...]
   |
   +-- email2pdf -p --> [.pdf .pdf .pdf ...] -- ghostscript --> [.pdf (merged)]
   |
   +-- email2pdf -h
       + ghostscript -sDEVICE=jpeg
       + maybe html-tidy or similar
          |
          v
       [.html files + .jpg (attachment) files]
          |
          +-- Calibre or KindleGen or ... --> [.mobi .mobi .mobi ...]
          |
          +-- ebookmaker or Calibre or ... --> [.epub .epub .epub ...]
          |
          |   script or simple program
          v
       [.html (central file) + .html files + .jpg (attachment) files]
          |
          +-- Calibre or KindleGen or ... --> [.mobi (merged)]
          |
          +-- ebookmaker or Calibre or ... --> [.epub (merged)]
```
Thanks for suggesting this and for taking the time to diagram. I like this. It also means only minimal changes to email2pdf to support .html + attachment output as well, which I like. I think I would consider any merging/converting script to be separate. If I get time, I'll look at adding the first item. I may not get to this for some time, though!
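As a rough idea of what such a separate merging script might do, the sketch below writes a "central" index file that links to every per-email .html file in a directory. The name `build_index`, the `index.html` default and the page layout are assumptions; whether Calibre, KindleGen or ebookmaker would accept this file as input has not been checked here.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_index(html_dir, title="Emails", index_name="index.html"):
    """Create a central HTML file linking to all other .html files in html_dir."""
    html_dir = Path(html_dir)
    # Sort by file name so the order is stable; skip a previously built index.
    pages = sorted(p for p in html_dir.glob("*.html") if p.name != index_name)
    items = "\n".join(
        f'<li><a href="{escape(quote(p.name), quote=True)}">{escape(p.stem)}</a></li>'
        for p in pages
    )
    index = html_dir / index_name
    index.write_text(
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        f"<body>\n<h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n",
        encoding="utf-8",
    )
    return index
```

Sorting by file name is only a placeholder; grouping by day, sender or recipient would need the email headers, which this sketch does not read.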