paperless-ng
Mail consumer: Consume emails without any attachments
I was trying out the mail consumer and sent myself an email with a PDF attachment to a dedicated email address that I set up just for Paperless-ng. Worked great, btw. Then I went to a home improvement store that I knew offered emailed receipts. I made a purchase and got the email, but the receipt was an HTML email with no attachments, so nothing was consumed.
Maybe with an additional library, could something like "HTML to Plain Text" and/or "HTML to PDF" be offered as options in the mail rules?
Example: if an email has no attachment, convert it to a plain text file or PDF, and use filters so that this only applies to certain emails.
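To make the request concrete, here is a minimal sketch of the "no attachment, so fall back to the body" check using only the Python standard library. It is not how paperless-ng's mail consumer works; the function name `body_text_if_no_attachment` and the helper class are hypothetical, and the HTML-to-text step is deliberately crude (it keeps visible text and drops `<script>`/`<style>` content).

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def body_text_if_no_attachment(raw_message: bytes):
    """Return the body as plain text, or None if the mail has attachments.

    Hypothetical helper: the name and return convention are assumptions
    made for illustration only.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    if any(True for _ in msg.iter_attachments()):
        return None  # leave mails with attachments to the normal consumer
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return None
    content = body.get_content()
    if body.get_content_type() == "text/html":
        parser = _TextExtractor()
        parser.feed(content)
        parser.close()
        return parser.text()
    return content
```

The returned text could then be written to a `.txt` file in the consume directory; producing a PDF instead would need an external converter, which this sketch leaves out.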
It seems that Apache Tika (if enabled in paperless-ng) should be able to handle this format (https://tika.apache.org/1.3/formats.html).
Apache Tika is only configured to handle Office documents right now.
I have the same wish with a similar scenario: An email often contains the additional information (e.g. tracking number) only in the body of the e-mail, but not in a PDF file. If I search for the tracking number, I can't find anything. Also, if I know the invoice number, I don't know which tracking number it belongs to. So I always have to keep the emails.
It would be great if the text from the email were archived and indexed along with the PDF file, so that the search would work in both directions. Otherwise I have to convert the emails to PDF separately and archive them some other way.
I've got the following idea for that, not sure yet if that will work.
See #274 for custom metadata. Especially my comment over here: https://github.com/jonaswinkler/paperless-ng/issues/274#issuecomment-766062277. My idea would be that you could use that to create a custom "mail content" field, and then tell paperless to store the mail content in that field when it adds documents from mails.
The custom metadata feature will ensure that this field is searchable and the document detail editor will show this field.
What do you think about that?
I see two different tasks here: presenting the email and providing additional metadata. I think the application has great potential: it could also be used for email archiving (there is high demand for that, too). Therefore I think that emails should be displayed and archived separately. The basis is already there: Gotenberg can also convert HTML (EML).
My thoughts on saving the metadata are:
- Generating the metadata and
- Saving the data in xmp tags in PDF (e.g. as a comment field or as user-defined fields).
- Then transfer from PDF to the database. The metadata from the email header (e.g. the email ID) can be retrieved and used. As a result, all documents from this email are referenced and linked. These are therefore found as a bundle via the index.
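The linking idea in the last step can be sketched as follows: every document derived from one email (the body and each attachment) carries the same `Message-ID`, so a search for that ID returns the whole bundle. This is only an illustration with the Python standard library; the function `link_records` and the record layout are hypothetical, and writing the ID into XMP tags or the paperless-ng database is not shown.

```python
from email import message_from_bytes, policy


def link_records(raw_message: bytes):
    """Return one record per document derived from the email.

    All records share the email's Message-ID so they can later be found
    as a bundle. The dict layout is an assumption for illustration.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    message_id = str(msg["Message-ID"] or "").strip()

    # The mail body itself is treated as one document.
    records = [{"message_id": message_id, "kind": "body", "filename": None}]

    # Each attachment becomes another document with the same key.
    for part in msg.iter_attachments():
        records.append(
            {
                "message_id": message_id,
                "kind": "attachment",
                "filename": part.get_filename(),
            }
        )
    return records
```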
Why not just archive an email as HTML, as it is?
For example, if I want to archive all the PayPal receipts I get by mail, we could just store the HTML file (MIME type text/html) and display it.
Via custom metadata one could save things like the sum paid.
I am also looking for a way to store the e-mail body. The GUI of paperless-ng is way better than do_c_spell imho, but do_c_spell has this nice email archive feature.
I'm also looking for something like this and stumbled across a service that does a decent job of converting emails to PDF: you just forward the email to [email protected] and it sends it back as a PDF. They've open-sourced it here: https://github.com/thunderkeys/pdfconvertme-public, though it hasn't been updated in a while. It would be great if something like this could be integrated with paperless. For my use case it would only need a flag like "Convert emails to PDF where there's no attachment".
I made this docker image to achieve what I think you wish to do (or at least a work around). Please try it out and let me know if I can improve it in any way. https://github.com/rob-luke/emails-html-to-pdf
I have an IMAP folder called Paperless that I drop emails into when I want them moved to paperless-ng. Paperless sorts and deletes the emails that contain PDFs. My script grabs unread emails without attachments, converts them to PDF, then emails them to the address you wish. A good trick is to use the + notation with your email account. So I send them to [email protected] and that automatically puts the converted PDF in my Paperless IMAP folder; then paperless grabs it correctly and processes it as usual. Feedback is welcome.
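The "+ notation" step of this workflow can be sketched like this. It is not taken from the linked repository: the helper names, the `example.com` addresses, and the fixed `email.pdf` filename are all assumptions, and the HTML-to-PDF conversion and the actual sending are left out.

```python
from email.message import EmailMessage


def plus_address(address: str, tag: str) -> str:
    """Insert a +tag before the @, e.g. user@host -> user+tag@host."""
    local, _, domain = address.rpartition("@")
    return f"{local}+{tag}@{domain}"


def build_pdf_mail(pdf_bytes: bytes, subject: str, sender: str, recipient: str) -> EmailMessage:
    """Wrap an already converted PDF in a mail that paperless can consume.

    Hypothetical helper: the conversion itself is assumed to have
    happened elsewhere and is not part of this sketch.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("Converted email attached as PDF.")
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename="email.pdf"
    )
    return msg
```

A server-side filter on the tagged address is what moves the message into the Paperless folder; whether plus addressing is supported depends on the mail provider.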
Thanks for this, it works fairly well - I've left an issue in your repo :)
I think that even if there are attachments, the mail body often contains important information. So it would be nice to always extract the body as well (maybe controlled by a rule/setting).
This would be a huge boost to the project. Right now, I have to print important emails to PDF and save them to my consume directory. Just moving them to my Paperless folder instead would save a ridiculous amount of time and effort.