paperless Support for Office-Formats with Apache-Tika

Motivation and Description

This pull request adds support for office formats (such as odt, ods, docx, etc.) to paperless. Apache-Tika is used to process these files and extract their content:

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
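For illustration only, here is a minimal sketch of how text extraction against a running Tika server could look. It assumes the server's REST endpoint `PUT /tika` on the default port 9998. The environment variable `PAPERLESS_TIKA_URL` and the function names are hypothetical and are not taken from this PR.

```python
import os
import urllib.request

# Hypothetical setting name (an assumption, not part of this PR).
# A stock Tika server listens on port 9998 by default.
TIKA_URL = os.getenv("PAPERLESS_TIKA_URL", "http://localhost:9998")


def build_tika_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika's /tika endpoint for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("/consume/test.odt")` would then return the document text, provided a Tika server is reachable at the configured URL. Reading the base URL from the environment is one possible way to keep it configurable for non-Docker installations.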

I started implementing this feature because I have office files stored on an NFS share that are not part of my paperless instance. My goal is to have all my personal documents in one place, accessible via full-text search. This place should be paperless!

This is a work-in-progress PR, as I would like to receive initial feedback from the community before I start polishing it for merging. In particular, I would like to know:

What do you think about the current way of including Apache-Tika into paperless?
Do you have any high-level remarks regarding this first working prototype?
Is this feature something people might find useful?
Is there a chance of getting this feature merged?
Any other comments?

This is a first working prototype - the feature is not complete and still under development (see open tasks).

In the short term, I would like to handle ms-office and open-office formats with this parser. In the long term, however, Apache-Tika might replace even the PDF/OCR parser since Tika also comes with support for tesseract. I haven't decided on the latter yet and the change will definitely not be part of this PR.

Open Questions

I had to make two changes in order to get the original master working. I wonder why nobody else has problems with the Docker setup of the current master.

1. `CORS_ORIGIN_WHITELIST` issue

I had to add the protocol to the value in CORS_ORIGIN_WHITELIST:

CORS_ORIGIN_WHITELIST = tuple(os.getenv("PAPERLESS_CORS_ALLOWED_HOSTS", "http://localhost:8080").split(","))

Otherwise, I get the following error:

SystemCheckError: System check identified some issues:
consumer_1     |
consumer_1     | ERRORS:
consumer_1     | ?: (corsheaders.E013) Origin 'localhost:8080' in CORS_ORIGIN_WHITELIST is missing  scheme or netloc
consumer_1     | 	HINT: Add a scheme (e.g. https://) or netloc (e.g. example.com).

Is this an issue for someone else too?
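As an alternative to changing only the default value, the setting could tolerate entries without a scheme. The following is a minimal sketch, assuming that `http` is an acceptable default scheme. The helper names `with_scheme` and `cors_whitelist` are hypothetical and not part of paperless.

```python
import os


def with_scheme(origin: str, default_scheme: str = "http") -> str:
    """Return the origin unchanged if it has a scheme, else prefix default_scheme."""
    origin = origin.strip()
    if "://" in origin:
        return origin
    return f"{default_scheme}://{origin}"


def cors_whitelist(raw: str) -> tuple:
    """Split a comma-separated host list and make sure every entry has a scheme."""
    return tuple(with_scheme(entry) for entry in raw.split(",") if entry.strip())


CORS_ORIGIN_WHITELIST = cors_whitelist(
    os.getenv("PAPERLESS_CORS_ALLOWED_HOSTS", "http://localhost:8080")
)
```

With this, an existing configuration value such as `localhost:8080` would become `http://localhost:8080` and should no longer trigger the `corsheaders.E013` check.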

2. Missing Fonts Issue

For some reason, I had to add msttcorefonts-installer and fontconfig to the Dockerfile. Otherwise, I get the following error:

Parsers available: TikaDocumentParser
consumer_1     | Consuming /consume/test.odt
consumer_1     | convert: unable to read font `helvetica' @ error/annotate.c/RenderFreetype/1384.
consumer_1     | convert: no images defined `/tmp/paperless/paperless-5dfrf94n/tx.png' @ error/convert.c/ConvertImageCommand/3273.
consumer_1     | PARSE FAILURE for /consume/test.odt: Convert failed at ('convert', '-background none', '-fill', 'black', '-pointsize', '12', '-border 4 -bordercolor none', '-size ', '492x639', ' caption:"', '\n\n\n\n\n\n\n\nThis is a document\n', '" ', '/tmp/paperless/paperless-5dfrf94n/tx.png')
consumer_1     | Parsers available: TikaDocumentParser
consumer_1     | Consuming /consume/test.odt
consumer_1     | convert: unable to read font `helvetica' @ error/annotate.c/RenderFreetype/1384.
consumer_1     | convert: no images defined `/tmp/paperless/paperless-ys5cdlyn/tx.png' @ error/convert.c/ConvertImageCommand/3273.
consumer_1     | PARSE FAILURE for /consume/test.odt: Convert failed at ('convert', '-background none', '-fill', 'black', '-pointsize', '12', '-border 4 -bordercolor none', '-size ', '492x639', ' caption:"', '\n\n\n\n\n\n\n\nThis is a document\n', '" ', '/tmp/paperless/paperless-ys5cdlyn/tx.png')

Does anybody else see this error on master? If not, I'm happy to drop the change, since it would then seem to be a problem specific to my local machine setup.
Otherwise, I can move the change into its own commit and combine it with the other RUN command.

Open Tasks

[x] Add ms-office file formats to TikaDocumentParser
[ ] Add tika to non-Docker installation (see requirements and setup)
- I probably have to make the tika URL configurable
[x] Add Unit-Tests
[x] Add Exception Handling
[ ] Issue: UI text field contains a lot of newlines after odt import
[x] Add documentation
...
[x] Comply with PEP 8 and additional style guides (see guidelines)
[ ] Rebase and Squash
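
For the open newline issue in the task list above, one possible approach is to normalise the extracted text before it is stored. This is a minimal sketch under that assumption. The function name `collapse_blank_lines` is hypothetical, and where such a step would hook into the parser is left open.

```python
import re


def collapse_blank_lines(text: str, max_blank: int = 1) -> str:
    """Trim surrounding whitespace and cap the number of consecutive blank lines."""
    # Strip trailing spaces on each line so whitespace-only lines count as blank.
    lines = [line.rstrip() for line in text.strip().splitlines()]
    cleaned = "\n".join(lines)
    # A run of (max_blank + 2) or more newlines means more than max_blank blank lines.
    pattern = r"\n{%d,}" % (max_blank + 2)
    return re.sub(pattern, "\n" * (max_blank + 1), cleaned)
```

Applied to the text from the log above, `collapse_blank_lines("\n\n\n\n\n\n\n\nThis is a document\n")` drops the leading newlines and keeps only the sentence.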

Jan 09 '20 21:01 Tooa

Hey,

I wonder if it would make sense to convert those files to PDF in the consumer.

Having all documents stored as a PDF file has some advantages IMO:

Documents are more portable: more systems can display a PDF document than a Word document
All fonts etc. are embedded, so your document looks the same even if you use a different system to view it
Makes it easier to switch to a different system in X years

What do you think?
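
To make the idea concrete, here is a rough sketch of such a conversion step. It assumes LibreOffice is installed and its `soffice` binary is on the PATH. The function names are hypothetical and nothing like this exists in the consumer yet.

```python
import subprocess
from pathlib import Path


def build_convert_command(source: str, out_dir: str) -> list:
    """Build a headless LibreOffice command that converts `source` to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", out_dir, source]


def convert_to_pdf(source: str, out_dir: str) -> Path:
    """Run the conversion and return the expected path of the resulting PDF."""
    subprocess.run(build_convert_command(source, out_dir), check=True, timeout=120)
    # LibreOffice names the output after the input file, with a .pdf extension.
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

The consumer could then pass the resulting PDF on to the existing PDF parser, while the original file is kept or discarded depending on the use-case.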

Jan 14 '20 14:01 bauerj

> I wonder if it would make sense to convert those files to PDF in the consumer.

I see your point. Let me describe my use-case in more detail:

I have an office document for, let's say, cancelling an insurance policy. I convert it to PDF, archive it in paperless and send the PDF via e-mail to the insurer. The next time I have a similar matter, I want to grab the original office document from paperless, change the details and send it as PDF to the next insurer. So the office documents function as templates of a sort. At the moment, I store these templates on a separate NFS share.

Probably the use-case is really specific to me and the NFS share solution might be sufficient. However, I don't like having different ways of accessing my documents. Maybe it's not worth the effort. I should have asked beforehand.

> All fonts etc. are embedded, so your document looks the same even if you use a different system to view it

Ah! That explains the problems with office documents and the fonts in the container. It was not necessary to provide fonts until now.

Jan 15 '20 08:01 Tooa

Just commenting on "1. CORS_ORIGIN_WHITELIST issue": I set `PAPERLESS_CORS_ALLOWED_HOSTS="http://localhost:8000"` directly in paperless.conf and it works.

Jan 22 '20 18:01 Whisprin

While I see your use-case, I don't see tika being a part of paperless. It's one thing to run a couple of binaries to parse a file, but it's imho another to run a Java-based web server next to a small app like paperless for this purpose. I'd rather see this implemented in another project, as a Django app people can include if they want to, maybe even under the hood of https://github.com/the-paperless-project.

That being said, some of your already proposed changes, like supporting more document types in the model, seem to be necessary. Others, like the tika dependency and the additional Django module for paperless_tika, should imho be moved. I'd be happy to have that project referenced in the README and docs.

That's my 2 cents.

May 14 '20 20:05 MasterofJOKers

@Tooa Hi there. I've been working on a fork of paperless for a while now. If you're still around and want to make this happen, maybe we can work something out. Contact me if you're interested in contributing and we can discuss the details.

Dec 10 '20 14:12 jonaswinkler
