paperless-ng
[BUG] Installed from script, and Gotenberg and Tika not working?
Hello, thanks for this great work!
I am new to paperless-ng and do not normally use Docker, so I may be doing something wrong.
My paperless works well, but when I try to import a .docx file, for example, it fails with: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
I installed using the script, and specified to enable Tika.
Gotenberg and Tika are running according to docker ps:
paperless@docker ~/paperless-ng$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8a20a33aefa6 jonaswinkler/paperless-ng:latest "/sbin/docker-entryp…" 2 minutes ago Up 2 minutes (healthy) 0.0.0.0:8000->8000/tcp, :::8000->8000/tcp paperless-webserver-1
b4d6babc41a2 postgres:13 "docker-entrypoint.s…" 24 minutes ago Up 23 minutes 5432/tcp paperless-db-1
ed4b52bfb5a4 redis:6.0 "docker-entrypoint.s…" 24 minutes ago Up 23 minutes 6379/tcp paperless-broker-1
d8bf67ec76c5 thecodingmachine/gotenberg "/usr/bin/tini -- go…" 24 minutes ago Up 23 minutes 3000/tcp paperless-gotenberg-1
85843f762418 apache/tika "/bin/sh -c 'exec ja…" 24 minutes ago Up 23 minutes 9998/tcp paperless-tika-1
paperless@docker ~/paperless-ng$ docker-compose up
[+] Running 5/5
_ Container paperless-tika-1 Running 0.0s
_ Container paperless-gotenberg-1 Running 0.0s
_ Container paperless-db-1 Running 0.0s
_ Container paperless-broker-1 Running 0.0s
_ Container paperless-webserver-1 Created 9.2s
Attaching to paperless-broker-1, paperless-db-1, paperless-gotenberg-1, paperless-tika-1, paperless-webserver-1
paperless-webserver-1 | Paperless-ng docker container starting...
paperless-webserver-1 | Creating directory /tmp/paperless
paperless-webserver-1 | Adjusting permissions of paperless files. This may take a while.
paperless-webserver-1 | Waiting for PostgreSQL to start...
paperless-webserver-1 | Apply database migrations...
paperless-webserver-1 | Operations to perform:
paperless-webserver-1 | Apply all migrations: admin, auth, authtoken, contenttypes, django_q, documents, paperless_mail, sessions
paperless-webserver-1 | Running migrations:
paperless-webserver-1 | No migrations to apply.
paperless-webserver-1 | Executing /usr/local/bin/supervisord -c /etc/supervisord.conf
paperless-webserver-1 | 2022-02-01 11:22:15,874 INFO Set uid to user 0 succeeded
paperless-webserver-1 | 2022-02-01 11:22:15,875 INFO supervisord started with pid 1
paperless-webserver-1 | 2022-02-01 11:22:16,877 INFO spawned: 'consumer' with pid 36
paperless-webserver-1 | 2022-02-01 11:22:16,879 INFO spawned: 'gunicorn' with pid 37
paperless-webserver-1 | 2022-02-01 11:22:16,881 INFO spawned: 'scheduler' with pid 38
paperless-webserver-1 | [2022-02-01 12:22:17 +0100] [37] [INFO] Starting gunicorn 20.1.0
paperless-webserver-1 | [2022-02-01 12:22:17 +0100] [37] [INFO] Listening at: http://0.0.0.0:8000 (37)
paperless-webserver-1 | [2022-02-01 12:22:17 +0100] [37] [INFO] Using worker: paperless.workers.ConfigurableWorker
paperless-webserver-1 | [2022-02-01 12:22:17 +0100] [37] [INFO] Server is ready. Spawning workers
paperless-webserver-1 | 12:22:17 [Q] INFO Q Cluster romeo-idaho-nine-diet starting.
paperless-webserver-1 | [2022-02-01 12:22:17,742] [INFO] [paperless.management.consumer] Using inotify to watch directory for changes: /usr/src/paperless/src/../consume
paperless-webserver-1 | 12:22:17 [Q] INFO Process-1:1 ready for work at 61
paperless-webserver-1 | 12:22:17 [Q] INFO Process-1:2 ready for work at 62
paperless-webserver-1 | 12:22:17 [Q] INFO Process-1:3 monitoring at 63
paperless-webserver-1 | 12:22:17 [Q] INFO Process-1 guarding cluster romeo-idaho-nine-diet
paperless-webserver-1 | 12:22:17 [Q] INFO Process-1:4 pushing tasks at 64
paperless-webserver-1 | 12:22:17 [Q] INFO Q Cluster romeo-idaho-nine-diet running.
paperless-webserver-1 | 2022-02-01 11:22:18,836 INFO success: consumer entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1 | 2022-02-01 11:22:18,836 INFO success: gunicorn entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1 | 2022-02-01 11:22:18,836 INFO success: scheduler entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
paperless-webserver-1 | 12:22:47 [Q] INFO Enqueued 1
paperless-webserver-1 | 12:22:47 [Q] INFO Process-1 created a task from schedule [Check all e-mail accounts]
paperless-webserver-1 | 12:22:47 [Q] INFO Process-1:1 processing [lithium-edward-diet-utah]
paperless-webserver-1 | /usr/local/lib/python3.9/site-packages/imap_tools/mailbox.py:214: UserWarning: seen method are deprecated and will be removed soon, use flag method instead
paperless-webserver-1 | warnings.warn('seen method are deprecated and will be removed soon, use flag method instead')
paperless-webserver-1 | 12:22:50 [Q] INFO Process-1:1 stopped doing work
paperless-webserver-1 | 12:22:50 [Q] INFO Processed [lithium-edward-diet-utah]
paperless-webserver-1 | 12:22:50 [Q] INFO recycled worker Process-1:1
paperless-webserver-1 | 12:22:50 [Q] INFO Process-1:5 ready for work at 77
paperless-broker-1 | 1:M 01 Feb 2022 11:23:06.030 * 100 changes in 300 seconds. Saving...
paperless-broker-1 | 1:M 01 Feb 2022 11:23:06.031 * Background saving started by pid 20
paperless-broker-1 | 20:C 01 Feb 2022 11:23:06.044 * DB saved on disk
paperless-broker-1 | 20:C 01 Feb 2022 11:23:06.044 * RDB: 0 MB of memory used by copy-on-write
paperless-broker-1 | 1:M 01 Feb 2022 11:23:06.132 * Background saving terminated with success
paperless-webserver-1 | [2022-02-01 12:24:01,094] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1 | [2022-02-01 12:24:01,184] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1 | [2022-02-01 12:24:04,271] [WARNING] [django.security.SuspiciousSession] Session data corrupted
paperless-webserver-1 | 12:24:14 [Q] INFO Enqueued 1
paperless-webserver-1 | 12:24:14 [Q] INFO Process-1:2 processing [Dear Facilitators.docx]
paperless-webserver-1 | [2022-02-01 12:24:15,000] [INFO] [paperless.consumer] Consuming Dear Facilitators.docx
paperless-webserver-1 | [2022-02-01 12:24:15,008] [INFO] [paperless.parsing.tika] Sending /tmp/paperless/paperless-upload-zf1ilcyo to Tika server
paperless-tika-1 | INFO [qtp2128195220-23] 11:24:15,195 org.apache.tika.server.resource.RecursiveMetadataResource rmeta/text (autodetecting type)
paperless-webserver-1 | [2022-02-01 12:24:15,631] [INFO] [paperless.parsing.tika] Converting /tmp/paperless/paperless-upload-zf1ilcyo to PDF as /tmp/paperless/paperless-agiq8vzt/convert.pdf
paperless-gotenberg-1 | {"level":"error","ts":1643714655.6423903,"logger":"api","msg":"code=404, message=Not Found","trace":"8662f7e2-1acd-4f7b-bfe0-fd235b6c1f59","remote_ip":"172.23.0.6","host":"gotenberg:3000","uri":"/convert/office","method":"POST","path":"/convert/office","referer":"","user_agent":"python-requests/2.26.0","status":404,"latency":2408520,"latency_human":"2.40852ms","bytes_in":31351,"bytes_out":9}
paperless-webserver-1 | [2022-02-01 12:24:15,647] [ERROR] [paperless.consumer] Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1 | Traceback (most recent call last):
paperless-webserver-1 | File "/usr/src/paperless/src/paperless_tika/parsers.py", line 79, in convert_to_pdf
paperless-webserver-1 | response.raise_for_status() # ensure we notice bad responses
paperless-webserver-1 | File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
paperless-webserver-1 | raise HTTPError(http_error_msg, response=self)
paperless-webserver-1 | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1 |
paperless-webserver-1 | During handling of the above exception, another exception occurred:
paperless-webserver-1 |
paperless-webserver-1 | Traceback (most recent call last):
paperless-webserver-1 | File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
paperless-webserver-1 | document_parser.parse(self.path, mime_type, self.filename)
paperless-webserver-1 | File "/usr/src/paperless/src/paperless_tika/parsers.py", line 65, in parse
paperless-webserver-1 | self.archive_path = self.convert_to_pdf(document_path, file_name)
paperless-webserver-1 | File "/usr/src/paperless/src/paperless_tika/parsers.py", line 81, in convert_to_pdf
paperless-webserver-1 | raise ParseError(
paperless-webserver-1 | documents.parsers.ParseError: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1 | 12:24:15 [Q] INFO Process-1:2 stopped doing work
paperless-webserver-1 | 12:24:15 [Q] INFO recycled worker Process-1:2
paperless-webserver-1 | 12:24:15 [Q] INFO Process-1:6 ready for work at 123
paperless-webserver-1 | 12:24:15 [Q] ERROR Failed [Dear Facilitators.docx] - Dear Facilitators.docx: Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office : Traceback (most recent call last):
paperless-webserver-1 | File "/usr/src/paperless/src/paperless_tika/parsers.py", line 79, in convert_to_pdf
paperless-webserver-1 | response.raise_for_status() # ensure we notice bad responses
paperless-webserver-1 | File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
paperless-webserver-1 | raise HTTPError(http_error_msg, response=self)
paperless-webserver-1 | requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1 |
paperless-webserver-1 | During handling of the above exception, another exception occurred:
paperless-webserver-1 |
paperless-webserver-1 | Traceback (most recent call last):
paperless-webserver-1 | File "/usr/local/lib/python3.9/site-packages/asgiref/sync.py", line 288, in main_wrap
paperless-webserver-1 | raise exc_info[1]
paperless-webserver-1 | File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
paperless-webserver-1 | document_parser.parse(self.path, mime_type, self.filename)
paperless-webserver-1 | File "/usr/src/paperless/src/paperless_tika/parsers.py", line 65, in parse
paperless-webserver-1 | self.archive_path = self.convert_to_pdf(document_path, file_name)
paperless-webserver-1 | File "/usr/src/paperless/src/paperless_tika/parsers.py", line 81, in convert_to_pdf
paperless-webserver-1 | raise ParseError(
paperless-webserver-1 | documents.parsers.ParseError: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1 |
paperless-webserver-1 | During handling of the above exception, another exception occurred:
paperless-webserver-1 |
paperless-webserver-1 | Traceback (most recent call last):
paperless-webserver-1 | File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
paperless-webserver-1 | res = f(*task["args"], **task["kwargs"])
paperless-webserver-1 | File "/usr/src/paperless/src/documents/tasks.py", line 74, in consume_file
paperless-webserver-1 | document = Consumer().try_consume_file(
paperless-webserver-1 | File "/usr/src/paperless/src/documents/consumer.py", line 266, in try_consume_file
paperless-webserver-1 | self._fail(
paperless-webserver-1 | File "/usr/src/paperless/src/documents/consumer.py", line 70, in _fail
paperless-webserver-1 | raise ConsumerError(f"{self.filename}: {log_message or message}")
paperless-webserver-1 | documents.consumer.ConsumerError: Dear Facilitators.docx: Error while consuming document Dear Facilitators.docx: Error while converting document to PDF: 404 Client Error: Not Found for url: http://gotenberg:3000/convert/office
paperless-webserver-1 |
paperless-webserver-1 | [2022-02-01 12:24:17 +0100] [37] [CRITICAL] WORKER TIMEOUT (pid:40)
paperless-webserver-1 | [2022-02-01 12:24:17 +0100] [37] [WARNING] Worker with pid 40 was terminated due to signal 6
paperless@docker ~/paperless-ng$ cat docker-compose.yml
# docker-compose file for running paperless from the Docker Hub.
# This file contains everything paperless needs to run.
# Paperless supports amd64, arm and arm64 hardware.
#
# All compose files of paperless configure paperless in the following way:
#
# - Paperless is (re)started on system boot, if it was running before shutdown.
# - Docker volumes for storing data are managed by Docker.
# - Folders for importing and exporting files are created in the same directory
# as this file and mounted to the correct folders inside the container.
# - Paperless listens on port 8000.
#
# In addition to that, this docker-compose file adds the following optional
# configurations:
#
# - Instead of SQLite (default), PostgreSQL is used as the database server.
# - Apache Tika and Gotenberg servers are started with paperless and paperless
# is configured to use these services. These provide support for consuming
# Office documents (Word, Excel, Power Point and their LibreOffice counter-
# parts.
#
# To install and update paperless with this file, do the following:
#
# - Copy this file as 'docker-compose.yml' and the files 'docker-compose.env'
# and '.env' into a folder.
# - Run 'docker-compose pull'.
# - Run 'docker-compose run --rm webserver createsuperuser' to create a user.
# - Run 'docker-compose up -d'.
#
# For more extensive installation and update instructions, refer to the
# documentation.
version: "3.4"
services:
  broker:
    image: redis:6.0
    restart: unless-stopped
  db:
    image: postgres:13
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: paperless
  webserver:
    image: jonaswinkler/paperless-ng:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
      - gotenberg
      - tika
    ports:
      - 8000:8000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000"]
      interval: 30s
      timeout: 10s
      retries: 5
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - /home/paperless/paperless-ng/consume:/usr/src/paperless/consume
    env_file: docker-compose.env
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
  gotenberg:
    image: thecodingmachine/gotenberg
    restart: unless-stopped
    environment:
      DISABLE_GOOGLE_CHROME: 1
  tika:
    image: apache/tika
    restart: unless-stopped
volumes:
  data:
  media:
  pgdata:
I have/had the same issue. It looks as if Gotenberg has updated its API. An image that does work is thecodingmachine/gotenberg:6.0.0. I'm unsure when the API was updated (it appears they're not respecting semver?), as 6.4.4 did not work with paperless-ng either.
So the workaround would be to use the 6.0.0 tag.
Paperless-ng will have to be updated to use the newer API, which seems to all be under localhost:3000/forms
https://gotenberg.dev/docs/modules/libreoffice
https://github.com/jonaswinkler/paperless-ng/commit/2dcacaee147abfdccdca4e20262bae749c60be97
This commit actually fixes it. Just needs to be merged from dev to master and then a new docker image built and pushed.
I'd use the workaround until the maintainers push it to master.
As a workaround you can use this:
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://paperless-gotenberg:3000/forms/libreoffice/convert#
The workaround, if your setup is vanilla, goes in docker-compose.yml:
# PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
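For anyone wondering why the trailing `#` matters: judging by the error URLs in this thread, paperless appends a route such as `/convert/office` to the configured endpoint. Ending the endpoint with `#` turns that appended route into a URL fragment, and HTTP clients do not send fragments to the server. Below is a minimal sketch of that effect using only the standard library; `request_target` is a hypothetical helper for illustration, not part of paperless-ng.

```python
from urllib.parse import urlsplit


def request_target(endpoint: str, suffix: str) -> str:
    """Return the path (plus query) a client would send for endpoint + suffix.

    Hypothetical helper: it only mimics the string concatenation that the
    error URLs in this thread suggest, then drops the fragment.
    """
    parts = urlsplit(endpoint + suffix)
    return parts.path + (f"?{parts.query}" if parts.query else "")


# Default endpoint: the old route reaches Gotenberg unchanged.
default_target = request_target("http://gotenberg:3000", "/convert/office")

# Workaround endpoint: the appended route lands after '#' and is dropped.
workaround_target = request_target(
    "http://gotenberg:3000/forms/libreoffice/convert#", "/convert/office"
)

print(default_target)
print(workaround_target)
```

The same reasoning applies when a newer paperless build appends `/forms/libreoffice/convert` itself: the fragment is still dropped, so the request goes to the path before the `#`.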
Thanks, that was really helpful!
Unfortunately, this workaround didn't work for me.
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
is set.
When I try to import a Word document, I get this error:
Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert#/forms/libreoffice/convert
I finally got Gotenberg to work. The issue is that, for whatever reason, the container isn't publishing a network port.
Going into Portainer and manually publishing the network port (host 3000 to container 3000) resolved the issue of Gotenberg not being available. Alternatively, adding the lines
    ports:
      - 3000:3000
to the gotenberg service in the docker-compose file works.
The default setting
PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
should be used.
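A quick way to check whether the Gotenberg port is reachable at all is a plain TCP connection test. This is a minimal sketch, not a paperless-ng tool; `is_port_open` is a hypothetical helper. Note the assumption about where it runs: from the Docker host, `localhost:3000` only answers if the port is published as described above, whereas containers on the same Compose network can normally reach each other by service name (here `gotenberg`) without any published port.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # From the Docker host (requires the published port):
    print("localhost:3000 reachable:", is_port_open("localhost", 3000))
    # From inside the webserver container, use the service name instead:
    # is_port_open("gotenberg", 3000)
```

An open port only shows that something is listening; it does not confirm that the conversion route exists or that LibreOffice inside the container is healthy.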
Possible solutions I already tried:
Changed the endpoint to PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000/forms/libreoffice/convert#
This resulted in the error message:
Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert#/forms/libreoffice/convert
So I changed the endpoint back to the default and added the ports as @iplaughlin wrote. Error message:
Error while converting document to PDF: 503 Server Error: Service Unavailable for url: http://gotenberg:3000/forms/libreoffice/convert
Gotenberg log:
{
"level": "error",
"ts": 1650366380.1664767,
"logger": "api",
"msg": "convert to PDF: lock long-running LibreOffice listener: acquire LibreOffice listener lock: context deadline exceeded",
"trace": "52da9339-8761-4dca-bb2e-8ca269ce27ea",
"remote_ip": "172.18.0.6",
"host": "gotenberg:3000",
"uri": "/forms/libreoffice/convert",
"method": "POST",
"path": "/forms/libreoffice/convert",
"referer": "",
"user_agent": "python-requests/2.27.1",
"status": 503,
"latency": 30002593316,
"latency_human": "30.002593316s",
"bytes_in": 17375,
"bytes_out": 19
}
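When comparing Gotenberg log entries like the one above, it can help to pull out just the status, path, and duration. The sketch below is a hypothetical helper, not part of Gotenberg or paperless-ng. It treats `latency` as nanoseconds, which is what the entry's own `latency_human` value suggests, and the sample line is a trimmed copy of the entry above.

```python
import json


def summarize_gotenberg_log(line: str) -> dict:
    """Extract the fields most useful for triage from one JSON log entry."""
    entry = json.loads(line)
    return {
        "status": entry["status"],
        "path": entry["path"],
        # Nanoseconds to seconds, matching "latency_human" in the entry.
        "seconds": entry["latency"] / 1e9,
        "msg": entry["msg"],
    }


# Trimmed copy of the log entry quoted above.
sample = (
    '{"level": "error", "logger": "api", '
    '"msg": "convert to PDF: lock long-running LibreOffice listener: '
    'acquire LibreOffice listener lock: context deadline exceeded", '
    '"path": "/forms/libreoffice/convert", "status": 503, '
    '"latency": 30002593316}'
)

summary = summarize_gotenberg_log(sample)
print(summary["status"], summary["path"], round(summary["seconds"], 1))
```

Read this way, the 503 arrives after roughly 30 seconds on the new `/forms/libreoffice/convert` route, and the message points at the LibreOffice listener inside Gotenberg rather than at a missing route like the earlier 404.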
@CodeBrauer - I ended up spinning up gotenberg in its own container, outside of paperless.
For my setup, only this worked:
image: gotenberg/gotenberg:7.4
(It seems it has to be a Gotenberg version higher than 7.)
Neither defining the ports nor changing the endpoint was successful.