
Multi-document upload with filename(s) containing Umlauts stuck

Open nekrondev opened this issue 1 year ago • 3 comments

Environment

Docspell: v0.42.0
Joex: using the Docker image found at ghcr.io/docspell/joex:latest (i.e. Debian-based image with fixed Tesseract)

Issue

Today I uploaded a multi-document ZIP archive into Docspell using the manual document upload feature, but document processing got stuck.

The ZIP archive contained:

[screenshot]

I uploaded the archive using the following manual upload settings:

[screenshot]

Processing of the archive crashes on the filename containing umlauts (ü). The error message from the job queue was: Malformed input or input contains unmappable characters: /tmp/docspell-zip-9930113460477770389/123456_2024_Anpassung der Ausführungsfristen bei Echtzeitüberweisungen_vom_2024.11.01_20241101101038.pdf.

[screenshot]

(Note: In the screenshots I only obscured a private number at the beginning of the PDF filename; you can replace it with a random six-digit number.)

I'm not sure whether this issue is present in earlier versions of Docspell, because this was the first time my bank sent me a document with umlauts in its filename.

Workaround

Uploading the plain PDF document directly, without the ZIP archive, seems to work fine.

Testdata

Here is a test archive containing a single PDF whose filename contains umlauts, zipped with Windows File Explorer. This time, however, Docspell reports the error invalid CEN header (bad entry name or comment). Zipping the file with 7-Zip produces the same error message, so the processing of ZIP entries with UTF-8 characters in their filenames seems to be broken.

[screenshot]

testdata.zip

nekrondev • Nov 01 '24 10:11

OK, I found a working solution for UTF-8 encoded ZIP archive filenames not being processed by the Ubuntu 24.04 based image. The docker-compose.yml file must contain an environment setting that tells the Java runtime to use UTF-8 instead of some ANSI encoding.

So you need to add this to the joex and restserver services:

environment:
      - LANG=C.utf8

With this setting, uploading the archive processes the filename correctly as UTF-8. However, the CEN header error still occurs when I upload the archive created with 7-Zip or Windows File Explorer. That might be a Windows ZIP/7-Zip related issue, since another test archive I created was processed fine.
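
For context, a sketch of where this setting might sit in docker-compose.yml. It assumes the two services are named restserver and joex and that their environment blocks use list syntax; the image names are the ones mentioned in this thread, and all other settings are omitted:

```yaml
services:
  restserver:
    image: ghcr.io/docspell/restserver:latest
    environment:
      # List syntax: each entry is a "- KEY=value" item
      - LANG=C.utf8

  joex:
    image: ghcr.io/docspell/joex:latest
    environment:
      - LANG=C.utf8
```

The containers have to be recreated (not just restarted) for a changed environment to take effect.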

nekrondev • Nov 01 '24 15:11

Could this be a variant of #2825 (fixed by #2853) ?

pschichtel • Jan 07 '25 22:01

Sorry if I am asking basic things, but I am not familiar with YAML syntax. I am working with the latest Docker image. For me, adding the lines like this (or with different indentation)

environment:
      - LANG=C.utf8

causes a YAML error.

How do I set this correctly in this block?

  restserver:
    image: ghcr.io/docspell/restserver:latest
    hostname: docspell-restserver
    container_name: docspell-restserver
    restart: unless-stopped
    ports:
      - "7880:7880"
    environment:
      TZ: 'Europe/Berlin'
      DOCSPELL_SERVER_INTERNAL__URL: 'http://docspell-restserver:7880'
      DOCSPELL_SERVER_ADMIN__ENDPOINT_SECRET: 'admin123'
      DOCSPELL_SERVER_AUTH_SERVER__SECRET: ''
      DOCSPELL_SERVER_BACKEND_JDBC_PASSWORD: 'dbpass'
      DOCSPELL_SERVER_BACKEND_JDBC_URL: 'jdbc:postgresql://db:5432/dbname'
      DOCSPELL_SERVER_BACKEND_JDBC_USER: 'dbuser'
      DOCSPELL_SERVER_BIND_ADDRESS: '0.0.0.0'
      DOCSPELL_SERVER_FULL__TEXT__SEARCH_ENABLED: 'true'
      DOCSPELL_SERVER_FULL__TEXT__SEARCH_SOLR_URL: 'http://docspell-solr:8983/solr/docspell'
      DOCSPELL_SERVER_INTEGRATION__ENDPOINT_ENABLED: 'true'
      DOCSPELL_SERVER_INTEGRATION__ENDPOINT_HTTP__HEADER_ENABLED: 'true'
      DOCSPELL_SERVER_INTEGRATION__ENDPOINT_HTTP__HEADER_HEADER__VALUE: 'integration-password123'
      DOCSPELL_SERVER_BACKEND_SIGNUP_MODE: 'open'
      DOCSPELL_SERVER_BACKEND_SIGNUP_NEW__INVITE__PASSWORD: ''
      DOCSPELL_SERVER_BACKEND_ADDONS_ENABLED: 'false'
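
A likely cause, assuming the rest of the file is valid: the environment block above uses YAML mapping syntax (KEY: value), while the suggested line uses list syntax (- KEY=value). Compose accepts either form, but YAML does not allow mixing them within one block. A minimal sketch of the same setting written as a mapping entry, with the existing entries left unchanged:

```yaml
  restserver:
    image: ghcr.io/docspell/restserver:latest
    environment:
      TZ: 'Europe/Berlin'
      # Added entry, written as a mapping key like its siblings
      # instead of as a "- LANG=C.utf8" list item
      LANG: 'C.utf8'
      DOCSPELL_SERVER_INTERNAL__URL: 'http://docspell-restserver:7880'
      # ... remaining DOCSPELL_SERVER_* entries stay as they are
```

The joex service would presumably need the same entry, in whichever syntax its own environment block already uses.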

falko-strenzke • Feb 01 '25 07:02