
Filename guessing mixing Correspondent and Title

Open sceiler opened this issue 5 years ago • 8 comments

Guesswork does not seem to work correctly for me when I let it consume files in the "DateTime - Correspondent - Title - tag.pdf" format. I get this console output:

Consuming /consume/21001231Z - Hello - World Test - !TODO.pdf
convert: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1749.
** Processing: /tmp/paperless/paperless-zxi0h09_/convert.png
378x34 pixels, 16 bits/pixel, grayscale
Input IDAT size = 11893 bytes
Input file size = 11950 bytes

Trying:
zc = 9 zm = 9 zs = 0 f = 0 IDAT size = 11408
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 11408

Selecting parameters:
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 11408

Output file: /tmp/paperless/paperless-zxi0h09_/optipng.png

Output IDAT size = 11408 bytes (485 bytes decrease)
Output file size = 11465 bytes (485 bytes = 4.06% decrease)

Checking document title for date
convert: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1749.
Processing sheet #1: /tmp/paperless/paperless-zxi0h09_/convert-0000.pnm -> /tmp/paperless/paperless-zxi0h09_/convert-0000.unpaper.pnm
[image2 @ 0x5558f05405e0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x5558f05405e0] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for deu
Unable to detect date for document
Completed
Document 21001231000000: Hello - World Test - !TODO consumption finished

Attached is my test file: 21001231Z - Hello - World Test - !TODO.pdf

[Screenshot: Capture]

The date is shown correctly (though in MM/dd/YYYY format, despite the German locale in my settings), but the correspondent and title got mixed together. Is this a bug, or am I doing something wrong here?

sceiler, Nov 28 '19 20:11

I forgot to set PAPERLESS_DATE_ORDER=YMD. I only had PAPERLESS_FILENAME_DATE_ORDER=YMD but it looks like you need both. A quick test confirmed that it is working. I will test it a bit more and if it works I will close this issue.

sceiler, Dec 05 '19 21:12

So, I tested a bit more, and it looks like there is a separate issue when the file name contains special characters. For instance, I have a tag "!TODO" because I want this tag to appear at the top of the list. This exclamation mark likely messes up the regex for guesswork because when I remove it from the filename it seems to work as expected.

sceiler, Dec 11 '19 11:12

> This exclamation mark likely messes up the regex for guesswork because when I remove it from the filename it seems to work as expected.

Can you elaborate on "messes up"? From what I can tell, it seems that the regex for tags is defined as [a-z0-9\-,]* (reference)
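A quick way to check what that character class accepts is sketched below. This is only an illustration: the helper name `tags_ok` is made up, and the use of `re.IGNORECASE` is an assumption about how the pattern is compiled, not something taken from the Paperless source.

```python
import re

# Tag character class as quoted above.
# Assumption: the pattern is applied case-insensitively.
TAGS = re.compile(r"[a-z0-9\-,]*", re.IGNORECASE)


def tags_ok(segment: str) -> bool:
    """Return True if every character of the segment is in the tag class."""
    return TAGS.fullmatch(segment) is not None


for segment in ("TODO", "todo,bills", "!TODO"):
    print(segment, tags_ok(segment))
```

Under these assumptions, only the segment that starts with `!` is rejected.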

Tooa, Jan 05 '20 11:01

@Tooa I can try but it does not get better than my OP.

  • I created a tag in paperless named "!TODO"
  • My files are named in the format "DateTime - Correspondent - Title - Tag.pdf"
  • For example, 20201231 - Company - Hello World - !TODO.pdf

The exclamation mark in the file name causes the following after Paperless consumes the file:

  • The Correspondent and Title parts of the filename are both used as the Correspondent.
  • The tag "!TODO" becomes the Title.

Please see my attached screenshot. It should be clearer what I mean. After I changed my filename to "20201231 - Company - Hello World - TODO.pdf" (without an exclamation mark) everything gets consumed as expected.

I agree with you that, based on the regex, this should not happen, but I am no regex expert, and certainly not when it comes to Python's regex implementation. It might be that ! or other special characters are interpreted as special operators.

sceiler, Jan 07 '20 00:01

"20201231 - Company - Hello World - !TODO.pdf" or "20201231Z - Company - Hello World - !TODO.pdf"?

Tooa, Jan 07 '20 16:01

With the Z. Again see my OP and my console output:

Consuming /consume/21001231Z - Hello - World Test - !TODO.pdf

sceiler, Jan 07 '20 17:01

I was just asking since you missed the Z in your second reply:

> For example, 20201231 - Company - Hello World - !TODO.pdf

You also misunderstood my point about the regex. The regex [a-z0-9\-,]* does not match the ! character, so the tags part of your filename is not recognized as tags. I guess a solution to your problem would be to define the regex as [a-z0-9\-,!]*. However, I don't know the historical reason for defining the regex without special characters.
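A minimal sketch of how this could produce the mixed-up fields, assuming the consumer first tries a pattern with a tags group and then falls back to one without. The two patterns and the names `build_patterns` and `guess` are simplified, hypothetical stand-ins modeled on the filename format from this thread; they are not copied from the Paperless source.

```python
import re

# Hypothetical building blocks: a date with trailing "Z" and a pdf extension.
CREATED = r"(?P<created>\d{8}(\d{6})?Z)"
EXT = r"\.(?P<extension>pdf)"


def build_patterns(tag_class):
    """Ordered patterns: the one with a tags group first, then a fallback."""
    with_tags = (
        rf"^{CREATED} - (?P<correspondent>.*) - (?P<title>.*)"
        rf" - (?P<tags>{tag_class}*){EXT}$"
    )
    without_tags = rf"^{CREATED} - (?P<correspondent>.*) - (?P<title>.*){EXT}$"
    return [re.compile(p, re.IGNORECASE) for p in (with_tags, without_tags)]


def guess(filename, tag_class=r"[a-z0-9\-,]"):
    """Return the named groups of the first pattern that matches, else None."""
    for pattern in build_patterns(tag_class):
        match = pattern.match(filename)
        if match:
            return match.groupdict()
    return None


name = "21001231Z - Hello - World Test - !TODO.pdf"
print(guess(name))                             # falls back: no tags group
print(guess(name, tag_class=r"[a-z0-9\-,!]"))  # extended class: tags found
```

With the original class, the first pattern fails on `!`, so the fallback's greedy correspondent group takes "Hello - World Test" and the title becomes "!TODO". With `!` added to the class, the first pattern matches and the three fields come out separately.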

Tooa, Jan 07 '20 17:01

@Tooa you are right, my fault, sorry! I wrongly assumed that every character allowed in a filename (on Windows) would work. I dug a bit in the documentation and found this restriction described for correspondents here: https://paperless.readthedocs.io/en/latest/consumption.html#http-post. Maybe it would be a good idea to document the same restriction for titles and tags as well? What do you think?

sceiler, Jan 15 '20 16:01