paperless
Filename guessing mixing Correspondent and Title
Guesswork does not seem to work correctly for me when I let it consume files in the "DateTime - Correspondent - Title - tag.pdf" format. I get this console output:
Consuming /consume/21001231Z - Hello - World Test - !TODO.pdf
convert: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1749.
** Processing: /tmp/paperless/paperless-zxi0h09_/convert.png
378x34 pixels, 16 bits/pixel, grayscale
Input IDAT size = 11893 bytes
Input file size = 11950 bytes
Trying:
zc = 9 zm = 9 zs = 0 f = 0 IDAT size = 11408
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 11408
Selecting parameters:
zc = 9 zm = 8 zs = 0 f = 0 IDAT size = 11408
Output file: /tmp/paperless/paperless-zxi0h09_/optipng.png
Output IDAT size = 11408 bytes (485 bytes decrease)
Output file size = 11465 bytes (485 bytes = 4.06% decrease)
Checking document title for date
convert: CorruptImageProfile `xmp' @ warning/profile.c/SetImageProfileInternal/1749.
Processing sheet #1: /tmp/paperless/paperless-zxi0h09_/convert-0000.pnm -> /tmp/paperless/paperless-zxi0h09_/convert-0000.unpaper.pnm
[image2 @ 0x5558f05405e0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x5558f05405e0] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for deu
Unable to detect date for document
Completed
Document 21001231000000: Hello - World Test - !TODO consumption finished
Attached here my test file: 21001231Z - Hello - World Test - !TODO.pdf
The date is correctly shown (though it is shown in MM/dd/YYYY despite German locale in settings) but the correspondent and title got mixed together. Is this a bug or am I doing something wrong here?
I forgot to set PAPERLESS_DATE_ORDER=YMD. I only had PAPERLESS_FILENAME_DATE_ORDER=YMD but it looks like you need both. A quick test confirmed that it is working. I will test it a bit more and if it works I will close this issue.
So, I tested a bit more, and it now looks like there is another issue when the file name contains special characters. For instance, I have a tag "!TODO" because I want this tag to appear at the top of the list. This exclamation mark likely messes up the regex for guesswork because when I remove it from the filename it seems to work as expected.
This exclamation mark likely messes up the regex for guesswork because when I remove it from the filename it seems to work as expected.
Can you elaborate on "messes up"? From what I can tell, it seems that the regex for tags is defined as [a-z0-9\-,]* (reference).
@Tooa I can try but it does not get better than my OP.
- I created a tag in paperless named "!TODO"
- My files are named in the format "DateTime - Correspondent - Title - Tag.pdf"
- For example, 20201231 - Company - Hello World - !TODO.pdf
The use of this exclamation mark in the file name causes the following issue after Paperless consumed it:
- Correspondent and Title of the filename are used as Correspondent
- The tag "!TODO" is the Title.
Please see my attached screenshot. It should be clearer what I mean. After I changed my filename to "20201231 - Company - Hello World - TODO.pdf" (without an exclamation mark) everything gets consumed as expected.
I agree with you that, based on the regex, this should not happen, but I am no regex expert, and certainly not when it comes to the regex implementation in Python. It might be that ! or other special characters are interpreted as having some special function.
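For illustration, here is a minimal sketch of how the observed mix-up could arise. The two patterns and the guess() helper are hypothetical simplifications built around the tag character class quoted earlier in this thread; they are not copied from the Paperless source, and the fallback order is an assumption. Note that ! has no special meaning in Python's re syntax outside constructs such as (?!...), so the character is simply not part of the allowed class.

```python
import re

# Hypothetical, simplified patterns (an assumption, not the Paperless source).
# The tags group only allows the character class quoted in this thread.
WITH_TAGS = re.compile(
    r"^(?P<created>\d{8}Z) - (?P<correspondent>.*) - (?P<title>.*) - "
    r"(?P<tags>[a-z0-9\-,]*)\.pdf$",
    flags=re.IGNORECASE,
)
# Assumed fallback pattern for file names without a tags part.
WITHOUT_TAGS = re.compile(
    r"^(?P<created>\d{8}Z) - (?P<correspondent>.*) - (?P<title>.*)\.pdf$",
    flags=re.IGNORECASE,
)


def guess(filename):
    """Return the named groups of the first matching pattern, or None."""
    for pattern in (WITH_TAGS, WITHOUT_TAGS):
        match = pattern.match(filename)
        if match:
            return match.groupdict()
    return None


# "!TODO" is rejected by the tags class, so the fallback pattern matches and
# the greedy correspondent group swallows "Hello - World Test".
print(guess("21001231Z - Hello - World Test - !TODO.pdf"))
# Without the exclamation mark, the first pattern matches as intended.
print(guess("21001231Z - Hello - World Test - TODO.pdf"))
```

Under these assumptions, the first call yields the correspondent "Hello - World Test" and the title "!TODO", which is the same mix-up as described above, while the second call splits the name into correspondent, title and tag as expected.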
"20201231 - Company - Hello World - !TODO.pdf" or "20201231Z - Company - Hello World - !TODO.pdf"?
With the Z. Again see my OP and my console output:
Consuming /consume/21001231Z - Hello - World Test - !TODO.pdf
I was just asking since you missed the Z in your second reply:
For example, 20201231 - Company - Hello World - !TODO.pdf
You also got me wrong with the regex. The regex [a-z0-9\-,]* does not match the ! character. So, I guess a solution to your problem would be to define the regex as [a-z0-9\-,!]*. However, I don't know the historical reason for defining the regex without special characters.
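As a quick check of the two character classes, here is a small sketch that tests a tag string against each of them in isolation. The accepts() helper and the constant names are hypothetical, and the IGNORECASE flag is an assumption so that upper-case tags are accepted.

```python
import re

# The class quoted above and the widened variant that also allows "!".
# IGNORECASE is an assumption, not confirmed by this thread.
CURRENT_TAGS = re.compile(r"[a-z0-9\-,]*", flags=re.IGNORECASE)
WIDENED_TAGS = re.compile(r"[a-z0-9\-,!]*", flags=re.IGNORECASE)


def accepts(pattern, tags):
    """True if the whole tags string consists only of allowed characters."""
    return pattern.fullmatch(tags) is not None


print(accepts(CURRENT_TAGS, "TODO"))   # True
print(accepts(CURRENT_TAGS, "!TODO"))  # False
print(accepts(WIDENED_TAGS, "!TODO"))  # True
```

Other special characters would need the same treatment, so widening the class one character at a time may not scale well.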
@Tooa you are right, my fault, sorry! I wrongly assumed that every character allowed in a filename (on Windows) would work. I dug a bit in the documentation and found this restriction for correspondents here: https://paperless.readthedocs.io/en/latest/consumption.html#http-post. Maybe this would be a good addition for titles and tags as well? What do you think?