
Don't override date in filename with detected dates in document

Open JasonSanDiego opened this issue 6 years ago • 6 comments

If a date is detected in the document AND in the filename, it seems like the latter should take priority. After all, if the user has taken the time to type a date, there's a pretty good chance that the user knows more than the system about the date that should be applied and wants the date that they typed.

Here are some scenarios in which the user knows more:

  • Tax document received in 2019 but applies to 2018. User enters 20181231 in the filename to keep it organized with other tax documents, but the document contains a 2019 date from when it was issued.
  • Reprint of a marriage certificate contains a date from 1978, but the user received it in 2018 and wants to file it that way.
  • Letter from a company about an unpaid bill contains a date referencing last year's bill, but it was received more recently.

Generalized: User has a specific date categorization scheme in mind and enters dates in filename; system overrides that scheme with dates it thinks it found in files without the user knowing.
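
To make the requested priority concrete, here is a minimal sketch. It is not Paperless's actual code; `choose_created_date` and its parameters are hypothetical names used only for illustration.

```python
from datetime import date
from typing import Optional


def choose_created_date(
    filename_date: Optional[date],
    content_date: Optional[date],
    fallback: date,
) -> date:
    """Pick a document date, preferring the one the user typed in the filename."""
    if filename_date is not None:
        return filename_date  # user-supplied date wins
    if content_date is not None:
        return content_date  # otherwise use the date detected in the text
    return fallback  # e.g. the time the file was consumed


# Tax document scenario: filed under 2018, but the text contains a 2019 date.
chosen = choose_created_date(date(2018, 12, 31), date(2019, 2, 15), date(2019, 3, 1))
print(chosen)  # 2018-12-31
```

The detected date is still used when the filename carries no date, so documents with undated filenames would behave as before.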

JasonSanDiego, Oct 29 '18 13:10

I think this may be due to you not having PAPERLESS_FILENAME_DATE_ORDER set in your paperless.conf file. The code currently doesn't attempt to guess the date from the file name unless you've defined this, lest you end up mixing up dmy vs mdy vs ymd date styles.

If you set that to your preferred format, does it work for you?
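
To illustrate why the date order cannot be guessed safely, here is a small sketch using Python's standard `datetime` module. The `ORDER_FORMATS` mapping and `parse_with_order` are hypothetical helpers for this example, not taken from the Paperless source.

```python
from datetime import datetime

# Hypothetical mapping from an order setting to a strptime format.
ORDER_FORMATS = {"DMY": "%d%m%Y", "MDY": "%m%d%Y", "YMD": "%Y%m%d"}


def parse_with_order(digits: str, order: str):
    """Return the date that `digits` denotes under `order`, or None if invalid."""
    try:
        return datetime.strptime(digits, ORDER_FORMATS[order]).date()
    except ValueError:
        return None


# The same eight digits are a valid date under two different orders.
print(parse_with_order("01022019", "DMY"))  # 2019-02-01
print(parse_with_order("01022019", "MDY"))  # 2019-01-02

# With an explicit order, an unambiguous answer is possible.
print(parse_with_order("20190322", "YMD"))  # 2019-03-22
```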

danielquinn, Feb 11 '19 08:02

Huh, that's interesting. I don't have that setting in my paperless.conf file, but filename dates are being parsed correctly in most cases. I'm running a fork from Oct 28th, so maybe things have changed since then?

At any rate, my issue was with the filename date not being respected in certain instances only. I wish I had kept the file that was causing the issue. I haven't been able to replicate it with a test file. I'm OK with closing this issue at this point, and I can reopen if I can find a file that replicates the issue on the latest commit.

JasonSanDiego, Feb 13 '19 05:02

I'm having this problem as well. Conditions that replicate it are:

  • a document with the text "1 February 2019" written on it
  • file name is PDOC_20190322_0001.pdf
  • /etc/paperless.conf has both PAPERLESS_FILENAME_DATE_ORDER and PAPERLESS_DATE_ORDER set to "YMD"

When the document gets consumed, it is automatically given the date 2019-02-01 instead of 2019-03-22.

I'm not sure whether config reloading is an automatic thing or not, but in any case, the host machine has been rebooted since the config was set.

Is this expected behaviour? Am I doing something wrong?

halbrd, Mar 23 '19 12:03

@halbrd I think at least for your second bullet point the filename doesn't meet the naming conventions for the guesswork to happen...?

FWIW I think I'm experiencing this issue but won't be home for a week or two so that I can test it out.

stgarf, Mar 24 '19 13:03

My understanding from reading that page is that defining PAPERLESS_FILENAME_DATE_ORDER should cause Paperless to try to pull dates with the given format from the file name even if they don't match the rigid file name formatting.

halbrd, Mar 25 '19 06:03

> My understanding from reading that page is that defining PAPERLESS_FILENAME_DATE_ORDER should cause Paperless to try to pull dates with the given format from the file name even if they don't match the rigid file name formatting.

Hi @halbrd, my reading of the docs is that PAPERLESS_FILENAME_DATE_ORDER only affects the interpretation of the date string itself, but that date string should still come in the specified place in the file name, i.e. right at the start. Also, the separator for file name components is " - ", not "_". So I think per the docs, Paperless will interpret that whole file name as the Title alone.

However, assuming you can't change your file naming convention, you should be able to use PAPERLESS_FILENAME_PARSE_TRANSFORMS to convert it to what you need.

Try this in your paperless.conf:

PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^(PDOC)_(\\d{8})_(\\d{4})\\.", "repl":"\\2Z - \\1 - \\3."}]

That should make your date get picked up as the date, have "PDOC" as your Correspondent and 0001 as your Title. Edit to taste.
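
If you want to check a transform before putting it in the config, a quick sketch like the one below may help. It assumes the setting is decoded as JSON and applied as a regular-expression substitution; `apply_transforms` is a hypothetical helper, not Paperless's own function. The date group uses `\d{8}` because the date in `PDOC_20190322_0001.pdf` is eight digits long.

```python
import json
import re

# The setting's value is JSON; json.loads turns each "\\" into a single backslash.
raw_setting = r'[{"pattern":"^(PDOC)_(\\d{8})_(\\d{4})\\.", "repl":"\\2Z - \\1 - \\3."}]'
transforms = json.loads(raw_setting)


def apply_transforms(filename: str) -> str:
    """Apply the first transform whose pattern matches (hypothetical helper)."""
    for transform in transforms:
        new_name, count = re.subn(transform["pattern"], transform["repl"], filename)
        if count:
            return new_name
    return filename  # no pattern matched, leave the name untouched


print(apply_transforms("PDOC_20190322_0001.pdf"))  # 20190322Z - PDOC - 0001.pdf
```

A name that does not match the pattern comes back unchanged, which is a handy way to spot a pattern that is too strict.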

Hopefully that solves your problem?

rjendoubi, Apr 01 '20 22:04