
Match dates in PDFs

Open chrisb86 opened this issue 5 years ago • 3 comments

I've been using Hazel on the Mac for several years now, but I'm attracted to a platform-independent solution like organize. One thing that I use a lot in Hazel is date matching in PDFs.

You can define a pattern for a date, tell Hazel to use the nth occurrence of a date matching this pattern (counted from the beginning or the end), and load day, month and year into variables. I use this to get the date when an invoice was written and rename the file accordingly.
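The "nth occurrence from the beginning or the end" selection described above can be sketched in plain Python. This is only an illustration of the idea, not how Hazel or organize implement it; the names `DATE_RE` and `nth_date` are hypothetical, and the MM/DD/YY pattern is an assumed date format.

```python
import re

# Assumed date format: MM/DD/YY. Adjust the pattern to the dates in your PDFs.
DATE_RE = re.compile(r"(?P<month>[01]\d)/(?P<day>[0-3]\d)/(?P<year>\d{2})")


def nth_date(text, n, from_end=False):
    """Return the named groups of the n-th date match (1-based), or None."""
    matches = list(DATE_RE.finditer(text))
    if n < 1 or n > len(matches):
        return None
    # Negative indexing counts occurrences from the end of the text.
    match = matches[-n] if from_end else matches[n - 1]
    return match.groupdict()


sample = "Ordered 01/05/24, shipped 01/09/24, invoiced 01/21/24."
print(nth_date(sample, 3))
print(nth_date(sample, 1, from_end=True))
```

Both calls pick the last date in the sample text, once counted from the start and once from the end.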

Would this be possible with organize?

chrisb86 avatar May 20 '20 21:05 chrisb86

Yes this is possible with the filecontent filter: https://organize.readthedocs.io/en/latest/page/filters.html#filecontent

But I have to admit that writing the regex is not as nice or as easy as defining the Hazel pattern.

tfeldmann avatar Jun 03 '20 11:06 tfeldmann

I came here for exactly the same reason (except I use FileJuggler 2 on Windows). Would you consider integrating https://github.com/akoumjian/datefinder to make date-handling easier?

Thanks

Edit: Actually, never mind. datefinder isn't as good as I had hoped. Using a custom regex seems more useful.

Great tool by the way :)

anansii avatar Sep 09 '20 12:09 anansii

I'm no expert, but this is the fairly manual approach I took to sort files automatically. I couldn't figure out how to "grab the third YY/MM/DD", but I used the echo: '{filecontent}' approach below to get the unique text around the specific date I wanted (e.g., the third one) and made a filecontent rule based on that. Is there a better approach?

config.yaml:

rules:
# Sort Invoices Using File Names and File Content
  - name: "Sort My Invoices"
    locations: ~/Downloads/ #adjust as needed
    subfolders: false #don't look in subfolders
    filters:
      - extension: pdf #the invoice is always a PDF, so only act on PDFs
      - regex: '.{8}-.{4}-.{4}-.{4}-.{12}' #regex for the consistent file name format when the invoice is downloaded from the web; I downloaded several to check the name format is consistent.
      - filecontent: 'Invoice' #whatever text appears in the PDF that differentiates it from other files. this is probably redundant since I am first filtering using a somewhat unique file naming format.
      - filecontent: '(?P<month>[01]\d)\/(?P<day>[0123]\d)\/(?P<year>\d{2})' #finds first instance of MM/DD/YY, assigns "month"/"day"/"year" variables for later use in the file name.
    actions:
      - move:
          # Move to the proper folder and rename. The document text uses "01/21/24" (YY, not YYYY), so the date parts are captured as variables and '20' is prepended to the year to get a file name like '2024.01.21 Invoice'.
          dest: '~/Documents/Invoices/20{filecontent.year}.{filecontent.month}.{filecontent.day} Invoice.pdf'
          on_conflict: 'skip' #skip if there is a conflict
          
# To see the contents of the file to inform the 'filecontent' filters above, use the below rule to get the raw text.
  - name: "View My Invoice"
    locations: ~/Downloads/test
    subfolders: false
    filters:
      - extension: pdf
      - filecontent
    actions:
      - echo: '{filecontent}'

ws923 avatar Feb 24 '24 21:02 ws923
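One possible answer to the "grab the third date" question above is a single regular expression that lazily skips two earlier dates before capturing the next one. The Python sketch below tests that idea outside of organize. The names `THIRD_DATE` and `third_date_filename` are hypothetical. Whether the same pattern works unchanged in a filecontent filter depends on how organize compiles the expression, which is not verified here.

```python
import re

# Assumed date format: MM/DD/YY, as in the config above.
DATE = r"[01]\d/[0-3]\d/\d{2}"

# Skip two earlier dates lazily, then capture the third one.
# re.DOTALL lets "." cross line breaks in the extracted PDF text.
THIRD_DATE = re.compile(
    r"\A(?:.*?" + DATE + r"){2}.*?"
    r"(?P<month>[01]\d)/(?P<day>[0-3]\d)/(?P<year>\d{2})",
    re.DOTALL,
)


def third_date_filename(text):
    """Build a '20YY.MM.DD Invoice.pdf' name from the third date, or None."""
    match = THIRD_DATE.search(text)
    if match is None:
        return None
    # The text only contains a two-digit year, so '20' is prepended.
    return "20{year}.{month}.{day} Invoice.pdf".format(**match.groupdict())


text = "Ordered 01/05/24\nShipped 01/09/24\nInvoiced 01/21/24\nDue 02/20/24"
print(third_date_filename(text))
```

Changing the `{2}` repetition count selects a different occurrence, which avoids relying on the unique text around the date.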