excalibur
Feature request: Option to Link PDF URL, refreshes each time download page is accessed
I'd like to request an option to link PDFs (since PDF data often updates, it's much easier to keep data updated by linking than by manually uploading each time).
Once a PDF is linked and extraction rules have been saved for it, accessing its download page (example download page link: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f) should make Excalibur automatically download the PDF from the URL, re-process it with the predefined rules, and display the tables of extracted data. This should work just by accessing the download page.
I can perhaps hire someone to get this done if you're willing to add it to the main project.
So the steps are:
- Link to PDF (example: https://www.lcfcu.org/home/fiFiles/static/documents/rates.pdf)
- Set Rules for the PDF and save
- Access the download page for that pdf (example: http://127.0.0.1:5000/jobs/3c90fc1b-a9d8-4d51-a83a-218d18d4893f)
- Excalibur automatically fetches the PDF from link
- Extracts data from PDF based on predefined rule
- Displays like so:
So in the future, whenever I detect that the PDF has changed, I can access the download page link and it'll repeat the entire process again (steps 3-6).
Again, please let me know if you're open to having this change contributed to the main source; if so, I can get it coded. I feel this change is extremely important, since many PDFs on the web change frequently, which makes this feature very useful.
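The requested refresh flow could be sketched roughly as below. This is only an illustration under stated assumptions, not Excalibur's actual code: `LinkedJob`, `refresh_job`, and `extract_with_camelot` are hypothetical names, and the saved rule is assumed to be a dict of keyword arguments (such as `pages` and `flavor`) passed straight to `camelot.read_pdf`, which accepts a URL as its input path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Table = List[List[str]]  # one extracted table as rows of cell strings


@dataclass
class LinkedJob:
    """Hypothetical record pairing a PDF URL with a saved extraction rule."""
    pdf_url: str
    rule: Dict[str, str]  # assumed to hold camelot.read_pdf keyword arguments
    tables: List[Table] = field(default_factory=list)


def extract_with_camelot(pdf_url: str, rule: Dict[str, str]) -> List[Table]:
    """Let Camelot fetch the PDF from its URL and extract tables using the rule."""
    import camelot  # imported lazily so the rest of the sketch runs without it

    tables = camelot.read_pdf(pdf_url, **rule)
    return [table.df.values.tolist() for table in tables]


def refresh_job(
    job: LinkedJob,
    extract: Callable[[str, Dict[str, str]], List[Table]] = extract_with_camelot,
) -> List[Table]:
    """Re-fetch and re-extract the linked PDF, replacing the job's stored tables."""
    job.tables = extract(job.pdf_url, job.rule)
    return job.tables
```

Calling `refresh_job(job)` from whichever handler serves the job page would cover steps 4-6; the `extract` parameter exists only so the extractor can be swapped out, for example in tests.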
@majestique Thanks for the detailed explanation! I think it will be a good feature to have. But instead of implicitly processing the PDF whenever the download page is accessed, I would be in favor of a more explicit approach:
- Click on the "Extract again" button and select the rule.
- Click on "View data" which would start an extraction job. (Downloading the new PDF and extracting tables using the saved rule)
The download page just shows the data that was extracted after a job finishes; the previous page is where the job is actually triggered.
I'm open to having this in the main project. Since camelot now supports reading PDFs from a url, adding this should be relatively easy and I should be able to do that. Instead of hiring someone, you can donate to the project on opencollective to support development. :)
The reason for implicitly viewing data is automation. I have thousands of PDF links to different sites to manage.
Perhaps there can be a separate link for extracting the data and viewing the data by simply accessing the page? Any manual clicks will be a significant burden if there are thousands of PDFs to manage.
So ideally when accessing this special link:
- Excalibur re-downloads the PDF and extracts the tables
- Shows the data tables as HTML, with the option to output CSV instead (by adding &output=csv to the URL)
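The output switch could look something like the sketch below. It is an assumption-laden illustration rather than an existing Excalibur endpoint: `render_tables` is a hypothetical helper, the job URL in the usage note is made up, and tables are assumed to arrive as lists of rows of strings.

```python
import csv
import html
import io
from urllib.parse import parse_qs, urlsplit


def render_tables(request_url, tables):
    """Return (content_type, body), choosing CSV when the URL has output=csv."""
    query = parse_qs(urlsplit(request_url).query)
    output = query.get("output", ["html"])[0].lower()

    if output == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        for index, table in enumerate(tables):
            if index:
                buf.write("\n")  # blank line between consecutive tables
            writer.writerows(table)
        return "text/csv", buf.getvalue()

    # Default: one plain HTML table per extracted table, with cells escaped.
    parts = []
    for table in tables:
        rows = "".join(
            "<tr>" + "".join(f"<td>{html.escape(str(cell))}</td>" for cell in row) + "</tr>"
            for row in table
        )
        parts.append(f"<table>{rows}</table>")
    return "text/html", "".join(parts)
```

For example, `render_tables("http://127.0.0.1:5000/jobs/abc?output=csv", tables)` would return the CSV body, while the same URL without the parameter would return HTML. The parameter is parsed from the query string, so it works whether it is the first parameter (`?output=csv`) or appended to others (`&output=csv`).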
Hi @vinayak-mehta, I love this development and have supported it 👍 (planning to do more also). I'm just wondering if you think the steps below could be included?
- Excalibur re-downloads the PDF and extracts the tables on page access (no clicking)
- Shows the data tables as HTML, with the option to output CSV instead (by adding &output=csv to the URL)
Also, do you have a rough estimate of when this could be achieved if this feature is accepted?
Thank you!