ocr_for_transcribing_pdf_slides
Additional features or deprecated
I work with several large libraries of PDF documents and have been trying to expand on your great code. It has already saved me time, but a more expansive tool could save thousands of hours.
Use case - decades of books or photos/scans of pages that were transcribed by hand
Use case - terabytes of PDFs sitting in archives
We could expand on this to prompt for a path, scan that path and report how many files were found, use an EDA tool to check data accuracy, and save the results to a SQLite db.
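As a rough illustration of the "scan the path and report how many files were found" step, here is a minimal sketch using only the Python standard library. The names `ARCHIVE_EXTENSIONS`, `scan_archive`, and `summarize` are hypothetical and not part of this repo, and the extension list is only a guess at what an archive might contain.

```python
from collections import Counter
from pathlib import Path

# Assumption: which extensions count as archive material. Adjust as needed.
ARCHIVE_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def scan_archive(root):
    """Recursively collect files under `root` whose extension is in ARCHIVE_EXTENSIONS."""
    root_path = Path(root).expanduser()
    if not root_path.is_dir():
        raise NotADirectoryError(f"{root_path} is not a directory")
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_file() and p.suffix.lower() in ARCHIVE_EXTENSIONS
    )


def summarize(files):
    """Count the files found per lowercase extension."""
    return Counter(p.suffix.lower() for p in files)
```

To prompt for the path, something like `found = scan_archive(input("Path to scan: "))` followed by `print(f"{len(found)} files found")` would do. For terabyte-scale archives the sorted list may be better replaced by a generator, so that results stream instead of being held in memory.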
Agreed! :)
@EnkrateiaLucca I've finished a working test of the first part, which begins scanning a PDF archive, an archive of photos, and PDF files. Hopefully, this spring, I can work on this intermittently between my NLP classes, but enabling wide support for the computer vision elements of this feature will likely require deeper specialist expertise than I can provide.
In addition, I have a working local solution that prompts the user for a path, scans the path, and reports how many files were found.
I've also built a function that uses features from an EDA tool to check data accuracy,
and it already works well enough to roughly autosave the results to a SQLite db.
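For anyone following along, the autosave-to-SQLite step could look roughly like the sketch below. This is not the code described above, only an illustration with the standard `sqlite3` module. The table name `documents`, its columns, and the helper names `open_catalog` and `save_document` are all assumptions. The EDA check is left out because the thread does not say which tool is used.

```python
import sqlite3
from pathlib import Path


def open_catalog(db_path):
    """Open (or create) the SQLite catalog and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            path TEXT PRIMARY KEY,
            size_bytes INTEGER NOT NULL,
            text TEXT
        )
        """
    )
    return conn


def save_document(conn, path, text=None):
    """Insert or update one file's record; `text` holds OCR output if available."""
    file_path = Path(path)
    conn.execute(
        "INSERT OR REPLACE INTO documents (path, size_bytes, text) VALUES (?, ?, ?)",
        (str(file_path), file_path.stat().st_size, text),
    )
    conn.commit()
```

Using the file path as the primary key means a rescan overwrites the existing row instead of creating duplicates. Committing after every file is simple but slow for large archives, so batching commits may be worth considering.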
That's awesome!
In reply to Andrew Schell, 26 Apr 2023, 21:30.
@EnkrateiaLucca I'm curious about the kind of code you write and the work you do. Most if not all of mine is for data science analysis, so it's less DevOps and more full-stack analytics, ML, and NLP.
Are you learning or doing more Data Engineering, DevOps, SE, Front-end, or Full-stack?
Actually yeah, I am doing more now for work, and because you kind of feel a bit useless as an ML engineer if you can't deploy your stuff easily... LOL
In reply to Andrew Schell, 26 Apr 2023, 21:46.
@EnkrateiaLucca I'm still getting errors when I try to load TensorFlow or OpenCV, so I'll keep working on it. When I have something, I'll submit a PR. Do you ever work with the PDFminer6 repo? It seems like we are doing similar work there, so I figure we could collaborate.
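When tracking down load errors like these, it can help to see exactly which import fails and with what exception. The helper below is a hypothetical diagnostic sketch, not something from this repo. It only assumes that the packages are imported under their usual module names (`tensorflow`, and `cv2` for OpenCV).

```python
import importlib


def check_imports(module_names):
    """Try importing each module; map its name to a version string or the error raised."""
    results = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # ImportError, or errors raised while the package initializes
            results[name] = f"FAILED: {type(exc).__name__}: {exc}"
        else:
            results[name] = getattr(module, "__version__", "version unknown")
    return results
```

Running `check_imports(["tensorflow", "cv2"])` in the same environment that shows the errors gives a per-module summary that can be pasted into the issue or a PR description.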
Actually, I don’t know that repo, will check it out though! Thanks! :)
In reply to Andrew Schell, 26 Apr 2023, 22:46.