ocr_for_transcribing_pdf_slides

Additional features or deprecated

Open datatalking opened this issue 2 years ago • 7 comments

I work with several large libraries of PDF documents and have been trying to expand on your great code. It has already saved me time, but a more expansive tool could save thousands of hours.

Use case - decades of books or photos/scans of pages, hand transcribed
Use case - terabytes of PDFs sitting in archives

We could expand on this so that it prompts for a path, scans that path, and reports how many files have been found. We could then use an EDA tool to check data accuracy and save the results to a SQLite db.
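A minimal sketch of that flow, assuming Python with only the standard library. The names `scan_archive`, `save_inventory`, `SCAN_EXTENSIONS`, and the `files` table layout are hypothetical and not taken from the repo, and the EDA accuracy check is left out:

```python
import sqlite3
from pathlib import Path

# Hypothetical set of file types to look for; adjust to the archive at hand.
SCAN_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def scan_archive(root):
    """Recursively collect files under root whose extension is in SCAN_EXTENSIONS."""
    root_path = Path(root).expanduser()
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_file() and p.suffix.lower() in SCAN_EXTENSIONS
    )


def save_inventory(files, db_path="archive_inventory.db"):
    """Record each file's path, extension, and size in a SQLite table."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS files ("
                "path TEXT PRIMARY KEY, extension TEXT, size_bytes INTEGER)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                [(str(p), p.suffix.lower(), p.stat().st_size) for p in files],
            )
    finally:
        conn.close()
    return len(files)


def main():
    """Prompt for a path, report how many files were found, then store them."""
    folder = input("Path to scan: ").strip()
    found = scan_archive(folder)
    print(f"{len(found)} files have been found")
    save_inventory(found)
```

Calling `main()` prompts for the folder. Using the path as the primary key with `INSERT OR REPLACE` means rescanning the same archive updates rows instead of duplicating them.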

datatalking avatar Jun 27 '22 09:06 datatalking

Agreed! :)

EnkrateiaLucca avatar Apr 26 '23 00:04 EnkrateiaLucca

@EnkrateiaLucca I've finished a working test of the first part, which begins scanning a PDF archive, an archive of photos, and PDF files. Hopefully, this spring, I can work on this intermittently between my NLP classes, but enabling wide support for the Computer Vision elements of this feature will likely require specialist expertise I can't provide.

In addition, I have a local working solution that prompts the user for a path, scans the path, and reports how many files have been found.

I've also built a function that uses features from an EDA tool for data accuracy, and it already works to roughly autosave the results to a SQLite db.

datatalking avatar Apr 26 '23 20:04 datatalking

Thats awesommeee

EnkrateiaLucca avatar Apr 26 '23 20:04 EnkrateiaLucca

@EnkrateiaLucca I'm curious about your thoughts and the kind of code/work you write/do. Most if not all of mine is for data science analysis, so it's less DevOps and more full-stack analytics, ML, and NLP.

Are you learning or doing more Data Engineering, DevOps, SE, Front-end, or Full-stack?

datatalking avatar Apr 26 '23 20:04 datatalking

Actually yeah, I am doing more now for work, and because you kind of feel a bit useless as an ML engineer if you can't deploy your stuff easily... LoL

EnkrateiaLucca avatar Apr 26 '23 20:04 EnkrateiaLucca

@EnkrateiaLucca I'm still getting errors when I try to load TensorFlow or OpenCV, so I'll keep working on it. When I have something, I'll submit a PR. Do you ever work with the PDFminer6 repo? It seems like we are doing similar work there, so I figure we could collaborate.

datatalking avatar Apr 26 '23 21:04 datatalking

Actually I don't know that repo, will check it out though! Thanks! :)

EnkrateiaLucca avatar Apr 26 '23 22:04 EnkrateiaLucca