gh_repo_download
PDF Support
It would be neat to be able to parse the text from PDFs and include them.
Most of the work would happen in `downloader.file_utils.extract_text_files` and supporting functions. You'd have to determine if a file was a PDF and then implement a function for extracting text from it. Not too big of a deal.
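For the "determine if a file was a PDF" part, here is a minimal sketch, assuming we already have the file's raw bytes in hand. PDF files start with a `%PDF-` header, and some files carry a little junk before it, so this looks in the first 1024 bytes rather than only at offset 0. The name `looks_like_pdf` and the 1024-byte window are my own choices for illustration, not anything that exists in the repo.

```python
PDF_MAGIC = b"%PDF-"
# Assumption: a small search window, since some PDFs have leading bytes
# before the header. Tighten to data.startswith(PDF_MAGIC) if preferred.
HEADER_WINDOW = 1024


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes appear to be a PDF, based on the header."""
    return PDF_MAGIC in data[:HEADER_WINDOW]
```

Checking the content rather than the `.pdf` extension means a misnamed file is still handled correctly, and a text file that happens to be called `notes.pdf` is not sent to the PDF parser.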
PDFs are kind of hairy to get nice text from, but luckily LLMs are pretty good at dealing with poorly-formatted text so we don't have to get crazy with making sure everything is perfect.
I think this shouldn't be too bad, so I'm going to mark this as a good first issue. Hopefully I'm not under-thinking it.
Do you have any libs in mind that we could use for PDF parsing?
I've used PyMuPDF to good success in the past. I don't know if it's the best choice nowadays, so I'm open to suggestions.
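To make that concrete, here is a rough sketch of what extraction with PyMuPDF could look like, assuming the PDF is available as bytes in memory. `extract_pdf_text` and `join_page_texts` are hypothetical names, not functions from the repo. The PyMuPDF import is done inside the function so the module still loads if the library isn't installed.

```python
from typing import Iterable


def join_page_texts(page_texts: Iterable[str]) -> str:
    """Join per-page text with blank lines, skipping pages with no text."""
    cleaned = (text.strip() for text in page_texts)
    return "\n\n".join(text for text in cleaned if text)


def extract_pdf_text(data: bytes) -> str:
    """Extract plain text from PDF bytes using PyMuPDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported lazily so the dependency stays optional

    # Open the document from memory instead of a file path.
    with fitz.open(stream=data, filetype="pdf") as doc:
        # page.get_text() returns the page's plain text.
        return join_page_texts(page.get_text() for page in doc)
```

This doesn't attempt any layout cleanup, which fits the earlier point that the text doesn't need to be perfect. Scanned PDFs with no text layer would come back empty with this approach.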
Got it. So from what I understand, we basically need to allow users to upload a PDF and extract the same information that we currently do from repos, right? Please correct me if I am wrong here.
So, what happens now is that the user selects a GitHub repo, and then we go through all the files in the repo and select only the plain text files. https://github.com/dmwyatt/gh_repo_download/blob/0f4d5315e3d8648c3a0aac50d5cfe5b03743db9a/downloader/file_utils.py#L234
Right here we check if the file is plain text...
https://github.com/dmwyatt/gh_repo_download/blob/0f4d5315e3d8648c3a0aac50d5cfe5b03743db9a/downloader/file_utils.py#L303
The updates this issue needs are going to revolve around saying something like `if is plain text` **or** `if is PDF`, and in the `if is PDF` branch, extract the PDF contents and save it with `text_files[member.filename] = content`.
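Roughly, the shape of that change could look like the sketch below. This is not the actual code from `file_utils.py`: `collect_text_files`, `is_plain_text`, and `is_pdf` are hypothetical stand-ins, and I'm assuming the repo contents arrive as a zip archive (which is what `member.filename` suggests). The PDF extractor is passed in as a callable so the sketch doesn't depend on any particular library.

```python
import io
import zipfile
from typing import Callable, Dict


def is_plain_text(data: bytes) -> bool:
    """Hypothetical stand-in for the project's existing plain-text check."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_pdf(data: bytes) -> bool:
    """Hypothetical PDF check based on the file header."""
    return data.startswith(b"%PDF-")


def collect_text_files(
    zip_bytes: bytes, pdf_to_text: Callable[[bytes], str]
) -> Dict[str, str]:
    """Map each plain-text or PDF member of the archive to its text."""
    text_files: Dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            data = archive.read(member)
            # Check PDF first: an ASCII-only PDF could otherwise pass
            # the plain-text check and be stored as raw PDF source.
            if is_pdf(data):
                text_files[member.filename] = pdf_to_text(data)
            elif is_plain_text(data):
                text_files[member.filename] = data.decode("utf-8")
    return text_files
```

Anything that is neither plain text nor a PDF is skipped, same as today.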