gh_repo_download PDF Support

It would be neat to be able to extract the text from PDFs and include it.

Most of the work would happen in downloader.file_utils.extract_text_files and supporting functions. You'd have to determine if a file was a PDF and then implement a function for extracting text from it. Not too big of a deal.
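As a rough sketch of the "determine if a file was a PDF" step: PDF files begin with a `%PDF-` header, so looking at the leading bytes is more reliable than trusting a `.pdf` extension. The name `is_pdf` and the 1024-byte search window are assumptions for illustration and are not existing code in the repo.

```python
PDF_MAGIC = b"%PDF-"


def is_pdf(data: bytes) -> bool:
    """Return True if the bytes look like a PDF file.

    Hypothetical helper: searches a small window at the start of the file
    for the PDF header, since some producers put a few junk bytes before it.
    """
    return PDF_MAGIC in data[:1024]
```

Usage would be something like `is_pdf(raw_bytes)` on the bytes read from an archive member, before deciding how to turn that member into text.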

PDFs are kind of hairy to get nice text from, but luckily LLMs are pretty good at dealing with poorly-formatted text so we don't have to get crazy with making sure everything is perfect.

I think this shouldn't be too bad, so I'm going to mark this as a good first issue. Hopefully I'm not under-thinking it.

Mar 26 '24 15:03 dmwyatt

Do you have any libs in mind that we could use for PDF parsing?

Mar 30 '24 19:03 SwarajBaral

I've used PyMuPDF with good success in the past. I don't know if it's the best choice nowadays, so I'm open to suggestions.
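For reference, a minimal sketch of what text extraction with PyMuPDF could look like. It assumes PyMuPDF is installed (`pip install pymupdf`) and that the PDF is already in memory as bytes; the function name `extract_pdf_text` is made up for this example. Layout quality will vary from PDF to PDF.

```python
def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Extract plain text from an in-memory PDF using PyMuPDF.

    Hypothetical helper, not part of the repo. The import is kept local so
    the rest of the module still loads if PyMuPDF is not installed.
    """
    import fitz  # PyMuPDF's classic import name

    # Open the document from bytes rather than from a file path.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        # get_text() with no arguments returns the page's plain text.
        pages = [page.get_text() for page in doc]
    return "\n".join(pages)
```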

Mar 30 '24 22:03 dmwyatt

Got it. So from what I understand, we basically need to allow users to upload a PDF and extract the same information that we currently extract from repos, right? Please correct me if I'm wrong here.

Mar 31 '24 16:03 SwarajBaral

So, what happens now is that the user selects a GitHub repo, and then we go through all the files in the repo and select only the plain text files: https://github.com/dmwyatt/gh_repo_download/blob/0f4d5315e3d8648c3a0aac50d5cfe5b03743db9a/downloader/file_utils.py#L234

Right here we check if the file is plain text...

https://github.com/dmwyatt/gh_repo_download/blob/0f4d5315e3d8648c3a0aac50d5cfe5b03743db9a/downloader/file_utils.py#L303

The updates this issue needs revolve around changing that check to something like "if is plain text **or** if is PDF". In the "if is PDF" branch, extract the PDF contents and save them with text_files[member.filename] = content
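Here is a self-contained sketch of that branching, not the actual code from file_utils.py. Everything except `text_files[member.filename] = content` is assumed for illustration: `collect_text_files`, `is_pdf`, and `looks_like_text` are hypothetical names, the plain-text check is a simple stand-in for the project's real one, and the PDF extractor is passed in as a callable so any library can be plugged in.

```python
import io
import zipfile
from typing import Callable, Dict


def is_pdf(data: bytes) -> bool:
    # Hypothetical check: look for the PDF header near the start of the file.
    return b"%PDF-" in data[:1024]


def looks_like_text(data: bytes) -> bool:
    # Stand-in for the project's real plain-text check.
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def collect_text_files(
    zip_bytes: bytes, pdf_to_text: Callable[[bytes], str]
) -> Dict[str, str]:
    """Map archive member names to text, handling PDFs in their own branch."""
    text_files: Dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for member in zf.infolist():
            if member.is_dir():
                continue
            data = zf.read(member)
            # Check for PDF first: a small PDF can also pass a text check.
            if is_pdf(data):
                content = pdf_to_text(data)
            elif looks_like_text(data):
                content = data.decode("utf-8")
            else:
                continue  # skip binary files, as before
            text_files[member.filename] = content
    return text_files
```

The PDF check runs before the plain-text check on purpose, so a PDF that happens to decode as text is still routed through the extractor instead of being saved as raw PDF markup.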

Mar 31 '24 20:03 dmwyatt
