[GitHub Loader] Add support for loading a specific folder and branch of a GitHub repository
🚀 The feature
Users want to load a specific folder from a GitHub repository. They also want to load data from a specific branch rather than only the default branch.
Motivation, pitch
Requested by a user in the Discord community: https://discord.com/channels/1121119078191480945/1125758905310519327/1204150824868126790
@deshraj Can I pick this up?
@Dev-Khant sure go for it.
Hi @deshraj,
To get data for a repo, a branch, and a specific folder, I think using the get_repo function from the PyGithub library would be easier than the current approach of cloning the repo and then traversing the tree. To extract a specific file, we can use get_contents directly.
Docs:
- get_repo: https://pygithub.readthedocs.io/en/latest/examples/Repository.html#get-all-of-the-contents-of-the-root-directory-of-the-repository
- get_contents: https://pygithub.readthedocs.io/en/latest/examples/Repository.html#get-a-specific-content-file
I have used this approach before: https://github.com/Dev-Khant/Analyze-Github-Code/blob/main/LLM/scrap.py#L33
This change will only affect queries with type=="repo".
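A minimal sketch of the idea, not the loader's actual code: the names iter_repo_files and load_from_github, the "main" default branch, and the token handling are all hypothetical. It assumes a recent PyGithub release (one that provides github.Auth) and relies on get_contents accepting a ref argument to select the branch.

```python
from typing import Iterator, List, Tuple


def iter_repo_files(repo, folder: str = "", branch: str = "main") -> Iterator[Tuple[str, str]]:
    """Yield (path, text) for every regular file under `folder` on `branch`.

    `repo` is expected to behave like a PyGithub Repository object.
    """
    pending = [folder]
    while pending:
        path = pending.pop()
        # `ref` selects the branch (a tag or commit SHA also works).
        contents = repo.get_contents(path, ref=branch)
        # A file path returns a single item, a directory returns a list.
        if not isinstance(contents, list):
            contents = [contents]
        for item in contents:
            if item.type == "dir":
                # Descend into sub-folders instead of cloning the repo.
                pending.append(item.path)
            elif item.type == "file":
                text = item.decoded_content.decode("utf-8", errors="replace")
                yield item.path, text
            # Other entry types (symlinks, submodules) are skipped here.


def load_from_github(token: str, repo_name: str, folder: str = "", branch: str = "main") -> List[Tuple[str, str]]:
    """Hypothetical entry point: `repo_name` looks like "owner/name"."""
    # Imported lazily so the traversal above can be exercised without PyGithub.
    from github import Auth, Github

    repo = Github(auth=Auth.Token(token)).get_repo(repo_name)
    return list(iter_repo_files(repo, folder, branch))
```

Each folder costs one API request, so large trees may run into rate limits, and very large files may not be returned with inline content by this endpoint. Both cases would need separate handling.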
Let me know if I can move ahead with this approach.
Yeah this seems like a reasonable approach to me as well. Please proceed with this approach.
@deshraj Do we need to keep the data from results here? Currently the data variable is overwritten by self._get_github_repo_data. Let me know whether we want to add the data from results or keep only the repo contents.
Ah, good catch. This looks like a bug and should be fixed. Can you please fix it in your PR?
Yes, I can fix it. But should we add results to data, or keep just the repo contents?
@deshraj I have raised the PR.