Refactor download_loader to change how loaders are downloaded.
This PR changes how loaders are downloaded from llama-hub. Originally, we downloaded readers under the .modules directory.
I propose that we create a directory called modules, make it a Python package, and download the readers under their corresponding subdirectories, such as modules/s3, modules/discord, etc.
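A minimal sketch of what bootstrapping that layout could look like (the helper names `ensure_modules_package` and `loader_subdir` are illustrative assumptions, not the actual llama_index implementation):

```python
import os


def ensure_modules_package(base_dir: str = ".") -> str:
    """Create the modules/ directory and make it an importable
    Python package by adding an empty __init__.py."""
    modules_dir = os.path.join(base_dir, "modules")
    os.makedirs(modules_dir, exist_ok=True)
    init_path = os.path.join(modules_dir, "__init__.py")
    if not os.path.exists(init_path):
        open(init_path, "w").close()
    return modules_dir


def loader_subdir(modules_dir: str, loader_name: str) -> str:
    """Create a per-loader subpackage, e.g. modules/s3 or modules/discord."""
    subdir = os.path.join(modules_dir, loader_name)
    os.makedirs(subdir, exist_ok=True)
    init_path = os.path.join(subdir, "__init__.py")
    if not os.path.exists(init_path):
        open(init_path, "w").close()
    return subdir
```

Because each subdirectory is itself a package, a downloaded reader becomes importable as modules.<loader_name> without any sys.path tricks.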
We will also have to update each loader's __init__.py in llama-hub with:
from .base import <ReaderClassName>
from .<extra_file_name> import <ExtraClassName>
This allows us to use the downloaded readers as follows:
from gpt_index import download_loader

download_loader("GithubRepositoryReader")
from modules.github_repo import GithubClient, GithubRepositoryReader

if __name__ == "__main__":
    github_client = GithubClient(...)
    loader = GithubRepositoryReader(github_client=github_client, ...)

S3Reader = download_loader("S3Reader")
s3_reader = S3Reader(...)
It doesn't break how we were originally using download_loader.
While I was working on this PR, I found out that raw.githubusercontent.com doesn't update the content of a file immediately. See the following Stack Overflow question: github-not-update-raw-after-commit. In my case the cache max-age was 24 hours.
So I used the https://api.github.com/repos/{OWNER}/{REPO}/contents/{PATH}
GitHub REST API endpoint instead. It works without an access token. The only caveat is that the rate limit for unauthenticated requests is 600 per hour.
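For reference, the contents endpoint returns the file body base64-encoded under a "content" key, so the downloader has to decode it rather than read raw text. A hedged sketch (the function names here are illustrative, not the actual download_loader code):

```python
import base64
import json
from urllib.request import urlopen  # live call counts against the 600 req/hr limit


def contents_url(owner: str, repo: str, path: str) -> str:
    """Build the GitHub REST contents-API URL for a file."""
    return f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"


def decode_contents_response(payload: dict) -> str:
    """The contents API returns the file body base64-encoded under
    the "content" key (with embedded newlines, which b64decode ignores)."""
    return base64.b64decode(payload["content"]).decode("utf-8")


def fetch_loader_file(owner: str, repo: str, path: str) -> str:
    """Fetch one loader file, bypassing raw.githubusercontent.com's cache."""
    with urlopen(contents_url(owner, repo, path)) as resp:
        return decode_contents_response(json.load(resp))
```

Unlike raw.githubusercontent.com, this endpoint serves the latest commit immediately, which is the whole point of the switch.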
I would also like to get your opinion on this, @emptycrown. cc: @jerryjliu
I opened a PR against llama-hub to add GithubRepositoryReader: https://github.com/emptycrown/llama-hub/pull/34. fyi @emptycrown @jerryjliu
amazing, will take a look soon! thanks so much for the work @ahmetkca. on an unrelated note, do you have an example walkthrough / screenshots? i'd love to help publicize your work
hey @jerryjliu, thanks. I have simple examples which are good, but I am working on a more advanced index construction with prompts specific to code files and GitHub repositories. I hope it will give me better results. I am planning to get it done by this Friday. (I know I said the same thing last week :) but hopefully I will get it done this week)
This looks awesome! Thanks for all the improvements. The 600 req/hr limit should be fine since we're caching everything. The website, for example, needs an API key, which I believe is 5000 req/hr.
We probably would have needed to do the extra_files
thing anyway as the loaders became more complex, so I'm glad you set the groundwork for that. I'll take a closer look at the loader on LlamaHub.
Finally, we don't need to update every existing loader this way, do we? It's only if they want to be used via the from modules.github_repo import GithubClient, GithubRepositoryReader
pattern.
Yes, it shouldn't break any existing usage. It still returns the Reader. We can update their __init__ files to support both ways of using them.