llama_index Refactor download_loader to how loaders are downloaded.

Refactor download_loader to how loaders are downloaded.

Open ahmetkca opened this issue 1 year ago • 6 comments

This PR changes how loaders are downloaded from llama-hub. Originally, we downloaded readers under .modules directory. I propose that we create a directory called modules and make it a python module and download the readers under their corresponding directories such as modules/s3, modules/discord etc.

And we will have to update each loaders init.py in the llama-hub with:

from .base import <ReaderClassName>
from .<extra_file_name> import <ExtraClassName>

This allows us the use the downloaded readers as follows

from gpt_index import download_loader

download_loader("GithubRepositoryReader")
from modules.github_repo import GithubClient, GithubRepositoryReader

if __name__ == "__main__":
    github_client = GithubClient(...)
    loader = GithubRepositoryReader(github_client=github_client, ...)

   S3Reader = download_loader("S3Reader")
   s3_reader = S3Reader(...)

It doesn't break how we were originally using the download_loader.

While I was working on this PR. I found out that raw.githubusercontent.com doesn't update the content of the file immediately. See the following stackoverflow question: github-not-update-raw-after-commit. In my case the cache max-age was 24 hours.

So, I used https://api.github.com/repos/{OWNER}/{REPO}/contents/{PATH} GitHub REST API endpoint. It works without access token. The only caveat is that the rate limit for unauthorized requests is 600 per hour.

Feb 13 '23 08:02 ahmetkca

I would also like get your opinion on this as well @emptycrown, cc: @jerryjliu

Feb 13 '23 08:02 ahmetkca

I opened a PR against llama-hub to add GithubRepositoryReader. https://github.com/emptycrown/llama-hub/pull/34. fyi @emptycrown @jerryjliu

Feb 14 '23 02:02 ahmetkca

I opened a PR against llama-hub to add GithubRepositoryReader. emptycrown/llama-hub#34. fyi @emptycrown @jerryjliu

amazing, will take a look soon! thanks so much for the work @ahmetkca. on an unrelated note, do you have an example walkthrough / screenshots? i'd love to help publicize your work

Feb 14 '23 02:02 jerryjliu

hey @jerryjliu, thanks. I have simple examples which are good but I am working on a more advanced index construction with prompts specific to code files and GitHub repository. I hope it will give me better results. I am planning to get it done by this Friday. ( I know I said the same thing last week :) but hopefully I will get it done this week)

Feb 14 '23 02:02 ahmetkca

This looks awesome! Thanks for all the improvements. The 600 req/hr limit should be fine since we're caching everything. The website, for example, needs an API key, which I believe is 5000 req/hr.

We probably would have needed to do the extra_files thing anyways as the loaders became more complex so glad you set the groundwork for that. I'll take a closer look at the loader on LlamaHub.

Finally, we don't need to update every existing loader this way, do we? It's only if they way to be used via the from modules.github_repo import GithubClient, GithubRepositoryReader pattern.

Feb 14 '23 17:02 emptycrown

This looks awesome! Thanks for all the improvements. The 600 req/hr limit should be fine since we're caching everything. The website, for example, needs an API key, which I believe is 5000 req/hr.

We probably would have needed to do the extra_files thing anyways as the loaders became more complex so glad you set the groundwork for that. I'll take a closer look at the loader on LlamaHub.

Finally, we don't need to update every existing loader this way, do we? It's only if they way to be used via the from modules.github_repo import GithubClient, GithubRepositoryReader pattern.

Yes it shouldn't break any existing stuff. It still returns the Reader. We can update their init files to provide two way of using.

Feb 14 '23 22:02 ahmetkca

llama_index llama_index copied to clipboard

Refactor download_loader to how loaders are downloaded.

llama_index
llama_index copied to clipboard