langchain icon indicating copy to clipboard operation
langchain copied to clipboard

Add full Google Drive features.

Open pprados opened this issue 1 year ago • 16 comments

Reimplement the Google Drive features

Propose :

  • langchain.docstore.GoogleDriveDocStore
  • langchain.document_loaders.GoogleDriveLoader
  • langchain.utilities.GoogleDriveAPIWrapper
  • langchain.tools.GoogleDriveSearchTool
  • langchain.utilities.GoogleDriveUtilities

Features:

  • Fully compatible with Google Drive API
  • Manage file in trash
  • Manage shortcut
  • Manage file description
  • Paging with request GDrive list()
  • Multiple kind of template for request GDrive
  • Convert a lot of mime type (can be configured). The list is adjusted according to the packages availables
  • Convert GDoc, GSheet and GSlide with differents modes
  • Can use only the description of files, without loading and conversion of the body
  • Lambda fine filter
  • Remove duplicate documents (in case of shortcut)
  • Add Url to documents (or part of documents like specific slide)
  • Use environment variable for reference an API tokens
  • Manage different king of strange state with Google File (absence of URL, etc.)
  • Use fully lazy strategy to save memory

Recognition

If you accept my pull-request, you can mention me @pprados. Thanks

Before submitting

Unit-tests coverage >80% of new code

No integration test, but some notebook to show how to use.

  • docs/modules/agents/tools/examples/google_drive.ipynb
  • docs/modules/indexes/document_loaders/examples/google_drive.ipynb
  • docs/modules/indexes/retrievers/examples/google_drive.ipynb

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

@eyurtsev @hwchase17 @vowelparrot might be interested

pprados avatar May 23 '23 15:05 pprados

@vowelparrot I have 3 workflow awaitings approval. You must accept to start these jobs?

pprados avatar May 25 '23 09:05 pprados

I'd gladly collaborate on improving Google Drive support on LangChain. The above test failled since googleapiclient is a new requirement for poetry.lock.

Yvelo avatar May 25 '23 19:05 Yvelo

I resolve this problem. I test the github action in my branch, and now all workflow are correct.

pprados avatar May 30 '23 13:05 pprados

Sorry. I fix the format now. It's not possible to start by hand, the workflow lint

pprados avatar May 30 '23 13:05 pprados

@eyurtsev Can you run the workflows to validate this version ?

pprados avatar May 31 '23 10:05 pprados

After this pull-request, I will normalize the link with google token, to use one scope for google drive, gmail, etc.

pprados avatar May 31 '23 10:05 pprados

@eyurtsev, can you active the workflows, and if all is correct, can you review this code?

pprados avatar Jun 01 '23 13:06 pprados

@eyurtsev, sorry for the last error. In my branch, the workflows run without error.

pprados avatar Jun 02 '23 07:06 pprados

@eyurtsev, I have a question about my implementation.

In the method _lazy_load_file_from_file(), I had an optional run_manager. I can use this manager if I catch an exception.

if isinstance(run_manager, CallbackManagerForToolRun):
  run_manager.on_tool_error(e)
elif isinstance(run_manager, CallbackManagerForChainRun):
  run_manager.on_chain_error(e)

But the code is not clean. Do you have a better idea?

I initiate the possibility to use a lazy approach in the retriever.

def lazy_get_relevant_documents(self,query:str) -> Iterator[Document]:
   ...

The default implementation transforms a classic list of documents to an iterator. But, the subclasses can be choice to implements a lazy approach, to optimize the memory footprint.

In my Google Drive utilities, I use a lazy approach.

Later, I would like to update the link between the loaders and vectordb, to use a lazy approach if it's possible. Then, a loader can return a big number of documents to import, without problems with the memory.

pprados avatar Jun 02 '23 10:06 pprados

The code changes every day, so I must make rebase another time. @eyurtsev, If you star the workflows quickly, all will be correct ;-)

pprados avatar Jun 05 '23 05:06 pprados

Hello @hwchase17, The master moves every day. So, sometime I have the 3 workflows OK, sometime I have no conflicts with the base branch. Else, I must rebase every day. Can you analyze this pull request or find someone to do it?

Thanks

pprados avatar Jun 06 '23 09:06 pprados

@hwchase17, @eyurtsev, Another time, all checks have passed, but I have a conflicting files: poetry.lock.

I rebase my code with the last version. Can you review the code before the next release ? Please ;-)

pprados avatar Jun 09 '23 07:06 pprados

Hello @eyurtsev, @hwchase17,

I'm sure you've got plenty of pull requests to validate. New version with a rebase from 12 June. Maybe, other people can revu my code. I have been waiting for 3 weeks now. Thanks

pprados avatar Jun 12 '23 08:06 pprados

Hello @eyurtsev, @hwchase17,

I'm sure you've got plenty of pull requests to validate. New version with a rebase from 12 June. Maybe, other people can revu my code. I have been waiting for 3 weeks now. Thanks

Unfortunately, I don't believe I can properly participate in the review of @pprados contribution. However, it is clearly a significant and useful improvement on the current implementation.

Yvelo avatar Jun 12 '23 12:06 Yvelo

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Sep 5, 2023 2:16pm

vercel[bot] avatar Jun 16 '23 11:06 vercel[bot]

@eyurtsev or @agola11, may be, you can participate the review? Thanks

pprados avatar Jun 16 '23 14:06 pprados

Hello @hwchase17, Because the last version (205) have a bug, it's not possible to build the doc, and check my code. I waiting another version to update my pull request.

pprados avatar Jun 19 '23 07:06 pprados

@pprados is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Jun 20 '23 10:06 vercel[bot]

I've adjusted the code from the version 0.0.216.

If someone would like to test the different implementation, before the pull request is validated, you can visit here.

@eyurtsev, please, I can do something to validate this pull request or tell me what it is.

pprados avatar Jun 27 '23 16:06 pprados

hey @pprados, sorry for delay! this pr is a bit tough to review given it's size. would it make sense to split it up? sounds like there's at least 5 fairly modular/independent classes being added

cc @hinthornw

baskaryan avatar Jun 30 '23 22:06 baskaryan

Hello @baskaryan and @hinthornw,

"5 fairly modular/independent classes being added". It's not really the case. All these modular/independent classes extend the same class [GoogleDriveUtilities](https://github.com/pprados/langchain/blob/pprados/google_drive/langchain/tools/google_drive/tool.py) with just some modification in the default values.

The effort is essentially in this file. The rest is easy to analyze.

Some unit-test, some little modification to generate the doc with a hierarchy.

I will fix the code, another time Monday, to update the code with the last version of langchain. But may be, you can comment the current version.

Thanks

pprados avatar Jul 01 '23 08:07 pprados

Hello @baskaryan and @hinthornw,

Thank you for taking the time to analyze my code. I've just resynchronized it with the upstream/master branch (version 222), and I check the Github Action test by hand, in my clone. All tests have been passed. I hope this will also be the case on the official repo.

I'm sorry if it's so long, but there are a number of different levels, just to make it's complete. The code implements an utility, an API wrapper, a Store and a DocumentLoader.

My English isn't perfect, so please feel free to correct the documentation if necessary.

I'm trying to help with these information. To analyze this pull-request, I propose these steps:

More, I don't know what to do ;-)

I hope you'll now be able to validate my pull-request.

pprados avatar Jul 04 '23 13:07 pprados

:worried:

Hello @eyurtsev @hwchase17 @vowelparrot, @baskaryan, and @hinthornw,

Once again, I've resynchronized the code. I've been desperately waiting for a code review for 10 weeks. I suppose this is the last time I'll be modifying the code before reluctantly giving up.

Thank you for taking the time to analyze my code.

I've just resynchronized it with the upstream/master branch (version 250). I cannot check the Github Action test manually now; it's deactivated.

Quick review

For a quick review of the features:

Full review

I apologize for its length, but there are several different levels to ensure completeness. The code implements a utility, an API wrapper, a tool, and a DocumentLoader.

My English isn't perfect, so please feel free to correct the documentation if necessary.

I'm providing this information to assist with the review. To analyze this pull-request, I propose the following steps:

  • Please try using it:

  • Now, let's review the root class langchain.utilities.google_drive.GoogleDriveUtilities

    • It's the implementation of the GDriver API
    • All other classes extend this one, with specific default parameters and an implementation of the @abstract methods.
    • The code uses a lazy strategy (only one Document is loaded in memory)
    • The code can load GDoc, GSheet, and GSlide documents, or, for other documents, use a dictionary with an association between the mime-type and a loader.
    • A loop loads the first page of list() API, extracts some files, and continues with the next pages.
    • Some modifications were applied if the document is a shortcut, or if some documents lack metadata, etc.
    • All list() parameters can be set in this class.
    • To initialize the credentials, I propose using an environment variable GOOGLE_ACCOUNT_FILE. The file can be for a user or a service (I analyze the file to use the correct API).
  • Now, let's review the class langchain.utilities.google_drive.GoogleDriveAPIWrapper

    • This class extends GoogleDriveUtilities
    • Since this wrapper is used with tools, the return value must be short.
    • I propose different modes to use a snippet or the first lines of the body.
    • I have implemented the method run(query) and, like bing_search, results().
  • Now, let's review the class langchain.document_loaders.google_drive.GoogleDriveLoader

    • This class extends GoogleDriveUtilities
    • To be compatible with the older version of GoogleDriveLoader, I have managed the deprecated parameters service_account_key and credentials_path.
    • I have also implemented the parameters documents_ids and file_ids, but I declare that these parameters are deprecated.
    • I have implemented the lazy_load() and load() APIs
    • I propose a tool to update the description attribute in GDrive file with a summary of the body of the document (lazy_update_description_with_summary() and update_description_with_summary()).
  • Now, let's review the class langchain.retriever.google_drive.GoogleDriveRetriever

    • This class extends GoogleDriveUtilities
    • It implements get_relevant_documents()
  • Now, let's review the class langchain.tools.google_drive.tools.GoogleDriveSearchTool

    • This class accepts a standard parameter api_wrapper of type GoogleDriveAPIWrapper
    • It proposes a method to search for documents in GDrive
  • To declare these classes in different default locations, I updated these files:

  • Now, it's time to review the unit tests

/retrievers/test_google_drive.py)

  • And finally, let's assess the impact on the documentation.
  • Please read the file google_drive.mdx

Furthermore, I'm not sure what else to do ;-)

This is the last time I'll be requesting a review of the code if there's no commitment to review.

I hope you'll now be able to validate my pull-request.

pprados avatar Aug 03 '23 11:08 pprados

Hey @pprados, reviewing now. May take a few days because there's a lot going on and this is changing an existing class, so we need to be sure we're backwards compatible and not breaking anything. Appreciate your patience (and the time you put into this!)

baskaryan avatar Aug 03 '23 20:08 baskaryan

I just fix the github script to reject the lint (and a little bug in test)

pprados avatar Aug 04 '23 13:08 pprados

Hello @baskaryan, I check the backwards compatibility of code, and I fix some bugs. When I started the code, the attributes file_loader_cls, file_loader_kwargs, load_trashed_files, file_types were not present.

I tested the code with the previous notebook, and all can be executed.

I add a specific chapter in unit-test, for the backwards compatibility.

I find some bugs or limitations in the current implementation.

Can you tell me when you plan to review the code? I have to admit I'm a bit worried, given all the commitments I've broken for so long.

pprados avatar Aug 09 '23 14:08 pprados

Hello @eyurtsev, @hwchase17, @vowelparrot, @baskaryan and @hinthornw,

Can someone contact me via discord, to organize a commented review, if that would make things easier?

pprados avatar Aug 22 '23 06:08 pprados

https://github.com/langchain-ai/langchain/pull/9999

pprados avatar Sep 05 '23 14:09 pprados