langchain
langchain copied to clipboard
Add full Google Drive features.
Reimplement the Google Drive features
Propose :
- langchain.docstore.GoogleDriveDocStore
- langchain.document_loaders.GoogleDriveLoader
- langchain.utilities.GoogleDriveAPIWrapper
- langchain.tools.GoogleDriveSearchTool
- langchain.utilities.GoogleDriveUtilities
Features:
- Fully compatible with Google Drive API
- Manage file in trash
- Manage shortcut
- Manage file description
- Paging with request GDrive list()
- Multiple kind of template for request GDrive
- Convert a lot of mime type (can be configured). The list is adjusted according to the packages availables
- Convert GDoc, GSheet and GSlide with differents modes
- Can use only the description of files, without loading and conversion of the body
- Lambda fine filter
- Remove duplicate documents (in case of shortcut)
- Add Url to documents (or part of documents like specific slide)
- Use environment variable for reference an API tokens
- Manage different king of strange state with Google File (absence of URL, etc.)
- Use fully lazy strategy to save memory
Recognition
If you accept my pull-request, you can mention me @pprados. Thanks
Before submitting
Unit-tests coverage >80% of new code
No integration test, but some notebook to show how to use.
- docs/modules/agents/tools/examples/google_drive.ipynb
- docs/modules/indexes/document_loaders/examples/google_drive.ipynb
- docs/modules/indexes/retrievers/examples/google_drive.ipynb
Who can review?
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:
@eyurtsev @hwchase17 @vowelparrot might be interested
@vowelparrot I have 3 workflow awaitings approval. You must accept to start these jobs?
I'd gladly collaborate on improving Google Drive support on LangChain. The above test failled since googleapiclient is a new requirement for poetry.lock.
I resolve this problem. I test the github action in my branch, and now all workflow are correct.
Sorry. I fix the format now. It's not possible to start by hand, the workflow lint
@eyurtsev Can you run the workflows to validate this version ?
After this pull-request, I will normalize the link with google token, to use one scope for google drive, gmail, etc.
@eyurtsev, can you active the workflows, and if all is correct, can you review this code?
@eyurtsev, sorry for the last error. In my branch, the workflows run without error.
@eyurtsev, I have a question about my implementation.
In the method _lazy_load_file_from_file(), I had an optional run_manager. I can use this manager if I catch an exception.
if isinstance(run_manager, CallbackManagerForToolRun):
run_manager.on_tool_error(e)
elif isinstance(run_manager, CallbackManagerForChainRun):
run_manager.on_chain_error(e)
But the code is not clean. Do you have a better idea?
I initiate the possibility to use a lazy approach in the retriever.
def lazy_get_relevant_documents(self,query:str) -> Iterator[Document]:
...
The default implementation transforms a classic list of documents to an iterator. But, the subclasses can be choice to implements a lazy approach, to optimize the memory footprint.
In my Google Drive utilities, I use a lazy approach.
Later, I would like to update the link between the loaders and vectordb, to use a lazy approach if it's possible. Then, a loader can return a big number of documents to import, without problems with the memory.
The code changes every day, so I must make rebase another time. @eyurtsev, If you star the workflows quickly, all will be correct ;-)
Hello @hwchase17, The master moves every day. So, sometime I have the 3 workflows OK, sometime I have no conflicts with the base branch. Else, I must rebase every day. Can you analyze this pull request or find someone to do it?
Thanks
@hwchase17, @eyurtsev, Another time, all checks have passed, but I have a conflicting files: poetry.lock.
I rebase my code with the last version. Can you review the code before the next release ? Please ;-)
Hello @eyurtsev, @hwchase17,
I'm sure you've got plenty of pull requests to validate. New version with a rebase from 12 June. Maybe, other people can revu my code. I have been waiting for 3 weeks now. Thanks
Hello @eyurtsev, @hwchase17,
I'm sure you've got plenty of pull requests to validate. New version with a rebase from 12 June. Maybe, other people can revu my code. I have been waiting for 3 weeks now. Thanks
Unfortunately, I don't believe I can properly participate in the review of @pprados contribution. However, it is clearly a significant and useful improvement on the current implementation.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Ignored Deployment
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
langchain | ⬜️ Ignored (Inspect) | Visit Preview | Sep 5, 2023 2:16pm |
@eyurtsev or @agola11, may be, you can participate the review? Thanks
Hello @hwchase17, Because the last version (205) have a bug, it's not possible to build the doc, and check my code. I waiting another version to update my pull request.
@pprados is attempting to deploy a commit to the LangChain Team on Vercel.
A member of the Team first needs to authorize it.
I've adjusted the code from the version 0.0.216.
If someone would like to test the different implementation, before the pull request is validated, you can visit here.
@eyurtsev, please, I can do something to validate this pull request or tell me what it is.
hey @pprados, sorry for delay! this pr is a bit tough to review given it's size. would it make sense to split it up? sounds like there's at least 5 fairly modular/independent classes being added
cc @hinthornw
Hello @baskaryan and @hinthornw,
"5 fairly modular/independent classes being added".
It's not really the case. All these modular/independent classes extend the same class [GoogleDriveUtilities](https://github.com/pprados/langchain/blob/pprados/google_drive/langchain/tools/google_drive/tool.py)
with just some modification in the default values.
The effort is essentially in this file. The rest is easy to analyze.
Some unit-test, some little modification to generate the doc with a hierarchy.
I will fix the code, another time Monday, to update the code with the last version of langchain. But may be, you can comment the current version.
Thanks
Hello @baskaryan and @hinthornw,
Thank you for taking the time to analyze my code.
I've just resynchronized it with the upstream/master
branch (version 222), and I check the Github Action test by hand, in my clone. All tests have been passed.
I hope this will also be the case on the official repo.
I'm sorry if it's so long, but there are a number of different levels, just to make it's complete. The code implements an utility, an API wrapper, a Store and a DocumentLoader.
My English isn't perfect, so please feel free to correct the documentation if necessary.
I'm trying to help with these information. To analyze this pull-request, I propose these steps:
-
Try to use it:
-
git clone https://github.com/pprados/langchain.git ; cd langchain
-
poetry run jupyter notebook
- Test 3 notebooks:
- Open The notebook for GoogleDriveLoader and test
- Open The notebook for GoogleDriveRetriever and test
- Open The notebook for GoogleDrive Tool and test
-
-
Read the root class langchain.utilitiles.google_drive.GoogleDriveUtilities
- It's the implementation of GDriver API
- All other classes extend this one, with specific default parameters
and an implementation of the
@abstract
methods. - The code uses a lazy strategy (only one
Document
is loaded in memory) - The code can load GDoc, GSheet and GSlide documents, or, for other documents, use a dictionary with and association between the mime-type and a loader.
- A loop loads the first page of
list()
API, extracts some files, and continue with the next pages. - Some stuff was applied if the document is a shortcut, if some documents haven't some metadata, etc.
- All list() parameters It can be set in this class.
- To initialize the credential, I propose to use an environment variable
GOOGLE_ACCOUNT_FILE
. The file can be for a user or a service (I analyze the file to use the correct API).
-
Read the class langchain.utilities.google_drive.GoogleDriveAPIWrapper
- This class extend ̀
GoogleDriveUtilities
- Because this wrapper is used with tools, the return value must be short.
- I propose different modes to use a snippet or the first lines of the body.
- I implement the method
run(query)
and, like bing_search,results()
.
- This class extend ̀
-
Read the class langchain.document_loaders.google_drive.GoogleDriveLoader
- This class extend ̀
GoogleDriveUtilities
- To be compatible with the older version of
GoogleDriveLoader
, I manage the deprecated parametersservice_account_key
andcredentials_path
.: - I also implement the parameters
documents_ids
andfile_ids
, but I declare these parameters are deprecated. - I implement the
lazy_load()
andload()
API - I propose a tool to update the
description
attribute in GDrive file with the summary of the body of the document (lazy_update_description_with_summary()
andupdate_description_with_summary()
.
- This class extend ̀
-
Read the class langchain.retriever.google_drive.GoogleDriveRetriever
- This class extends
GoogleDriveUtilities
- Implement
get_relevent_documents()
- This class extends
-
Read the class langchain.tools.google_drive.tools.GoogleDriveSearchTool
- This class accepts a standard parameter
api_wrapper
with the typeGoogleDriveAPIWrapper
- And propose a description to search documents in GDrive
- This class accepts a standard parameter
-
Read the class langchain.docstore.google_drive.GoogleDriveDocStore
- This class extends
GoogleDriveUtilities
andDocstore
- Implement
search()
- This class extends
-
To declare these classes in different defaults locations, I updated these files:
-
It's time to read the unit-tests
-
Read the file
tests/unit_tests/utilities/test_google_drive.py
- This class simulates different situations with the GDrive API.)
- Some tests can use the files from
tests/unit_tests/utilities/examples/*
-
Read the file
tests/unit_tests/tools/test_google_drive.py
-
Read the file
tests/unit_tests/document_loaders/test_google_drive.py
-
Read the file
tests/unit_tests/retrievers/test_google_drive.py
-
Read the file
tests/unit_tests/docstore/test_google_drive.py
-
-
And, the impact on the documentation. Because the documentation must be generated with the inherited classes, I add some parameters
- Read the file
docs/extras/integrations/google_drive.mdx
- Read the file
docs/api_reference.rst'
- At this time, it's impossible to use
make docs_build
ormake docs_compile
. I can not test the generated docs. :-(
- Read the file
More, I don't know what to do ;-)
I hope you'll now be able to validate my pull-request.
:worried:
Hello @eyurtsev @hwchase17 @vowelparrot, @baskaryan, and @hinthornw,
Once again, I've resynchronized the code. I've been desperately waiting for a code review for 10 weeks. I suppose this is the last time I'll be modifying the code before reluctantly giving up.
Thank you for taking the time to analyze my code.
I've just resynchronized it with the upstream/master
branch (version 250). I cannot check the Github Action test manually now; it's deactivated.
Quick review
For a quick review of the features:
- Please read this notebook with a document loader
- Please read this notebook with a retriever
- And read this notebook with a tool
Full review
I apologize for its length, but there are several different levels to ensure completeness. The code implements a utility, an API wrapper, a tool, and a DocumentLoader.
My English isn't perfect, so please feel free to correct the documentation if necessary.
I'm providing this information to assist with the review. To analyze this pull-request, I propose the following steps:
-
Please try using it:
-
git clone https://github.com/pprados/langchain.git ; cd langchain
-
poetry run jupyter notebook
- Test 3 notebooks:
- Open The notebook for GoogleDriveLoader and test
- Open The notebook for GoogleDriveRetriever and test
- Open The notebook for GoogleDrive Tool and test
-
-
Now, let's review the root class langchain.utilities.google_drive.GoogleDriveUtilities
- It's the implementation of the GDriver API
- All other classes extend this one, with specific default parameters and an implementation of the
@abstract
methods. - The code uses a lazy strategy (only one
Document
is loaded in memory) - The code can load GDoc, GSheet, and GSlide documents, or, for other documents, use a dictionary with an association between the mime-type and a loader.
- A loop loads the first page of
list()
API, extracts some files, and continues with the next pages. - Some modifications were applied if the document is a shortcut, or if some documents lack metadata, etc.
- All list() parameters can be set in this class.
- To initialize the credentials, I propose using an environment variable
GOOGLE_ACCOUNT_FILE
. The file can be for a user or a service (I analyze the file to use the correct API).
-
Now, let's review the class langchain.utilities.google_drive.GoogleDriveAPIWrapper
- This class extends
GoogleDriveUtilities
- Since this wrapper is used with tools, the return value must be short.
- I propose different modes to use a snippet or the first lines of the body.
- I have implemented the method
run(query)
and, likebing_search
,results()
.
- This class extends
-
Now, let's review the class langchain.document_loaders.google_drive.GoogleDriveLoader
- This class extends
GoogleDriveUtilities
- To be compatible with the older version of
GoogleDriveLoader
, I have managed the deprecated parametersservice_account_key
andcredentials_path
. - I have also implemented the parameters
documents_ids
andfile_ids
, but I declare that these parameters are deprecated. - I have implemented the
lazy_load()
andload()
APIs - I propose a tool to update the
description
attribute in GDrive file with a summary of the body of the document (lazy_update_description_with_summary()
andupdate_description_with_summary()
).
- This class extends
-
Now, let's review the class langchain.retriever.google_drive.GoogleDriveRetriever
- This class extends
GoogleDriveUtilities
- It implements
get_relevant_documents()
- This class extends
-
Now, let's review the class langchain.tools.google_drive.tools.GoogleDriveSearchTool
- This class accepts a standard parameter
api_wrapper
of typeGoogleDriveAPIWrapper
- It proposes a method to search for documents in GDrive
- This class accepts a standard parameter
-
To declare these classes in different default locations, I updated these files:
-
Now, it's time to review the unit tests
-
Please read the file
tests/unit_tests/utilities/test_google_drive.py
- This class simulates different situations with the GDrive API.
- Some tests can use the files from
tests/unit_tests/utilities/examples/*
-
Please read the file
tests/unit_tests/tools/test_google_drive.py
-
Please read the file
tests/unit_tests/document_loaders/test_google_drive.py
-
Please read the file [
tests/unit_tests/retrievers/test_google_drive.py
](https://github.com/pprados/langchain/blob/pprados/google_drive/libs/langchain/tests/unit_tests
-
/retrievers/test_google_drive.py)
- And finally, let's assess the impact on the documentation.
- Please read the file google_drive.mdx
Furthermore, I'm not sure what else to do ;-)
This is the last time I'll be requesting a review of the code if there's no commitment to review.
I hope you'll now be able to validate my pull-request.
Hey @pprados, reviewing now. May take a few days because there's a lot going on and this is changing an existing class, so we need to be sure we're backwards compatible and not breaking anything. Appreciate your patience (and the time you put into this!)
I just fix the github script to reject the lint (and a little bug in test)
Hello @baskaryan,
I check the backwards compatibility of code, and I fix some bugs.
When I started the code, the attributes file_loader_cls
, file_loader_kwargs
, load_trashed_files
, file_types
were not present.
I tested the code with the previous notebook, and all can be executed.
I add a specific chapter in unit-test, for the backwards compatibility.
I find some bugs or limitations in the current implementation.
Can you tell me when you plan to review the code? I have to admit I'm a bit worried, given all the commitments I've broken for so long.
Hello @eyurtsev, @hwchase17, @vowelparrot, @baskaryan and @hinthornw,
Can someone contact me via discord, to organize a commented review, if that would make things easier?
https://github.com/langchain-ai/langchain/pull/9999