
Issue: Rate limiting on large Google Drive folder

chrisedington opened this issue 2 years ago

I've got a GoogleDriveLoader implemented with an exponential backoff and a sleep function to try to further mitigate rate limits, but I still get rate limit errors from Google.

Even though I've added a time.sleep(5) statement, I assume it only takes effect before each attempt to load all the documents, not between individual API calls within the load() method.

import logging
import time

from langchain.document_loaders import GoogleDriveLoader, UnstructuredFileIOLoader
from tenacity import RetryError, retry, stop_after_attempt, wait_exponential

google_loader = GoogleDriveLoader(
    folder_id="xxxxxxxxx",
    credentials_path="credentials.json",
    token_path="token.json",
    file_types=["document", "sheet", "pdf"],
    file_loader_cls=UnstructuredFileIOLoader,
    recursive=True,
    verbose=True,
)

@retry(
    stop=stop_after_attempt(7), wait=wait_exponential(multiplier=2, min=60, max=300)
)
def load_documents():
    time.sleep(5)  # delay for 5 seconds before each load attempt
    return google_loader.load()

try:
    google_docs = load_documents()
except RetryError:
    logging.error("Exceeded retry attempts for Google API rate limit.")
    raise

The exception output:

ERROR:root:Exceeded retry attempts for Google API rate limit.
IndexError: list index out of range

Stacktrace:

DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/A_xx?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/A_xx?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/A_xx?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/BF%20IS?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/BF%20BS%20?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/Reporting%20IS?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/Reporting%20BS?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/Statistics?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/Val_Summary?alt=json
DEBUG:googleapiclient.discovery:URL being requested: GET https://sheets.googleapis.com/v4/spreadsheets/xxxxx-xxxxx/values/Val_Workings?alt=json
ERROR:root:Exceeded retry attempts for Google API rate limit.
Traceback (most recent call last):
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/Users/chris/Repositories/xxxx/ingest.py", line 44, in load_documents
    return google_loader.load()
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/langchain/document_loaders/googledrive.py", line 347, in load
    return self._load_documents_from_folder(
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/langchain/document_loaders/googledrive.py", line 248, in _load_documents_from_folder
    returns.extend(self._load_sheet_from_id(file["id"]))  # type: ignore
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/langchain/document_loaders/googledrive.py", line 173, in _load_sheet_from_id
    header = values[0]
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/chris/Repositories/xxxx/ingest.py", line 68, in <module>
    ingest_docs()
  File "/Users/chris/Repositories/xxxx/ingest.py", line 47, in ingest_docs
    google_docs = load_documents()
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/Users/chris/Repositories/xxxx/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 326, in iter
    raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x142dfab30 state=finished raised IndexError>]

Suggestion:

Since the loader works recursively, and the main use case for most people is loading a large Drive folder, would it be possible to build a rate limiter into the loader itself to slow down the individual API calls?

Alternatively, does anyone have any recommendations on how to better implement an exponential backoff?

chrisedington avatar Jul 07 '23 06:07 chrisedington

Answer generated by a 🤖

Answer

I understand that you're encountering rate limit errors with Google API calls despite implementing an exponential backoff and a sleep function. The time.sleep(5) statement you've added only affects the attempts to load all the documents, not the individual API calls within the load() method. This is likely why you're still hitting the rate limit.

To address this issue, one possible solution is to modify the GoogleDriveLoader class to include a delay between each API call. Here's how I would do it:

import time

from langchain.document_loaders import GoogleDriveLoader, UnstructuredFileIOLoader

class RateLimitedGoogleDriveLoader(GoogleDriveLoader):
    def _load_sheet_from_id(self, file_id):
        time.sleep(5)  # delay for 5 seconds before each Sheets request
        return super()._load_sheet_from_id(file_id)

    def _load_document_from_id(self, file_id):
        time.sleep(5)  # delay for 5 seconds before each Docs request
        return super()._load_document_from_id(file_id)

    def _load_file_from_id(self, file_id):
        time.sleep(5)  # delay for 5 seconds before each file download
        return super()._load_file_from_id(file_id)

google_loader = RateLimitedGoogleDriveLoader(
    folder_id="xxxxxxxxx",
    credentials_path="credentials.json",
    token_path="token.json",
    file_types=["document", "sheet", "pdf"],
    file_loader_cls=UnstructuredFileIOLoader,
    recursive=True,
    verbose=True,
)

In this code, I've created a new class RateLimitedGoogleDriveLoader that inherits from GoogleDriveLoader. I've overridden the _load_sheet_from_id, _load_document_from_id, and _load_file_from_id methods to include a delay before each API call. This should help to mitigate the rate limit errors you're experiencing.
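
A variation on the same idea, also untested, would be to retry the individual calls with exponential backoff instead of a fixed sleep, reacting to HttpError responses from the Google API client. The method names below are the same private methods as above and may change between LangChain versions:

from googleapiclient.errors import HttpError
from langchain.document_loaders import GoogleDriveLoader
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry each per-file request with exponential backoff when the Google client raises an error,
# instead of sleeping a fixed amount before every call.
per_call_retry = retry(
    retry=retry_if_exception_type(HttpError),
    stop=stop_after_attempt(7),
    wait=wait_exponential(multiplier=2, min=4, max=120),
)

class BackoffGoogleDriveLoader(GoogleDriveLoader):
    """Sketch only: back off on each Drive/Sheets request rather than around the whole load()."""

    @per_call_retry
    def _load_sheet_from_id(self, file_id):
        return super()._load_sheet_from_id(file_id)

    @per_call_retry
    def _load_document_from_id(self, file_id):
        return super()._load_document_from_id(file_id)

    @per_call_retry
    def _load_file_from_id(self, file_id):
        return super()._load_file_from_id(file_id)

In practice you would probably want to inspect the status code and only retry on 429 and 5xx responses, but this keeps the example short.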

Please note that this is just one potential solution and might not be the most efficient one. Depending on your specific use case, you might want to consider using a more sophisticated rate limiting strategy, such as the token bucket or leaky bucket algorithm.
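
For illustration only, a minimal token bucket could look roughly like the sketch below; the class and the numbers are placeholders rather than anything in LangChain, and acquire() would be called before each API request, for example at the top of the overridden methods above.

import time

class TokenBucket:
    """Minimal token bucket: allows short bursts, then throttles to a steady rate."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec           # tokens refilled per second
        self.capacity = capacity           # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until one token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

# Example: allow bursts of 5 requests, then roughly 1 request per second.
bucket = TokenBucket(rate_per_sec=1.0, capacity=5)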

If you find this solution helpful and believe it could benefit others, I encourage you to make a pull request. Your contribution would be greatly appreciated by the community.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

dosubot[bot] avatar Jul 07 '23 11:07 dosubot[bot]

The solution above should work (I have not tested it). Additionally, I would like to see some kind of callback so we can begin processing documents as they are found, rather than loading the entire list into memory, hitting an issue further down the line, and having to redo the entire process (it is not uncommon for us to have drives with tens of thousands of documents).
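
To make the idea concrete, here is a rough, untested sketch of the kind of hook I mean. It piggybacks on the same private _load_*_from_id methods as the generated answer above, which may change between LangChain versions:

from typing import Any, Callable, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders import GoogleDriveLoader

class StreamingGoogleDriveLoader(GoogleDriveLoader):
    """Sketch only: invoke a callback as each document is loaded, instead of waiting for load()."""

    # Called once per loaded Document, before load() returns the full list.
    on_document: Optional[Callable[[Document], None]] = None

    def _emit(self, result: Any) -> Any:
        # Normalise, in case a loader returns a single Document rather than a list.
        docs: List[Document] = result if isinstance(result, list) else [result]
        if self.on_document is not None:
            for doc in docs:
                self.on_document(doc)
        return result

    def _load_sheet_from_id(self, file_id):
        return self._emit(super()._load_sheet_from_id(file_id))

    def _load_document_from_id(self, file_id):
        return self._emit(super()._load_document_from_id(file_id))

    def _load_file_from_id(self, file_id):
        return self._emit(super()._load_file_from_id(file_id))

Callers could then pass on_document= to index or persist each document as it arrives, so a failure partway through does not throw away everything loaded so far.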

tedsecretsource avatar Sep 22 '23 07:09 tedsecretsource

Hi, @chrisedington! I'm helping the LangChain team manage their backlog and am marking this issue as stale.

It looks like you raised an issue about hitting rate limit errors from the Google Drive API despite implementing exponential backoff and a sleep function. I suggested modifying the GoogleDriveLoader class to add delays between calls, and tedsecretsource suggested adding a callback for processing documents as they are found, to avoid loading the entire list into memory.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation!

dosubot[bot] avatar Dec 22 '23 16:12 dosubot[bot]