Fix split_text chunking bug

Open vzla0094 opened this pull request 2 years ago • 27 comments

Background

Handle long paragraphs in the split_text function by splitting them into smaller chunks, ensuring that no chunk exceeds max_length.

Fixes: https://github.com/Significant-Gravitas/Auto-GPT/issues/1820, https://github.com/Significant-Gravitas/Auto-GPT/issues/1211, https://github.com/Significant-Gravitas/Auto-GPT/issues/796, https://github.com/Significant-Gravitas/Auto-GPT/issues/38

Changes

  • Updated split_text function to handle paragraphs longer than max_length by splitting them into smaller chunks (a sketch of the approach follows this list)
  • Added a while loop to process long paragraphs and create sub_paragraphs of length max_length
  • Maintained consistency with the original implementation for appending chunks to current_chunk and updating current_length
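
A minimal sketch of the chunking approach described above; the function name and parameters follow this description, but the exact code in the PR may differ:

def split_text(text: str, max_length: int = 8192) -> list[str]:
    """Split text into chunks of at most max_length characters,
    breaking up any single paragraph that is itself too long."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    chunks: list[str] = []
    current_chunk: list[str] = []
    current_length = 0

    for paragraph in paragraphs:
        # Break an oversized paragraph into max_length-sized sub_paragraphs
        while len(paragraph) > max_length:
            sub_paragraph, paragraph = paragraph[:max_length], paragraph[max_length:]
            if current_chunk:  # flush pending text first so order is preserved
                chunks.append("\n".join(current_chunk))
                current_chunk, current_length = [], 0
            chunks.append(sub_paragraph)
        if not paragraph:
            continue

        # Normal case: append to the current chunk or start a new one
        if current_length + len(paragraph) + 1 <= max_length:
            current_chunk.append(paragraph)
            current_length += len(paragraph) + 1
        else:
            if current_chunk:
                chunks.append("\n".join(current_chunk))
            current_chunk = [paragraph]
            current_length = len(paragraph) + 1

    if current_chunk:
        chunks.append("\n".join(current_chunk))
    return chunks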

Documentation

  • Added comments in the code explaining the chunk-splitting logic step by step

Test Plan

  • Manually test the updated split_text function with different input text scenarios, including long paragraphs and varying max_length values
  • Ensure that the function works as expected and no chunks exceed the specified max_length

PR Quality Checklist

  • [x] My pull request is atomic and focuses on a single change.
  • [x] I have thoroughly tested my changes with multiple different prompts.
  • [x] I have considered potential risks and mitigations for my changes.
  • [x] I have documented my changes clearly and comprehensively.
  • [x] I have not snuck in any "extra" small tweaks or changes

vzla0094 avatar Apr 17 '23 04:04 vzla0094

Asked the team to merge out of band

nponeccop avatar Apr 17 '23 11:04 nponeccop

@vzla0094 we aren't merging into stable, can you change the base branch back to master?

Pwuts avatar Apr 17 '23 12:04 Pwuts

I'm not ready to merge this as is due to code quality. It looks unpythonic.

Code should self-document. We don't say i += 1 # add one to i.

And there is surely a more Pythonic way to chunk a string, perhaps using something from itertools, or a generator.
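
For illustration, fixed-size string chunking can be done with a short generator, or with itertools.islice; a sketch of the idea only, not code from this PR:

from itertools import islice

def chunked(text: str, size: int):
    """Yield successive size-character chunks of text."""
    for i in range(0, len(text), size):
        yield text[i : i + size]

def chunked_islice(text: str, size: int):
    """Same idea using itertools: islice takes size characters at a
    time from one shared iterator until it is exhausted."""
    it = iter(text)
    while chunk := "".join(islice(it, size)):
        yield chunk

list(chunked("abcdefgh", 3))  # ['abc', 'def', 'gh']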

p-i- avatar Apr 17 '23 12:04 p-i-

Closing this as I think #2062 is doing this better

p-i- avatar Apr 17 '23 16:04 p-i-

Hey @p-i-, neither #2062 nor #2088 fix the mentioned issue, as also stated inside #2062.

I've checked both solutions by applying the changes to the stable branch, and neither fixed it. The error usually happens with large texts, especially on long URLs.

openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 9221 tokens (9221 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

vaknin avatar Apr 17 '23 20:04 vaknin

@p-i- The referenced #2062 doesn't address the split_text function, which is the one involved in the "max_token_limit" error. See @vaknin's message.

This one does; I can find a way to tidy this one up if you'd like to re-open it.

vzla0094 avatar Apr 17 '23 21:04 vzla0094

Sure, go ahead. And as @p-i- already mentioned, in rewriting the PR, using existing functionality from the standard library is preferable over DIY implementations. :)

Pwuts avatar Apr 17 '23 22:04 Pwuts

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions[bot] avatar Apr 17 '23 22:04 github-actions[bot]

Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.

github-actions[bot] avatar Apr 18 '23 01:04 github-actions[bot]

Sure, go ahead. And as @p-i- already mentioned, in rewriting the PR, using existing functionality from the standard library is preferable over DIY implementations. :)

Just pushed an update removing the comments and also restructuring. I still don't think it's really easy to understand, though; what do you guys think? Feel free to push modifications, or I could also use a third-party chunking library like Funcy.

I'm not a Python dev, just trying things out. The code works, though, but please feel free to point me in the right direction.

vzla0094 avatar Apr 18 '23 01:04 vzla0094

It seems like it still doesn't resolve exceeding the maximum content length. I've merged the latest PR from @vzla0094 into the stable branch, and the InvalidRequestError remains. Full traceback:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\kivan\Desktop\Auto-GPT\autogpt\__main__.py", line 53, in <module>
    main()
  File "C:\Users\kivan\Desktop\Auto-GPT\autogpt\__main__.py", line 49, in main
    agent.start_interaction_loop()
  File "C:\Users\kivan\Desktop\Auto-GPT\autogpt\agent\agent.py", line 65, in start_interaction_loop
    assistant_reply = chat_with_ai(
                      ^^^^^^^^^^^^^
  File "C:\Users\kivan\Desktop\Auto-GPT\autogpt\chat.py", line 85, in chat_with_ai
    else permanent_memory.get_relevant(str(full_message_history[-9:]), 10)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kivan\Desktop\Auto-GPT\autogpt\memory\local.py", line 124, in get_relevant
    embedding = create_embedding_with_ada(text)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kivan\Desktop\Auto-GPT\autogpt\llm_utils.py", line 137, in create_embedding_with_ada
    return openai.Embedding.create(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kivan\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\api_resources\embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kivan\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\api_resources\abstract\engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
                           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\kivan\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kivan\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "C:\Users\kivan\AppData\Local\Programs\Python\Python311\Lib\site-packages\openai\api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 9208 tokens (9208 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

vaknin avatar Apr 18 '23 08:04 vaknin

@vaknin after which command do you get that error?

Pwuts avatar Apr 18 '23 13:04 Pwuts

@vaknin after which command do you get that error?

COMMAND = google

vaknin avatar Apr 18 '23 13:04 vaknin

@vaknin this PR is a fix for ingesting files, so your comment is unrelated.

Pwuts avatar Apr 18 '23 15:04 Pwuts

@vzla0094 CI is red now

nponeccop avatar Apr 18 '23 19:04 nponeccop

I started playing around with using Auto-GPT to ingest a large text file and ran into this issue. I don't know much about tokenizing, so I looked around and saw that transformers had a way of doing it, but I had to plug in a model and add another import. I don't like it, and it also keeps looping through the file, but I can't keep playing with it today, so I figured I'd leave it out there in case others were having problems. I found the issue to be in llm_utils.py, but the fact that I'm getting a loop after the fix below may mean that it's ingested somewhere else, maybe? In any case, I got the error to stop (openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens).

Sorry, wish I had more time! Here's what I changed (only the create_embedding_with_ada function):

# Module-level imports from llm_utils.py are assumed here
# (openai, time, Fore, RateLimitError, APIError, CFG)
def create_embedding_with_ada(text) -> list:
    """Create an embedding with a model (not sure which yet) using the OpenAI SDK"""
    from transformers import AutoTokenizer

    max_context_length = 8000

    # Tokenize the text (the openai-gpt tokenizer is a stand-in here;
    # it is not the actual text-embedding-ada-002 tokenizer)
    ada_tokenizer = AutoTokenizer.from_pretrained("openai-gpt", use_fast=True)
    tokens = (
        ada_tokenizer.encode(text, return_tensors="pt", add_special_tokens=False)
        .squeeze()
        .tolist()
    )

    # Truncate the tokens to the model's maximum context length
    tokens = tokens[:max_context_length]

    # Convert token chunks back to text
    truncated_text = ada_tokenizer.decode(tokens, skip_special_tokens=True)

    num_retries = 10
    for attempt in range(num_retries):
        backoff = 2 ** (attempt + 2)
        try:
            if CFG.use_azure:
                embedding = openai.Embedding.create(
                    input=[truncated_text],
                    engine=CFG.get_azure_deployment_id_for_model(
                        "text-embedding-ada-002"
                    ),
                )["data"][0]["embedding"]
            else:
                embedding = openai.Embedding.create(
                    input=[truncated_text], model="text-embedding-ada-002"
                )["data"][0]["embedding"]
            return embedding
        except RateLimitError:
            pass
        except APIError as e:
            if e.http_status == 502:
                pass
            else:
                raise
            if attempt == num_retries - 1:
                raise
        if CFG.debug_mode:
            print(
                Fore.RED + "Error: ",
                f"API Bad gateway. Waiting {backoff} seconds..." + Fore.RESET,
            )
        time.sleep(backoff)
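
For comparison, the same truncation can be done without transformers by using tiktoken, whose cl100k_base encoding is the one actually used by text-embedding-ada-002. A minimal sketch, assuming tiktoken is installed:

import tiktoken

def truncate_for_ada(text: str, max_tokens: int = 8191) -> str:
    """Truncate text to at most max_tokens tokens of the cl100k_base
    encoding used by text-embedding-ada-002."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:max_tokens])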

s0meguy1 avatar Apr 18 '23 19:04 s0meguy1

@vzla0094 CI is red now

@Pwuts @nponeccop

Just seeing this, I'll push updates addressing the suggestions and fixing the CI

vzla0094 avatar Apr 18 '23 20:04 vzla0094

You can fix the linting errors with python -m black . && python -m isort .

Pwuts avatar Apr 19 '23 00:04 Pwuts

Hey @vzla0094 -

Are you sure the issue is in text.py? The error I get with both the original code and updated code is in llm_utils.py:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/Auto-GPT/autogpt/__main__.py", line 5, in <module>
    autogpt.cli.main()
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1635, in invoke
    rv = super().invoke(ctx)
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/user/.local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/user/Auto-GPT/autogpt/cli.py", line 121, in main
    agent.start_interaction_loop()
  File "/home/user/Auto-GPT/autogpt/agent/agent.py", line 184, in start_interaction_loop
    self.memory.add(memory_to_add)
  File "/home/user/Auto-GPT/autogpt/memory/local.py", line 76, in add
    embedding = create_embedding_with_ada(text)
  File "/home/user/Auto-GPT/autogpt/llm_utils.py", line 155, in create_embedding_with_ada
    return openai.Embedding.create(
  File "/home/user/.local/lib/python3.10/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "/home/user/.local/lib/python3.10/site-packages/openai/api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/home/user/.local/lib/python3.10/site-packages/openai/api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "/home/user/.local/lib/python3.10/site-packages/openai/api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 34798 tokens (34798 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

s0meguy1 avatar Apr 19 '23 00:04 s0meguy1

I emptied all the functions in text.py and just added print statements (print("HERE"), so the rest of the app started correctly) to make sure the text chunks didn't pass through them. It started fine, began the process of loading the large text file, and failed again with the same error. Unless this thread is doing something differently, I don't think text.py is the answer. I'll keep poking at llm_utils.py and see if I can get it working.

s0meguy1 avatar Apr 19 '23 00:04 s0meguy1

@s0meguy1 I'm not sure about any other failing function, but this split_text is definitely one of the causes. One thing you could do to find out is to run the original split_text from the master branch against the test I added in this PR; you'll see how buggy it is.
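
The test added in the PR isn't reproduced here, but a check along these lines exposes the bug (a hypothetical test, not necessarily the one in the PR):

from autogpt.processing.text import split_text

def test_split_text_handles_long_paragraphs():
    # A single paragraph much longer than max_length used to come
    # back as one oversized chunk
    text = "x" * 100
    chunks = list(split_text(text, max_length=30))
    assert chunks
    assert all(len(chunk) <= 30 for chunk in chunks)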

vzla0094 avatar Apr 19 '23 01:04 vzla0094

@vaknin @Pwuts @s0meguy1 I think I know where the confusion comes from: I might have linked the wrong issues here, but this PR fixes the split_text function that's used by the web scraper/browser command, not file ingestion or Google searching.

vzla0094 avatar Apr 19 '23 01:04 vzla0094

@vzla0094 doesn't look too hard to refactor file_operations.py to use processing/text.py > split_text() https://github.com/Significant-Gravitas/Auto-GPT/blob/master/autogpt/commands/file_operations.py#L52
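
Roughly, the ingestion code could delegate its chunking to split_text instead of slicing the file contents itself. A sketch of the idea only; the function and parameter names other than split_text are assumptions, not the actual file_operations.py code:

from autogpt.processing.text import split_text

def ingest_file(filename: str, memory, max_length: int = 4000) -> None:
    """Read a file and store it in memory in chunks no longer than max_length."""
    with open(filename, encoding="utf-8") as f:
        content = f.read()
    for i, chunk in enumerate(split_text(content, max_length=max_length)):
        memory.add(f"Filename: {filename}\nContent part#{i + 1}: {chunk}")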

Pwuts avatar Apr 19 '23 01:04 Pwuts

@Pwuts it does look like an easy fix hahah, but I don't want to risk having to spend more time on this in case some edge case arises lol.

If this one is merged I might find some time tomorrow to do the quick fix for the other one :)

vzla0094 avatar Apr 19 '23 01:04 vzla0094

This PR splits the text based on character count, not token count. It also splits in the middle of a sentence.

Can I recommend that you take a look at #2542, which solves these issues?
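
For reference, splitting on token count rather than characters is straightforward with tiktoken (a sketch; #2542's actual approach may differ, and this version still splits mid-sentence, so a real fix would also split on sentence boundaries first):

import tiktoken

def split_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]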

bszollosinagy avatar Apr 19 '23 14:04 bszollosinagy

If you want, you can just merge this PR; after all, vzla0094 put some work into it. Then I'll just adjust my PR to make the additional changes on top of it.

bszollosinagy avatar Apr 19 '23 15:04 bszollosinagy

If you want, you can just merge this PR; after all, vzla0094 put some work into it. Then I'll just adjust my PR to make the additional changes on top of it.

Whatever's best for everyone 🤷‍♂️ but yeah I think you might want to use the tests at least. Nice job on your PR btw

vzla0094 avatar Apr 19 '23 17:04 vzla0094

@vzla0094 we'll merge #2542 for the upcoming release and cherry pick your test, probably soon after. Thanks a lot for the work, and sorry for having you do all of it before (partially) turning it down. 😅

Pwuts avatar Apr 19 '23 21:04 Pwuts

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

github-actions[bot] avatar Apr 19 '23 21:04 github-actions[bot]

This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!

For more details (and for info on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

p-i- avatar May 05 '23 00:05 p-i-