AutoGPT Stop ignoring small files in split

Background

The split_file function ignores small files due to the way it determines whether the last chunk has already been considered in the previous overlap. See #3753.

Changes

Added "and start > 0" to the relevant if statement to only ignore the potentially-overlapping chunk if it is not the first one.

Documentation

As there are no changes to the functionality, merely an edit to help the if condition match the intended purpose, the existing documentation comment should be sufficient.

Test Plan

See #3753 for steps to reproduce the issue.

This change does not seem to have impeded data_integration.py (which brought it to my attention) or the main AutoGPT file-reading functionality, based on about 30 min of testing.

Unless some code is reliant on getting no output from split_data for small functions, which did not seem to be the case in my grep of the repo, this change should not interact negatively with other code.

PR Quality Checklist

[x] My pull request is atomic and focuses on a single change.
[ ] I have thoroughly tested my changes with multiple different prompts.
[x] I have considered potential risks and mitigations for my changes.
[x] I have documented my changes clearly and comprehensively.
[x] I have not snuck in any "extra" small tweaks changes

This change does not add any new functionality, and only fixes previously-implemented behavior regarding the case where a file is less than the overlap threshold, so I have not added additional tests.

May 03 '23 19:05 swapneils

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 31, 2023 2:11pm

May 03 '23 19:05 vercel[bot]

This is a mass message from the AutoGPT core team. Our apologies for the ongoing delay in processing PRs. This is because we are re-architecting the AutoGPT core!

For more details (and for infor on joining our Discord), please refer to: https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

May 05 '23 00:05 p-i-

I think it would be great if we have a test for that case.

May 10 '23 18:05 k-boikov

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.08 :tada:

Comparison is base (63b79a8) 69.65% compared to head (249d302) 69.74%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3757      +/-   ##
==========================================
+ Coverage   69.65%   69.74%   +0.08%     
==========================================
  Files          72       72              
  Lines        3523     3523              
  Branches      562      562              
==========================================
+ Hits         2454     2457       +3     
+ Misses        881      878       -3     
  Partials      188      188

Impacted Files	Coverage Δ
autogpt/commands/file_operations.py	`83.52% <100.00%> (ø)`

... and 2 files with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

May 13 '23 00:05 codecov[bot]

You changed AutoGPT's behaviour. The cassettes have been updated and will be merged to the submodule when this Pull Request gets merged.

May 31 '23 14:05 Auto-GPT-Bot

split_file is not used anywhere anymore, except in its unit test. I propose to remove the function altogether instead of patching it.

I haven’t opened AutoGPT in a while, but as I recall it is referenced in file_operations.py, which may be relevant in the future for processing local data.

And on the design side, I do feel that data_ingestion functionality (which does use split_file) will be an important functionality once the use of memory is more nuanced (see for instance the recent GPT Minecraft bot Voyager, which uses a memory of “skill” code; it seems useful to manually add elements to a skill/perspective/context/goal library from file data)

Jun 07 '23 23:06 swapneils

as I recall it is referenced in file_operations.py

That is where split_file is defined, but there is no usage anywhere in the project.

[...] data_ingestion functionality (which does use split_file)

data_ingestion.py imports ingest_file from file_operations.py, which in turn uses read_file and MemoryItem.from_text_file; split_file is not referenced in this process.

Jun 11 '23 22:06 Pwuts

And on the design side, I do feel that data_ingestion functionality (which does use split_file) will be an important functionality once the use of memory is more nuanced (see for instance the recent GPT Minecraft bot Voyager, which uses a memory of “skill” code; it seems useful to manually add elements to a skill/perspective/context/goal library from file data)

That is definitely something we're working on with R&D :)

Jun 11 '23 22:06 Pwuts

AutoGPT
AutoGPT copied to clipboard

Stop ignoring small files in split_file

Background

Changes

Documentation

Test Plan

PR Quality Checklist

Codecov Report

AutoGPT AutoGPT copied to clipboard

Stop ignoring small files in split_file

Background

Changes

Documentation

Test Plan

PR Quality Checklist

Codecov Report

AutoGPT
AutoGPT copied to clipboard