Fixes for null data results and OpenAI embedding limits
Description
I was trying to load in a large sitemap and hit a 403 from OpenAI. Turns out that there are limitations on the embeddings endpoints:
- The input parameter may not take a list longer than 2048 elements (chunks of text).
- The total number of tokens across all list elements of the input parameter cannot exceed 1,000,000. (Because the rate limit is 1,000,000 tokens per minute.)
- Each individual array element (chunk of text) cannot be more than 8191 tokens.
Discussion can be found here: https://github.com/openai/openai-python/issues/519#issuecomment-1636921388
This PR addresses the first limit by splitting the documents into batches of at most 2048 elements, so no single request sends a longer array.
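For reference, here is a minimal sketch of the batching idea. It is not the exact code in this PR: `iter_batches`, `embed_in_batches`, and `embed_fn` are hypothetical names, and `embed_fn` stands in for whatever function actually sends one list of chunks to the embeddings endpoint.

```python
from typing import Callable, Iterator, List, Sequence

# Per-request list limit described above (assumed constant name).
MAX_EMBEDDING_INPUTS = 2048


def iter_batches(
    items: Sequence[str], batch_size: int = MAX_EMBEDDING_INPUTS
) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = MAX_EMBEDDING_INPUTS,
) -> List[List[float]]:
    """Call the hypothetical `embed_fn` once per batch and keep results in input order."""
    embeddings: List[List[float]] = []
    for batch in iter_batches(chunks, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings
```

This only covers the 2048-element limit. The per-chunk and total-token limits would need a token count per chunk, which this sketch does not attempt.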
There is also a small fix for errors that occur when the Weaviate provider doesn't retrieve any results.
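The kind of guard involved can be sketched as follows. This is an illustration, not the code in this PR: `extract_hits` is a hypothetical helper, and the nested `data` / `Get` / class-name response shape is an assumption about how the query result is structured.

```python
from typing import Any, Dict, List, Optional


def extract_hits(
    response: Optional[Dict[str, Any]], class_name: str
) -> List[Dict[str, Any]]:
    """Return the hits for `class_name`, or [] if any level is missing or null."""
    # Assumed shape: {"data": {"Get": {class_name: [...]}}}
    data = (response or {}).get("data") or {}
    get_section = data.get("Get") or {}
    return get_section.get(class_name) or []
```

Callers can then iterate over the returned list without first checking for `None`.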
Type of change
Please delete options that are not relevant.
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Refactor (does not change functionality, e.g. code style improvements, linting)
- [ ] Documentation update
How Has This Been Tested?
I was trying to load this sitemap:
https://zonos.com/sitemap.xml
The current non-batching implementation fails on it when calling the OpenAI API. With this fix, the load completes cleanly. General usage and operations have been running fine with the change so far.
Checklist:
- [ ] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I have checked my code and corrected any misspellings
Maintainer Checklist
- [ ] closes #xxxx (Replace xxxx with the GitHub issue number)
- [ ] Made sure Checks passed
Codecov Report
Attention: 4 lines in your changes are missing coverage. Please review.
Comparison is base (819650a) 56.60% compared to head (05ff62c) 56.64%. Report is 3 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| embedchain/embedchain.py | 78.57% | 3 Missing :warning: |
| embedchain/vectordb/weaviate.py | 50.00% | 1 Missing :warning: |
Additional details and impacted files
| Coverage Diff | main | #1238 | +/- |
|---|---|---|---|
| Coverage | 56.60% | 56.64% | +0.03% |
| Files | 146 | 146 | |
| Lines | 5923 | 5937 | +14 |
| Hits | 3353 | 3363 | +10 |
| Misses | 2570 | 2574 | +4 |
Going to merge this PR for now and will do the improvements in a follow-up PR. Thanks, @michaelsharpe, for the contribution.