
Fixes to null data results and OpenAI embedding limits

Open parzival418 opened this issue 1 year ago • 1 comments

Description

I was trying to load in a large sitemap and hit a 403 from OpenAI. Turns out that there are limitations on the embeddings endpoints:

  • The input parameter may not take a list longer than 2048 elements (chunks of text).
  • The total number of tokens across all list elements of the input parameter cannot exceed 1,000,000. (Because the rate limit is 1,000,000 tokens per minute.)
  • Each individual array element (chunk of text) cannot be more than 8191 tokens.

Discussion can be found here: https://github.com/openai/openai-python/issues/519#issuecomment-1636921388

This PR addresses the first limit by splitting the documents into batches of at most 2048 elements, so each array sent to the embeddings endpoint stays within the cap.
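As a rough illustration of the batching idea described above (not the exact code in this PR), a helper along these lines would keep every request under the 2048-element cap. The names `batched` and `MAX_INPUTS_PER_REQUEST` are hypothetical and chosen only for this sketch; it does not address the token-based limits.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Taken from the limits listed above: at most 2048 inputs per embeddings request.
# The constant name is hypothetical, not an identifier from the project.
MAX_INPUTS_PER_REQUEST = 2048


def batched(items: Sequence[T], batch_size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Illustrative usage: 5000 chunks are split into three batches.
chunks = [f"chunk {i}" for i in range(5000)]
batch_sizes = [len(batch) for batch in batched(chunks)]
print(batch_sizes)
```

Each batch would then be embedded and stored in its own request; the per-request token total and the per-chunk token limit would still need separate handling.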

There is also a small fix for errors that occur when the Weaviate provider doesn't retrieve any results.
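A defensive pattern for the empty-result case might look like the sketch below. It assumes a nested response of the form `{"data": {"Get": {<class name>: [...]}}}`, which is an assumption about the response shape rather than something stated in this PR, and `extract_hits` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Optional


def extract_hits(response: Optional[Dict[str, Any]], class_name: str) -> List[Dict[str, Any]]:
    """Return the hits for `class_name`, or [] when any level is missing or None."""
    node: Any = response
    # Walk the assumed nesting: response["data"]["Get"][class_name].
    for key in ("data", "Get", class_name):
        if not isinstance(node, dict):
            return []
        node = node.get(key)
    # A missing or null result list is treated the same as "no results".
    return node if isinstance(node, list) else []
```

Callers can then iterate over the return value without checking for `None` first.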

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] Refactor (does not change functionality, e.g. code style improvements, linting)
  • [ ] Documentation update

How Has This Been Tested?

I was trying to load this sitemap:

https://zonos.com/sitemap.xml

Loading it fails under the current non-batching implementation because the request exceeds the OpenAI API limits. With this fix the sitemap loads cleanly. General usage and other operations have been running fine with the change so far.

Checklist:

  • [ ] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
  • [x] I have checked my code and corrected any misspellings

Maintainer Checklist

  • [ ] closes #xxxx (Replace xxxx with the GitHub issue number)
  • [ ] Made sure Checks passed

parzival418, Feb 04 '24 20:02

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (819650a) 56.60% compared to head (05ff62c) 56.64%. Report is 3 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| embedchain/embedchain.py | 78.57% | 3 Missing :warning: |
| embedchain/vectordb/weaviate.py | 50.00% | 1 Missing :warning: |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1238      +/-   ##
==========================================
+ Coverage   56.60%   56.64%   +0.03%     
==========================================
  Files         146      146              
  Lines        5923     5937      +14     
==========================================
+ Hits         3353     3363      +10     
- Misses       2570     2574       +4     


codecov[bot], Feb 06 '24 20:02

I'm going to merge this PR for now and will make the improvements in a follow-up PR. Thanks @michaelsharpe for the contribution.

deshraj, Feb 11 '24 23:02