Addition of CSV files throwing error
🐛 Describe the bug
I have tried adding all the data sources for my project, which work fine. But adding a CSV file stored locally throws an InvalidRequestError. Is there some issue with my syntax, or does this need a fix?
company_strategy_bot.add('/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv', data_type="csv")
Error in console
File "/Users/shashwat/POCs/POCEmbedchain/strategy_automation_embedchain.py", line 25, in fetch_company_details
company_strategy_bot.add('/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv', data_type="csv")
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/embedchain/embedchain.py", line 179, in add
documents, _metadatas, _ids, new_chunks = self.load_and_embed(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/embedchain/embedchain.py", line 301, in load_and_embed
self.db.add(documents=documents, metadatas=metadatas, ids=ids)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/embedchain/vectordb/chroma_db.py", line 119, in add
self.collection.add(documents=documents, metadatas=metadatas, ids=ids)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 95, in add
ids, embeddings, metadatas, documents = self._validate_embedding_set(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 386, in _validate_embedding_set
embeddings = self._embedding_function(documents)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/chromadb/utils/embedding_functions.py", line 131, in __call__
embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_resources/embedding.py", line 33, in create
response = super().create(*args, **kwargs)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_requestor.py", line 765, in _interpret_response_line
raise self.handle_error_response(
openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.````
Please use dry_run=True in the add method and let us know what it says
TypeError: EmbedChain.add() got an unexpected keyword argument 'dry_run'
Getting the above error with the command below:
company_strategy_bot.add('/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv', data_type="csv", dry_run=True)
Sorry @harsh15793, I forgot this isn't in the main branch yet. Sorry about that.
Have you tried a shorter file? Could it be that you're hitting a limit? Can you share the file? I can't reproduce the issue as it is. Thanks.
Thanks a ton, @cachho. Yes, I think the file size was the problem. Limiting the file data removes the error and the code works fine. The error thrown on the full file was quite misleading; it looked like the input data was invalid. Is there any documented limit on the file size, so that we can handle it before the code runs?
@harsh15793 this limit is set by the LLM, and it also depends on the model used, so it's hard to gatekeep on our end. We should at least add a note to the docs though.
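If you want a rough check on your side before calling add, something like the sketch below can give a ballpark. This is not an Embedchain API; it assumes the default OpenAI text-embedding-ada-002 embedder (documented per-input limit of 8,191 tokens) and counts tokens with tiktoken, so treat it as an estimate only:

```python
# Rough pre-flight token count (illustrative sketch, not an Embedchain API).
# Assumes the default OpenAI text-embedding-ada-002 embedder, whose documented
# per-input limit is 8,191 tokens; adjust the model/limit for your setup.
import tiktoken


def count_tokens(text: str, model: str = "text-embedding-ada-002") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


with open("/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv") as f:
    n_tokens = count_tokens(f.read())

print(n_tokens, "tokens")
if n_tokens > 8191:
    print("Likely over the embedding model's input limit; split the file first.")
```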
I have the same problem and don't know the solution yet. If there is no solution, then Embedchain is not the best option for handling large files
As I said in a previous reply, this is a limit set by the LLM. The error is returned by the OpenAI endpoint, not Embedchain.
It's hard to combat this, since your document is different from the next guy's, and each LLM is different.
#429 would certainly help with this, because then each row can be embedded individually.
I'll give you some sample code to chunk your CSV.
import csv
import os


def csv_chunker(input_file, output_folder, rows_per_chunk=1000):
    """
    Split a CSV file into smaller chunks.

    :param input_file: Path to the CSV file to split.
    :param output_folder: Folder where to save the smaller chunks.
    :param rows_per_chunk: Number of rows in each chunk.
    """
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    with open(input_file, 'r') as csv_file:
        reader = csv.reader(csv_file)
        headers = next(reader)

        file_num = 1
        current_rows = []

        for row in reader:
            current_rows.append(row)

            if len(current_rows) == rows_per_chunk:
                output_path = os.path.join(output_folder, f'chunk_{file_num}.csv')
                with open(output_path, 'w', newline='') as output_csv:
                    writer = csv.writer(output_csv)
                    writer.writerow(headers)
                    writer.writerows(current_rows)
                file_num += 1
                current_rows = []

        # Save the last chunk if any rows are left
        if current_rows:
            output_path = os.path.join(output_folder, f'chunk_{file_num}.csv')
            with open(output_path, 'w', newline='') as output_csv:
                writer = csv.writer(output_csv)
                writer.writerow(headers)
                writer.writerows(current_rows)


if __name__ == '__main__':
    csv_chunker('path_to_large_csv.csv', 'out', 500)  # Change the path and rows_per_chunk as needed
Then you can iterate through the out folder and try to add each file; if you still get the error, reduce rows_per_chunk further.
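A minimal loop for that (assuming the out folder produced by the chunker above and the company_strategy_bot app from your snippet) could look like this:

```python
import os

# Add every chunked CSV from the output folder to the existing Embedchain app.
chunk_folder = 'out'
for filename in sorted(os.listdir(chunk_folder)):
    if filename.endswith('.csv'):
        chunk_path = os.path.join(chunk_folder, filename)
        company_strategy_bot.add(chunk_path, data_type="csv")
```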
Closing this as a solution has been given.