Addition of CSV files throwing error
🐛 Describe the bug
I have tried adding all the data sources for my project, which work fine. But adding a CSV file stored locally throws an InvalidRequestError. Is there some issue with my syntax, or does this need a fix?
company_strategy_bot.add('/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv', data_type="csv")
Error in console
File "/Users/shashwat/POCs/POCEmbedchain/strategy_automation_embedchain.py", line 25, in fetch_company_details
company_strategy_bot.add('/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv', data_type="csv")
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/embedchain/embedchain.py", line 179, in add
documents, _metadatas, _ids, new_chunks = self.load_and_embed(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/embedchain/embedchain.py", line 301, in load_and_embed
self.db.add(documents=documents, metadatas=metadatas, ids=ids)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/embedchain/vectordb/chroma_db.py", line 119, in add
self.collection.add(documents=documents, metadatas=metadatas, ids=ids)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 95, in add
ids, embeddings, metadatas, documents = self._validate_embedding_set(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/chromadb/api/models/Collection.py", line 386, in _validate_embedding_set
embeddings = self._embedding_function(documents)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/chromadb/utils/embedding_functions.py", line 131, in __call__
embeddings = self._client.create(input=texts, engine=self._model_name)["data"]
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_resources/embedding.py", line 33, in create
response = super().create(*args, **kwargs)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/Users/shashwat/POCs/POCEmbedchain/.venv/lib/python3.10/site-packages/openai/api_requestor.py", line 765, in _interpret_response_line
raise self.handle_error_response(
openai.error.InvalidRequestError: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.````
Please use dry_run=True in the add method and let us know what it says
TypeError: EmbedChain.add() got an unexpected keyword argument 'dry_run'
Getting the above error with the command below:
company_strategy_bot.add('/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv', data_type="csv", dry_run=True)
Sorry @harsh15793, I forgot this isn't in the main branch yet. Sorry about that.
Have you tried a shorter file? Could it be that you're hitting a limit? Can you share the file? I can't reproduce the issue as it is. Thanks.
Thanks a ton, @cachho. Yes, I think the file size was the problem. Limiting the file data removes the error and the code works fine. The error thrown on the full file was quite misleading; it looked like the input data was invalid. Is there any documented limit on the file size, so that we can handle it before the code runs?
@harsh15793 this limit is set by the LLM, and it also depends on the model used, so it's hard to gatekeep on our end. We should at least add a note to the docs though.
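If you want a rough check on your side before calling add, something like the sketch below can give a ballpark. This is not an Embedchain API; it assumes the default OpenAI text-embedding-ada-002 embedder (documented per-input limit of 8,191 tokens) and counts tokens with tiktoken, so treat it as an estimate only:

```python
# Rough pre-flight token count (illustrative sketch, not an Embedchain API).
# Assumes the default OpenAI text-embedding-ada-002 embedder, whose documented
# per-input limit is 8,191 tokens; adjust the model/limit for your setup.
import tiktoken


def count_tokens(text: str, model: str = "text-embedding-ada-002") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


with open("/Users/shashwat/POCs/POCEmbedchain/NikeReports/Nike.csv") as f:
    n_tokens = count_tokens(f.read())

print(n_tokens, "tokens")
if n_tokens > 8191:
    print("Likely over the embedding model's input limit; split the file first.")
```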
I have the same problem and don't know the solution yet. If there is no solution, then Embedchain is not the best option for handling large files
As I said in a previous reply, this is a limit set by the LLM. The error is returned by the OpenAI endpoint, not Embedchain.
It's hard to combat this, since your document is different from the next guy's, and each LLM is different.
#429 would certainly help with this, because then each row can be embedded individually.
I'll give you some sample code to chunk your CSV.
import csv
import os


def csv_chunker(input_file, output_folder, rows_per_chunk=1000):
    """
    Split a CSV file into smaller chunks.

    :param input_file: Path to the CSV file to split.
    :param output_folder: Folder where to save the smaller chunks.
    :param rows_per_chunk: Number of rows in each chunk.
    """
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    with open(input_file, 'r') as csv_file:
        reader = csv.reader(csv_file)
        headers = next(reader)

        file_num = 1
        current_rows = []

        for row in reader:
            current_rows.append(row)

            if len(current_rows) == rows_per_chunk:
                output_path = os.path.join(output_folder, f'chunk_{file_num}.csv')
                with open(output_path, 'w', newline='') as output_csv:
                    writer = csv.writer(output_csv)
                    writer.writerow(headers)
                    writer.writerows(current_rows)
                file_num += 1
                current_rows = []

        # Save the last chunk if any rows are left
        if current_rows:
            output_path = os.path.join(output_folder, f'chunk_{file_num}.csv')
            with open(output_path, 'w', newline='') as output_csv:
                writer = csv.writer(output_csv)
                writer.writerow(headers)
                writer.writerows(current_rows)


if __name__ == '__main__':
    csv_chunker('path_to_large_csv.csv', 'out', 500)  # Change the path and rows_per_chunk as needed
Then you can iterate through the out folder and try to add each file; if you still get the error, reduce rows_per_chunk further.
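A minimal loop for that (assuming the out folder produced by the chunker above and the company_strategy_bot app from your snippet) could look like this:

```python
import os

# Add every chunked CSV from the output folder to the existing Embedchain app.
chunk_folder = 'out'
for filename in sorted(os.listdir(chunk_folder)):
    if filename.endswith('.csv'):
        chunk_path = os.path.join(chunk_folder, filename)
        company_strategy_bot.add(chunk_path, data_type="csv")
```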
Closing this as a solution has been given.