AssertionError: The batch size should not be larger than 2048.
Using the following code to load emails exported from Outlook into a single CSV file, I get the error below.
```python
import os
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

def read_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

os.environ["OPENAI_API_KEY"] = read_file('openaiapikey.txt')

if os.path.exists("data/emailindex.json"):
    # load from disk
    index = GPTSimpleVectorIndex.load_from_disk('data/emailindex.json')
else:
    documents = SimpleDirectoryReader('data/input').load_data()
    index = GPTSimpleVectorIndex(documents)
    # save to disk
    index.save_to_disk('data/emailindex.json')

while True:
    prompt = input("Prompt: ")
    response = index.query(prompt)
    print(response)
```
```
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```
Hi, this should be using text-embedding-ada-002, right? (The batch size limit should be 8k tokens.)
In any case, try setting `chunk_size_limit` to a smaller value when you build the index: `index = GPTSimpleVectorIndex(docs, ..., chunk_size_limit=512)`
Thank you, @jerryjliu.
I made that change (`index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)`)
and got the following error.
```
PS C:\Users\paul.eden\Code\llm-emails> python .\emailchat.py
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```
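For context, the assertion in `get_embeddings` fires when more than 2048 texts are passed to a single embeddings call. A minimal sketch (plain Python, no llama_index dependency; the 2048 figure is taken from the traceback above, and `batched` is a hypothetical helper, not a library function) of the kind of batching that would avoid the limit:

```python
def batched(items, batch_size=2048):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: 5000 text chunks would be sent as three requests.
chunks = [f"text-{i}" for i in range(5000)]
batches = list(batched(chunks))
print([len(b) for b in batches])  # [2048, 2048, 904]
```

Each batch would then be sent as a separate embeddings request, rather than the whole list at once.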
@jerryjliu @pauldeden I have tried this on my document as well, and it gave the same `assert len(list_of_text) <= 2048` error. I tried the text-davinci-001 model, since the comment above suggested the error might be caused by text-embedding-ada-002. When I reduced the document size, deleting 70% of the rows and keeping just 30% (about 11k rows), I was able to build the index. I am using the following code. Requesting you to kindly have a look so we can train on a large corpus of data.

```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper
from langchain import OpenAI

def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-001", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()
    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper, chunk_size_limit=chunk_size_limit)

    index.save_to_disk('index_davinci.json')
    return index
```
Having the same problem here, any solutions yet?
+1 on the problem.
Hi @pauldeden @maccarini @satpalsr, thanks for raising this. Going to look into it a bit more today!
I'm getting a similar error when I'm inserting large CSV files. Is there a theoretical limit to the size of a single file?
I encountered the same issue when dealing with a large txt file.
I got around the above issue by breaking the files down to approx. 4 MB or smaller. I had several hundred megabytes of CSV to feed the model (historical email exports) that I had to break down into very small chunks.
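The workaround above can be sketched roughly as follows. This is a hypothetical splitter, not part of llama_index: it assumes one record per line, breaks only on line boundaries so records stay intact, and uses the ~4 MB threshold mentioned above as its default.

```python
import os

def split_file(path, max_bytes=4 * 1024 * 1024, out_dir="data/input"):
    """Split a large text/CSV file into pieces of at most max_bytes,
    breaking only on line boundaries. Returns the number of parts written."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    part, size, out = 0, 0, None
    with open(path, "r", encoding="utf-8") as infile:
        for line in infile:
            line_bytes = len(line.encode("utf-8"))
            # start a new part when this line would push us over the limit
            if out is None or size + line_bytes > max_bytes:
                if out:
                    out.close()
                part += 1
                out = open(os.path.join(out_dir, f"{base}_{part}.csv"),
                           "w", encoding="utf-8")
                size = 0
            out.write(line)
            size += line_bytes
    if out:
        out.close()
    return part
```

Note that this does not replicate the CSV header row into each part; if the reader needs headers, that would have to be added.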
> I got around the above issue by breaking down the files to approx 4mb or smaller. I had several hundred megabytes of CSV to feed the model (historical email exports) that I had to break down into very small chunks.

Thanks. This worked for me. My original file is about 4 MB; I had to split it into 1 MB files to get it to work.
Closing this issue for now, as it should be fixed in newer versions of llama_index.