
AssertionError: The batch size should not be larger than 2048.

Open pauldeden opened this issue 1 year ago • 10 comments

Using the following code to load emails exported from Outlook into a single CSV file, I get the error below.

```python
import os
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

def read_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()

os.environ["OPENAI_API_KEY"] = read_file('openaiapikey.txt')

if os.path.exists("data/emailindex.json"):
    # load from disk
    index = GPTSimpleVectorIndex.load_from_disk('data/emailindex.json')
else:
    documents = SimpleDirectoryReader('data/input').load_data()
    index = GPTSimpleVectorIndex(documents)
    # save to disk
    index.save_to_disk('data/emailindex.json')

while True:
    prompt = input("Prompt: ")
    response = index.query(prompt)
    print(response)
```

```
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```

pauldeden avatar Feb 22 '23 20:02 pauldeden

Hi, this should be using text-embedding-ada-002 right? (the batch size should be 8k tokens)

In any case, try setting `chunk_size_limit` to a smaller value when you build the index: `index = GPTSimpleVectorIndex(docs, ..., chunk_size_limit=512)`
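
For example, a minimal sketch using the same pre-0.5 API as the snippet above (`data/input` is just the example path from the original script):

```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data/input').load_data()
# Smaller chunks keep each embedded node (and the resulting embedding batches) smaller.
index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)
index.save_to_disk('data/emailindex.json')
```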

jerryjliu avatar Feb 24 '23 22:02 jerryjliu

Thank you, @jerryjliu.

I made that change (`index = GPTSimpleVectorIndex(documents, chunk_size_limit=512)`) and got the following error.

```
PS C:\Users\paul.eden\Code\llm-emails> python .\emailchat.py
Traceback (most recent call last):
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\tenacity\__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "C:\Users\paul.eden\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\llama_index\embeddings\openai.py", line 123, in get_embeddings
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
AssertionError: The batch size should not be larger than 2048.
```

pauldeden avatar Feb 25 '23 06:02 pauldeden

@jerryjliu @pauldeden I tried this on my document as well and got the same `assert len(list_of_text) <= 2048` error. I was using the text-davinci-001 model, since the comment above suggested the error might be related to text-embedding-ada-002. When I reduced the document by deleting 70% of the rows and keeping only 30% (about 11k rows), I was able to build the index. I am using the following code; please take a look so we can index a large corpus of data.

```python
def construct_index(directory_path):
    # set maximum input size
    max_input_size = 4096
    # set number of output tokens
    num_outputs = 256
    # set maximum chunk overlap
    max_chunk_overlap = 20
    # set chunk size limit
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    # define LLM
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-001", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper, chunk_size_limit=512)
    index.save_to_disk('index_davinci.json')
    return index
```

bpkapkar avatar Feb 27 '23 18:02 bpkapkar

Having the same problem here, any solutions yet?

maccarini avatar Mar 03 '23 03:03 maccarini

+1 on the problem.

satpalsr avatar Mar 05 '23 05:03 satpalsr

Hi @pauldeden @maccarini @satpalsr, thanks for raising this. Going to look into it a bit more today!

jerryjliu avatar Mar 06 '23 21:03 jerryjliu

I'm getting a similar error when I'm inserting large CSV files. Is there a theoretical limit to the size of a single file?

playztag avatar Mar 12 '23 01:03 playztag

I encountered the same issue when dealing with a large .txt file.

Brightchu avatar Mar 14 '23 08:03 Brightchu

I got around the above issue by breaking the files down to approximately 4 MB or smaller. I had several hundred megabytes of CSV (historical email exports) to feed the model, which I had to break down into very small chunks.
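
For anyone who wants to script that step, here is a rough sketch of the kind of splitter I mean (the 4 MB cutoff, paths, and helper name are illustrative, not part of llama_index):

```python
import os

def split_csv(path, out_dir, max_bytes=4 * 1024 * 1024):
    """Split a large CSV into smaller parts, each roughly max_bytes or less."""
    # Note: this splits on raw lines, so it assumes no quoted field contains a newline.
    os.makedirs(out_dir, exist_ok=True)
    with open(path, 'r', encoding='utf-8') as infile:
        header = infile.readline()
        outfile, part, size = None, 0, 0
        for line in infile:
            if outfile is None or size >= max_bytes:
                if outfile is not None:
                    outfile.close()
                part += 1
                size = 0
                outfile = open(os.path.join(out_dir, f"part_{part:03d}.csv"), 'w', encoding='utf-8')
                outfile.write(header)  # repeat the header in every part
            outfile.write(line)
            size += len(line.encode('utf-8'))
        if outfile is not None:
            outfile.close()

# e.g. split the exported Outlook CSV before pointing SimpleDirectoryReader at out_dir
split_csv('data/emails.csv', 'data/input')
```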

playztag avatar Mar 15 '23 00:03 playztag

> I got around the above issue by breaking the files down to approximately 4 MB or smaller. I had several hundred megabytes of CSV (historical email exports) to feed the model, which I had to break down into very small chunks.

Thanks. This worked for me. My original file is about 4 MB; I had to split it into 1 MB files to get it to work.

81jpayne avatar Mar 19 '23 06:03 81jpayne

Closing this issue for now, as it should be fixed in newer versions of llama_index.

logan-markewich avatar Jun 02 '23 18:06 logan-markewich