
How to fix "ValueError: A single term is larger than the allowed chunk size."

Open yirenkeji555 opened this issue 2 years ago • 25 comments

Error msg: ValueError: A single term is larger than the allowed chunk size. Term size: 510. Chunk size: 379. Effective chunk size: 379


yirenkeji555 avatar Feb 16 '23 02:02 yirenkeji555

@yirenkeji555 what version are you on?

jerryjliu avatar Feb 16 '23 06:02 jerryjliu

I just cloned the master branch code last night, so it should be the latest version of the code.

yirenkeji555 avatar Feb 16 '23 07:02 yirenkeji555

Got it; would you be able to send me some sample code to reproduce?

jerryjliu avatar Feb 17 '23 06:02 jerryjliu

OK, I just replaced the file content in "examples/test_wiki/data/nyc_text.txt" with new content.

The attached file is my new content.

nyc_text.txt

yirenkeji555 avatar Feb 17 '23 06:02 yirenkeji555

Comment " request wiki content write to nyc_text.txt" code when you run "examples/test_wiki/TestNYC.ipynb".

yirenkeji555 avatar Feb 17 '23 06:02 yirenkeji555

hi @yirenkeji555, apologies for the delay. taking a look now

jerryjliu avatar Mar 06 '23 22:03 jerryjliu

I have upgraded llama_index to version 0.4.21 and I am still experiencing some issues. Additionally, I was wondering if you could provide me with some guidance on which method would be best suited to load the documents?

MZhao-ouo avatar Mar 07 '23 04:03 MZhao-ouo

Same issue here, with SimpleDirectoryReader and GPTTreeIndex.

liyuankui avatar Mar 07 '23 05:03 liyuankui

It might be a Chinese-only issue. I replaced all 。 with . and now it works. One possible reason is that Chinese is NOT separated by spaces like English or other Latin languages.

liyuankui avatar Mar 07 '23 06:03 liyuankui

This should be fixed with the upcoming release! I took a stab at it in #628 and #629 - we hard-force splits if the default separator (a space) doesn't work.
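For anyone curious, the approach is roughly the following (a simplified sketch of the idea, measured in characters rather than tokens, and not the actual code from those PRs):

def force_split(text, chunk_size):
    """Split on spaces first; hard-cut any piece that is still too long."""
    chunks = []
    for piece in text.split(" "):  # default separator: a single space
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # no space inside the piece, so cut it at fixed offsets
            chunks.extend(piece[i:i + chunk_size] for i in range(0, len(piece), chunk_size))
    return chunks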

jerryjliu avatar Mar 07 '23 06:03 jerryjliu

though of course, @yirenkeji555, if you have sample data for me to repro, I can verify

jerryjliu avatar Mar 07 '23 06:03 jerryjliu

Same problem in 0.4.21

e0v1a avatar Mar 07 '23 09:03 e0v1a

though of course, @yirenkeji555, if you have sample data for me to repro, I can verify

same issue on 0.4.24. here is the sample data https://cdn.discordapp.com/attachments/899657146143232011/1083317634382180362/Bookandswordenmityrecord.txt

chun1617 avatar Mar 09 '23 09:03 chun1617

@chun1617 The link shows access denied.

blackstorm avatar Mar 10 '23 03:03 blackstorm

same issue on 0.4.24. here is the sample data

https://ethbill.s3.ap-southeast-1.amazonaws.com/1.txt @jerryjliu

hackhy avatar Mar 11 '23 09:03 hackhy

same issue on 0.4.26, here is the sample data temp.txt @jerryjliu

madawei2699 avatar Mar 12 '23 08:03 madawei2699

Hi @madawei2699, thanks for sharing! Will take a look at a fix very soon

jerryjliu avatar Mar 12 '23 17:03 jerryjliu

same issue on 0.4.28. here is the sample data:

https://mp.weixin.qq.com/s/eAP2_TUyoF9uVjUb6cUWYw

Assazm avatar Mar 15 '23 06:03 Assazm

same issue on 0.4.31

chrischjh avatar Mar 19 '23 06:03 chrischjh

Hey all,

sorry for the delay on this. Will take a look at the sample data some of you have provided and fix it soon. In the meantime, you can try manually setting the chunk size through chunk_size_limit when building the index.
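For example, with the 0.4.x-era API used elsewhere in this thread (adjust to your version; the "data" path is just a placeholder):

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # placeholder path to your documents

# Raising chunk_size_limit is a workaround when a single unsplittable
# term (e.g. a long run of Chinese text with no spaces) exceeds the
# default chunk size.
index = GPTSimpleVectorIndex(documents, chunk_size_limit=2000)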

jerryjliu avatar Mar 19 '23 07:03 jerryjliu

Hi, is there any progress here?

gzrjzcx avatar Mar 20 '23 14:03 gzrjzcx

Any update for this? 🙏

chrischjh avatar Mar 23 '23 14:03 chrischjh

Currently, the encoding method used by globals_helper.tokenizer will result in a much higher token count when calculating the number of tokens in Chinese text. For example:

import tiktoken

text = '唐高宗仪凤二年春天,六祖大师从广州法性寺来到曹溪南华山宝林寺,\
        韶州刺史韦璩和他的部属入山礼请六祖到城裡的大梵寺讲堂,为大众广开佛法因缘,演说法要。\
        六祖登坛陞座时,闻法的人有韦刺史和他的部属三十多人,以及当时学术界的领袖、学者等三十多人,\
        暨僧、尼、道、俗一千馀人,同时向六祖大师礼座,希望听闻佛法要义。'

encoding = tiktoken.encoding_for_model('gpt-3.5-turbo')
print(len(encoding.encode(text)))
# 218

encoding = tiktoken.get_encoding("gpt2")
print(len(encoding.encode(text)))
# 342

In the case of mixed Chinese, English, and various symbols, the discrepancy between the two counts can reach several times. So, when I changed the encoding method and reduced the size of each chunk, the error was gone.
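A rough way to check whether your data will hit this error is to measure the longest whitespace-separated piece with the same gpt2 encoding (a diagnostic sketch only, not library code; the filename is a placeholder):

import tiktoken

encoding = tiktoken.get_encoding("gpt2")

with open("your_file.txt", encoding="utf-8") as f:  # placeholder filename
    text = f.read()

# The splitter separates on whitespace, so a long run of Chinese text with
# no spaces counts as a single "term"; if its token count exceeds the
# effective chunk size, you get the ValueError above.
longest = max(text.split(), key=lambda piece: len(encoding.encode(piece)))
print("largest term size (tokens):", len(encoding.encode(longest)))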

suling-fun avatar Mar 24 '23 16:03 suling-fun

I have the same problem:

ValueError: A single term is larger than the allowed chunk size.
Term size: 673
Chunk size: 500
Effective chunk size: 500

The source code is as follows, and the data is in the attachment:

# coding=utf8

from llama_index import GPTSimpleVectorIndex, Document
import os


def main():
    os.environ['OPENAI_API_KEY'] = 'sk-XX'
    os.environ['CURL_CA_BUNDLE'] = ''

    # Loading from strings, assuming you saved your data to strings text1, text2, ...
    with open(file="wiki_kuangbiao.txt", mode="r", encoding="utf-8") as f:
        text_wiki_kuangbiao = f.read()

    text_list = [text_wiki_kuangbiao, ]
    documents = [Document(t) for t in text_list]

    # Construct a simple vector index
    index = GPTSimpleVectorIndex(documents, chunk_size_limit=500)

    # Save your index to a index.json file
    index.save_to_disk('index_wiki_kuangbiao.json')

    # Load the index from your saved index.json file
    index = GPTSimpleVectorIndex.load_from_disk('index_wiki_kuangbiao.json')

    # Querying the index
    response = index.query("《狂飙》男一号是?")

    print(response)


if __name__ == '__main__':
    main()

wiki_kuangbiao.txt

yaohongfenglove avatar Mar 25 '23 04:03 yaohongfenglove

I have to set chunk_size_limit to about 2000 to avoid this error. Any better solutions?

snipersteve avatar Mar 26 '23 12:03 snipersteve

same problem

neove avatar Mar 29 '23 00:03 neove

I got the same issue and fixed it this way. It works fine for me.

import re

from llama_index import SimpleDirectoryReader

MAX_TEXT_INLINE = 100

def trim_text(text):
    """
    Trim text: collapse runs of whitespace
    @param text:
    @return:
    """
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\n+', '\n', text)

    return text


def limit_line_length(text):
    """
    Limit line length to MAX_TEXT_INLINE characters
    @param text:
    @return:
    """
    lines = []
    for line in text.split('\n'):
        chunks = [line[i:i+MAX_TEXT_INLINE] for i in range(0, len(line), MAX_TEXT_INLINE)]
        lines.extend(chunks)
    return '\n'.join(lines)

########################

documents = SimpleDirectoryReader(folder_path).load_data()  # folder_path points at your data directory
for document in documents:
    document.text = limit_line_length(trim_text(document.text))

loi-nguyen-khanh avatar Apr 05 '23 07:04 loi-nguyen-khanh

The problem is that the default separator for splitting text in TokenTextSplitter is " ". But in languages like Chinese, there can be long text chunks without any whitespace. So the easiest fix is to increase chunk_size_limit, or you can roll your own text preprocessor, for example:

for doc in documents:
    # replace all Chinese periods to add whitespaces
    doc.text = doc.text.replace("。", ". ")
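A slightly more thorough version of the same idea (my own sketch, covering a few more common Chinese sentence-ending marks):

import re

# Add a space after common Chinese sentence-ending punctuation so the
# default " " separator has somewhere to split. This is a preprocessing
# workaround applied to your own documents, not a change to llama_index.
CJK_SENTENCE_END = re.compile(r"([。！？；])")

for doc in documents:
    doc.text = CJK_SENTENCE_END.sub(r"\1 ", doc.text)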

zijie0 avatar Apr 20 '23 01:04 zijie0

Is there any update on this one, please? Is it related to pull request #1354?

justinzyw avatar May 18 '23 05:05 justinzyw

Yao Hongfeng has received your message, thank you.

yaohongfenglove avatar May 18 '23 05:05 yaohongfenglove