[Bug]: RAPTOR and entity resolution errors
Is there an existing issue for the same bug?
- [x] I have checked the existing issues.
RAGFlow workspace code commit ID
9298acc full - Nightly Feb 18
RAGFlow image version
9298acc full
Other environment information
Running the nightly build from Feb 18 on Ubuntu
Actual behavior
Hi, I have several issues similar to what others have reported, but not quite the same. Issues with RAPTOR:

1) On this one doc with only 1 chunk, it always errors out:
```
15:10:00 Task has been received.
15:10:01 Page(1~2): OCR started
15:10:04 Page(1~2): OCR finished (3.11s)
15:10:05 Page(1~2): Layout analysis (0.86s)
15:10:05 Page(1~2): Table analysis (0.00s)
15:10:05 Page(1~2): Text merged (0.06s)
15:10:05 Page(1~2): Page 0~1: Text merging finished
15:10:05 Page(1~2): Generate 1 chunks
15:10:05 Page(1~2): Embedding chunks (0.22s)
15:10:05 Page(1~2): Done (0.04s)
15:10:08 Start RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval).
15:10:08 Task has been received.
15:10:08 [ERROR]Fail to bind LLM used by RAPTOR: 'NoneType' object is not subscriptable
15:10:08 [ERROR][Exception]: 'NoneType' object is not subscriptable
```
I can find the chunk in Elasticsearch.
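For anyone triaging this: that message is Python's generic error for indexing into a `None` value, which suggests some LLM/tenant lookup returned nothing before RAPTOR could bind its model. A minimal, purely hypothetical illustration of the failure mode (not RAGFlow's actual code):

```python
# Hypothetical sketch: the error appears when a lookup that can return
# None is indexed without a guard.
def bind_raptor_llm(llm_config):
    return llm_config["llm_name"]  # TypeError if llm_config is None

bind_raptor_llm(None)  # raises: 'NoneType' object is not subscriptable
```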
2) Another document processed fine; then I changed a setting on the file to enable entity resolution, re-ran it, and got an error:
```
16:01:12 Reused previous task's chunks.
16:01:17 Start RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval).
16:01:17 Task has been received.
16:01:18 [ERROR]Fail to bind LLM used by RAPTOR: 'NoneType' object is not subscriptable
16:01:18 [ERROR][Exception]: 'NoneType' object is not subscriptable
```
Then I turned that setting off, re-ran it, and RAPTOR worked fine again. This time it had to regenerate the chunks since it had errored before. Similarly, on another document I chose not to regenerate chunks and RAPTOR failed; then I regenerated them and RAPTOR worked.
I also see some errors connecting to Elasticsearch:
```
ESConnection.update got exception: BadRequestError(400, 'illegal_argument_exception', 'exceeded max allowed inline script size in bytes [65535] with size [213572] for script [ctx._source.content_with_weight='
```
Earlier I had errors about the number of scripts that can be compiled, and I increased that limit to 1000/1m.
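For reference, that per-minute limit corresponds to Elasticsearch's `script.max_compilations_rate` cluster setting; a sketch of the equivalent settings call in Python (host and credentials copied from the curl example later in this thread, so both are assumptions about your deployment):

```python
import requests

# Sketch: raise the script compilation rate limit to 1000 compilations
# per minute via the cluster settings API.
resp = requests.put(
    "http://127.0.0.1:1200/_cluster/settings",
    auth=("elastic", "infini_rag_flow"),
    json={"transient": {"script.max_compilations_rate": "1000/1m"}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```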
Could this be related to how many entities it found and is trying to resolve?
3) Tasks seem to get stuck at the very last step after entity resolution for a really long time, an hour or more for example:

```
18:57:39 Entities extraction progress ... 46/47 (8962 tokens)
18:57:39 Entities extraction progress ... 47/47 (9589 tokens)
```

and then it sometimes fails.
Thank you.
Expected behavior
No response
Steps to reproduce
As described above: parse documents with RAPTOR and entity resolution enabled.
Additional information
No response
About `15:10:08 [ERROR][Exception]: 'NoneType' object is not subscriptable`: do you have a backend error log?
```
docker logs -f ragflow-server
```
For the `illegal_argument_exception`:
```
curl -X POST -u elastic:infini_rag_flow -H 'Content-Type: application/json' http://127.0.0.1:1200/_cluster/settings -d '{"transient":{"script.max_size_in_bytes": 10000000}}'
```
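To confirm the transient setting took effect, you can read the cluster settings back; a small sketch using the same host and credentials:

```python
import requests

# Sketch: read back the transient script settings. flat_settings and
# filter_path are standard Elasticsearch query parameters.
resp = requests.get(
    "http://127.0.0.1:1200/_cluster/settings",
    auth=("elastic", "infini_rag_flow"),
    params={"flat_settings": "true", "filter_path": "transient.script*"},
    timeout=10,
)
print(resp.json())  # expect {"transient": {"script.max_size_in_bytes": "10000000"}}
```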
1. For `15:10:08 [ERROR][Exception]: 'NoneType' object is not subscriptable`:
```
functools.partial(<function set_progress at 0x794cf594b7f0>, '47c8a1b2f08711ef817e0242ac120007', 100000000, 100000000)
```
```python
if row["pagerank"]:
    doc[PAGERANK_FLD] = int(row["pagerank"])
res = []
tk_count = 0
for content, vctr in chunks[original_length:]:
    d = copy.deepcopy(doc)
    d["id"] = xxhash.xxh64((content + str(d["doc_id"])).encode("utf-8")).hexdigest()
    d["create_time"] = str(datetime.now()).replace("T", " ")[:19]
    d["create_timestamp_flt"] = datetime.now().timestamp()
    d[vctr_nm] = vctr.tolist()
```
2. For `'NoneType' object has no attribute 'strip'`:
```
handle_task got exception for task {"id": "ff21e0a6f16f11efa8db0242ac120007", "doc_id": "ca0f4d84ebe511ef801f0242ac120007", "from_page": 0, "to_page": 100000000, "retry_count": 0, "kb_id": "283fb782ebe511efb7e50242ac120007", "parser_id": "naive", "parser_config": {"auto_keywords": 2, "auto_questions": 0, "raptor": {"use_raptor": true, "prompt": "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n {cluster_content}\nThe above is the content you need to summarize.", "max_token": 256, "threshold": 0.1, "max_cluster": 64, "random_seed": 0}, "graphrag": {"use_graphrag": true, "entity_types": ["organization", "person", "geo", "event", "category", "procedure", "drug", "sample", "biological sample", "organ", "chemical", "document", "process", "regulatory body", "consumables", "medication", "equipment"], "method": "light", "resolution": true}, "chunk_token_num": 128, "delimiter": "\n!?;\u3002\uff1b\uff01\uff1f", "pages": []}, "name": "Secti...
```
From `rag/llm/chat_model.py` in `chat` at line 50:

```python
try:
    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=history,
        **gen_conf)
    ans = response.choices[0].message.content.strip()
    if response.choices[0].finish_reason == "length":
        if is_chinese(ans):
            ans += LENGTH_NOTIFICATION_CN
        else:
            ans += LENGTH_NOTIFICATION_EN
```
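That `.strip()` call looks like the crash site: some OpenAI-compatible backends return `message.content` as `None` (for example, on tool-call-only or filtered responses), and calling `.strip()` on it raises exactly this error. A defensive sketch, not necessarily the fix the project will ship:

```python
def first_choice_text(response) -> str:
    """Return the first choice's text content, guarding against None.

    Hypothetical helper: some OpenAI-compatible servers set
    message.content to None, which is what makes .strip() fail above.
    """
    content = response.choices[0].message.content
    if content is None:
        raise ValueError("LLM response contained no text content")
    return content.strip()
```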
3. Increasing the script size helped with that error, thank you, but now I'm getting another one:
```
ESConnection.update got exception: BadRequestError(400, 'script_exception', 'compile error')
ctx._source.content_with_weight='{ "directed": false, "multigraph": false, "graph": {}, "nodes": [ { "entity_type": "ORGANIZATION", "rank": 56, "pagerank": 0.007729082736942685, "id": ....
```
I tried increasing the size even more; I'm not sure that's the solution.
Increasing the script size didn't help; I'm still getting:

```
ESConnection.update got exception: BadRequestError(400, 'script_exception', 'compile error')
```
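One possible explanation, offered as a guess: the update inlines the whole graph JSON into the Painless script source, and a script of that size can fail to compile regardless of the size limit. Partial-document updates avoid script compilation entirely; a sketch with the official Python client (index name and document id are placeholders, not values from this thread):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "http://127.0.0.1:1200",
    basic_auth=("elastic", "infini_rag_flow"),  # assumption: compose defaults
)

# Sketch: send the large field as a partial document instead of an inline
# Painless script, so nothing needs compiling whatever the payload size.
es.update(
    index="ragflow_chunks",   # placeholder index name
    id="some-chunk-id",       # placeholder document id
    doc={"content_with_weight": "<large graph JSON here>"},
)
```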
Thanks.
The latest update seems to have fixed most of these issues.
I still have issues with really long processing of documents; tasks seem to get stuck at the last step.