[Bug]: ERROR:root:Fail to bind LLM used by Knowledge Graph: Error tokenizing 'Y OF TRANSPORT OF THE PEOPLE'S REPUBLIC OF CHINA'
Is there an existing issue for the same bug?
- [x] I have checked the existing issues.
RAGFlow workspace code commit ID
null
RAGFlow image version
v0.16.0-196-g3b30799b slim
Other environment information
docker debian12
Actual behavior
ERROR:root:Fail to bind LLM used by Knowledge Graph: Error tokenizing 'Y OF TRANSPORT OF THE PEOPLE'S REPUBLIC OF CHINA'
Traceback (most recent call last):
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 807, in tokenize
self._scan()
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 832, in _scan
self._scan_keywords()
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 944, in _scan_keywords
if self._scan_string(word):
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 1063, in _scan_string
self._advance(len(quote))
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 858, in _advance
self._char = self.sql[self._current - 1]
IndexError: string index out of range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/ragflow/rag/svr/task_executor.py", line 548, in do_handle_task
run_graphrag(task, chat_model, task_language, embedding_model, progress_callback)
File "/ragflow/rag/svr/task_executor.py", line 473, in run_graphrag
Dealer(LightKGExt if row["parser_config"]["graphrag"]["method"] != 'general' else GeneralKGExt,
File "/ragflow/graphrag/general/index.py", line 54, in __init__
ents, rels = ext(chunks, callback)
File "/ragflow/graphrag/general/extractor.py", line 132, in __call__
n = t.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/ragflow/graphrag/general/extractor.py", line 162, in _merge_nodes
already_node = self._get_entity_(entity_name)
File "/ragflow/graphrag/utils.py", line 246, in get_entity
es_res = settings.retrievaler.search(conds, search.index_name(tenant_id), [kb_id])
File "<@beartype(rag.nlp.search.Dealer.search) at 0x7f922d20be20>", line 95, in search
File "/ragflow/rag/nlp/search.py", line 95, in search
res = self.dataStore.search(src, [], filters, [], orderBy, offset, limit, idx_names, kb_ids)
File "/ragflow/rag/utils/infinity_conn.py", line 399, in search
builder.filter(filter_cond)
File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/table.py", line 390, in filter
self.query_builder.filter(filter)
File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/query_builder.py", line 329, in filter
where_expr = traverse_conditions(condition(where))
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4654, in condition
return maybe_parse( # type: ignore
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4284, in maybe_parse
return sqlglot.parse_one(sql, read=dialect, into=into, **opts)
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/__init__.py", line 148, in parse_one
result = dialect.parse_into(into, sql, **opts)
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 168, in parse_into
return self.parser(**opts).parse_into(expression_type, self.tokenize(sql), sql)
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 177, in tokenize
return self.tokenizer.tokenize(sql)
File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 814, in tokenize
raise ValueError(f"Error tokenizing '{context}'") from e
ValueError: Error tokenizing 'Y OF TRANSPORT OF THE PEOPLE'S REPUBLIC OF CHINA'
Expected behavior
Create a knowledge graph
Steps to reproduce
Create a knowledge graph from a PDF using the Paper chunking method with GraphRAG enabled. If the PDF contains quotation marks (for example, the apostrophe in "PEOPLE'S"), this error is triggered. The quotation mark appears to be treated as a special character, but the PDF itself cannot be changed. Do quotation marks need to be handled specifically?
Additional information
No response
Give it a try: use General as the chunking method.
I tried Book as well as General. I spent two days on it and it still doesn't work.
14:19:51 [ERROR]Fail to bind LLM used by Knowledge Graph: Error tokenizing 'N ('entity') AND entity_kwd='CONWAY'S CONSULTANT'
14:19:51 [ERROR][Exception]: Error tokenizing 'N ('entity') AND entity_kwd='CONWAY'S CONSULTANT'
I'm having a similar issue. It looks like the cause is that the tokenizer cannot parse the filter condition properly when the value contains an extra single quote (').
Is there any solution?
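One possible workaround, sketched under assumptions: the failing conditions in the logs have the shape `entity_kwd='<name>'`, so an apostrophe inside the name closes the string literal early. Doubling the single quote is the standard SQL way to escape it. The helper names below (`escape_sql_literal`, `build_entity_filter`, `quotes_balanced`) are hypothetical and are not part of RAGFlow or Infinity, and whether the Infinity filter parser accepts doubled quotes has not been verified here.

```python
def escape_sql_literal(value: str) -> str:
    """Double every single quote so the value can sit inside '...'."""
    return value.replace("'", "''")


def build_entity_filter(entity_name: str) -> str:
    # Hypothetical helper: builds a condition in the shape seen in the logs.
    return f"entity_kwd='{escape_sql_literal(entity_name)}'"


def quotes_balanced(cond: str) -> bool:
    # Crude sanity check: a well-formed condition has an even number of quotes.
    return cond.count("'") % 2 == 0


name = "CONWAY'S CONSULTANT"
naive = "entity_kwd='{}'".format(name)  # unescaped, as in the error message
safe = build_entity_filter(name)

print(naive, quotes_balanced(naive))
print(safe, quotes_balanced(safe))
```

The unescaped condition ends up with three single quotes, which matches the unterminated string the tokenizer complains about. The escaped one has four.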
Do you have the back-end error log?
I have a similar issue due to the single quote:
| Traceback (most recent call last):
| File "/ragflow/graphrag/general/extractor.py", line 159, in _merge_nodes
| already_node = self._get_entity_(entity_name)
| File "/ragflow/graphrag/utils.py", line 274, in get_entity
| es_res = settings.retrievaler.search(conds, search.index_name(tenant_id), [kb_id])
| File "<@beartype(rag.nlp.search.Dealer.search) at 0x75fc204f6320>", line 95, in search
| File "/ragflow/rag/nlp/search.py", line 95, in search
| res = self.dataStore.search(src, [], filters, [], orderBy, offset, limit, idx_names, kb_ids)
| File "/ragflow/rag/utils/infinity_conn.py", line 401, in search
| builder.filter(filter_cond)
| File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/table.py", line 390, in filter
| self.query_builder.filter(filter)
| File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/query_builder.py", line 329, in filter
| where_expr = traverse_conditions(condition(where))
| File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4654, in condition
| return maybe_parse( # type: ignore
| File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4284, in maybe_parse
| return sqlglot.parse_one(sql, read=dialect, into=into, **opts)
| File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/__init__.py", line 148, in parse_one
| result = dialect.parse_into(into, sql, **opts)
| File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 168, in parse_into
| return self.parser(**opts).parse_into(expression_type, self.tokenize(sql), sql)
| File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 177, in tokenize
| return self.tokenizer.tokenize(sql)
| File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 814, in tokenize
| raise ValueError(f"Error tokenizing '{context}'") from e
| ValueError: Error tokenizing ' AND entity_kwd='REGIONAL EUROPE'S MAJOR MARKETS'
It's the same error as above. I didn't have a chance to copy it and switched back to Qwen 2.5 as I was in a hurry. Qwen works fine.
Thank you @stevenguan08. Switching from ministral:3b to qwen2.5:3b solved the error.