[Bug]: ERROR:root:Fail to bind LLM used by Knowledge Graph: Error tokenizing 'Y OF TRANSPORT OF THE PEOPLE'S REPUBLIC OF CHINA'

Open NiuStar opened this issue 10 months ago • 7 comments

Is there an existing issue for the same bug?

  • [x] I have checked the existing issues.

RAGFlow workspace code commit ID

null

RAGFlow image version

v0.16.0-196-g3b30799b slim

Other environment information

docker debian12

Actual behavior

ERROR:root:Fail to bind LLM used by Knowledge Graph: Error tokenizing 'Y OF TRANSPORT OF THE PEOPLE'S REPUBLIC OF CHINA'

Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 807, in tokenize
    self._scan()
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 832, in _scan
    self._scan_keywords()
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 944, in _scan_keywords
    if self._scan_string(word):
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 1063, in _scan_string
    self._advance(len(quote))
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 858, in _advance
    self._char = self.sql[self._current - 1]
IndexError: string index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 548, in do_handle_task
    run_graphrag(task, chat_model, task_language, embedding_model, progress_callback)
  File "/ragflow/rag/svr/task_executor.py", line 473, in run_graphrag
    Dealer(LightKGExt if row["parser_config"]["graphrag"]["method"] != 'general' else GeneralKGExt,
  File "/ragflow/graphrag/general/index.py", line 54, in __init__
    ents, rels = ext(chunks, callback)
  File "/ragflow/graphrag/general/extractor.py", line 132, in __call__
    n = t.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/ragflow/graphrag/general/extractor.py", line 162, in _merge_nodes
    already_node = self._get_entity_(entity_name)
  File "/ragflow/graphrag/utils.py", line 246, in get_entity
    es_res = settings.retrievaler.search(conds, search.index_name(tenant_id), [kb_id])
  File "<@beartype(rag.nlp.search.Dealer.search) at 0x7f922d20be20>", line 95, in search
  File "/ragflow/rag/nlp/search.py", line 95, in search
    res = self.dataStore.search(src, [], filters, [], orderBy, offset, limit, idx_names, kb_ids)
  File "/ragflow/rag/utils/infinity_conn.py", line 399, in search
    builder.filter(filter_cond)
  File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/table.py", line 390, in filter
    self.query_builder.filter(filter)
  File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/query_builder.py", line 329, in filter
    where_expr = traverse_conditions(condition(where))
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4654, in condition
    return maybe_parse(  # type: ignore
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4284, in maybe_parse
    return sqlglot.parse_one(sql, read=dialect, into=into, **opts)
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/__init__.py", line 148, in parse_one
    result = dialect.parse_into(into, sql, **opts)
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 168, in parse_into
    return self.parser(**opts).parse_into(expression_type, self.tokenize(sql), sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 177, in tokenize
    return self.tokenizer.tokenize(sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 814, in tokenize
    raise ValueError(f"Error tokenizing '{context}'") from e
ValueError: Error tokenizing 'Y OF TRANSPORT OF THE PEOPLE'S REPUBLIC OF CHINA'

Expected behavior

Create a knowledge graph

Steps to reproduce

Create a knowledge graph using the paper chunking method with GraphRAG. If the PDF contains quotation marks, this error is triggered. It appears to be caused by quotation marks being treated as special characters, but the PDF cannot be changed. Do quotation marks need to be handled specially?
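
To illustrate the suspected cause, here is a minimal sketch. It assumes the entity name is interpolated directly into a single-quoted, SQL-style filter string; the helper names and the `entity_kwd` column used here are illustrative assumptions, not RAGFlow's actual code. An apostrophe inside the name leaves an odd number of single quotes, so the string literal never closes and the tokenizer runs past the end of the input:

```python
def naive_filter(entity_name: str) -> str:
    # Hypothetical builder: the name is placed inside a single-quoted literal as-is.
    return f"entity_kwd='{entity_name}'"


def has_balanced_quotes(condition: str) -> bool:
    # An even number of single quotes means every opened literal is also closed.
    # This is a rough heuristic for illustration only, not a SQL parser.
    return condition.count("'") % 2 == 0


ok = naive_filter("CHINA")
broken = naive_filter("PEOPLE'S REPUBLIC OF CHINA")

print(ok, has_balanced_quotes(ok))
print(broken, has_balanced_quotes(broken))
```

The second condition contains three single quotes, which matches the shape of the string shown in the error message.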

Additional information

No response

NiuStar avatar Mar 01 '25 14:03 NiuStar

Try using General as the chunking method.

KevinHuSh avatar Mar 03 '25 04:03 KevinHuSh

I tried Book as well as General. I spent 2 days on it and it still doesn't work.

stevenguan08 avatar Mar 09 '25 07:03 stevenguan08

14:19:51 [ERROR]Fail to bind LLM used by Knowledge Graph: Error tokenizing 'N ('entity') AND entity_kwd='CONWAY'S CONSULTANT'
14:19:51 [ERROR][Exception]: Error tokenizing 'N ('entity') AND entity_kwd='CONWAY'S CONSULTANT'

I'm having a similar issue. It looks like the tokenizer fails to parse the condition when it contains multiple single quotes (').

Is there any solution?
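
One possible workaround, sketched here under assumptions: escape single quotes in the entity name before it is placed in the filter. Doubling the quote (`''`) is the standard SQL way to embed an apostrophe in a string literal. Whether the filter parser used by RAGFlow and Infinity accepts this form is an assumption, and `escape_sql_literal` and `entity_filter` are hypothetical helpers, not functions from the project.

```python
def escape_sql_literal(value: str) -> str:
    # Standard SQL escapes a single quote inside a literal by doubling it.
    return value.replace("'", "''")


def entity_filter(entity_name: str) -> str:
    # Hypothetical filter builder modeled on the fragment in the error message.
    return f"entity_kwd='{escape_sql_literal(entity_name)}'"


print(entity_filter("CONWAY'S CONSULTANT"))
# prints: entity_kwd='CONWAY''S CONSULTANT'
```

With the apostrophe doubled, the literal contains an even number of single quotes and closes where intended.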

LiJinHao999 avatar Mar 10 '25 06:03 LiJinHao999

Do you have the backend error log?

KevinHuSh avatar Mar 10 '25 06:03 KevinHuSh

I have a similar issue caused by the single quote:

| Traceback (most recent call last):
|   File "/ragflow/graphrag/general/extractor.py", line 159, in _merge_nodes
|     already_node = self._get_entity_(entity_name)
|   File "/ragflow/graphrag/utils.py", line 274, in get_entity
|     es_res = settings.retrievaler.search(conds, search.index_name(tenant_id), [kb_id])
|   File "<@beartype(rag.nlp.search.Dealer.search) at 0x75fc204f6320>", line 95, in search
|   File "/ragflow/rag/nlp/search.py", line 95, in search
|     res = self.dataStore.search(src, [], filters, [], orderBy, offset, limit, idx_names, kb_ids)
|   File "/ragflow/rag/utils/infinity_conn.py", line 401, in search
|     builder.filter(filter_cond)
|   File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/table.py", line 390, in filter
|     self.query_builder.filter(filter)
|   File "/ragflow/.venv/lib/python3.10/site-packages/infinity/remote_thrift/query_builder.py", line 329, in filter
|     where_expr = traverse_conditions(condition(where))
|   File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4654, in condition
|     return maybe_parse(  # type: ignore
|   File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/expressions.py", line 4284, in maybe_parse
|     return sqlglot.parse_one(sql, read=dialect, into=into, **opts)
|   File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/__init__.py", line 148, in parse_one
|     result = dialect.parse_into(into, sql, **opts)
|   File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 168, in parse_into
|     return self.parser(**opts).parse_into(expression_type, self.tokenize(sql), sql)
|   File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/dialects/dialect.py", line 177, in tokenize
|     return self.tokenizer.tokenize(sql)
|   File "/ragflow/.venv/lib/python3.10/site-packages/sqlglot/tokens.py", line 814, in tokenize
|     raise ValueError(f"Error tokenizing '{context}'") from e
| ValueError: Error tokenizing ' AND entity_kwd='REGIONAL EUROPE'S MAJOR MARKETS'

anthonyrabiaza avatar Mar 17 '25 10:03 anthonyrabiaza

It's the same error as above. I didn't have a chance to copy it and switched back to qwen 2.5 as I was in a hurry. Qwen works fine.

stevenguan08 avatar Mar 17 '25 10:03 stevenguan08

Thank you @stevenguan08. Switching from ministral:3b to qwen2.5:3b solved the error.

anthonyrabiaza avatar Mar 17 '25 12:03 anthonyrabiaza