
[BUG] Error: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'

Open gutama opened this issue 1 year ago • 1 comments

Description

I use Ollama for the LLM and embeddings in LightRAG, and all the connection tests pass. When I upload a text file it can do chunking and generate embeddings, but it cannot do entity and relationship extraction.

The error was:

[GraphRAG] Creating index... This can take a long time.
[GraphRAG] Indexed 0 / 1 documents.
Error: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'

I don't have any other error info to work from.


Reproduction steps

Adding documents to doc store
indexing step took 0.11500287055969238
GraphRAG embedding dim 1024
Indexing GraphRAG with LLM ChatOpenAI(api_key=ollama, base_url=http://localhos..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=granite3.1-dense, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=ollama, base_url=http://localhos..., context_length=None, dimensions=None, max_retries=None, max_retries_=2, model=bge-m3, organization=None, timeout=None)...
Chunking documents: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.77doc/s]
Generating embeddings: 100%|███████████████████████████████████████████████| 4/4 [00:15<00:00,  3.76s/batch]
use_quick_index_mode False
reader_mode default
Chunk size: None, chunk overlap: None
Using reader TxtReader()
Got 0 page thumbnails
Adding documents to doc store
indexing step took 0.10607624053955078
GraphRAG embedding dim 768
Indexing GraphRAG with LLM ChatOpenAI(api_key=ollama, base_url=http://localhos..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=granite3.1-dense, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=ollama, base_url=http://localhos..., context_length=None, dimensions=None, max_retries=None, max_retries_=2, model=nomic-embed-text, organization=None, timeout=None)...
Chunking documents: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.83doc/s]
Generating embeddings: 100%|███████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/batch]
use_quick_index_mode False
reader_mode default
Chunk size: None, chunk overlap: None
Using reader TxtReader()
Got 0 page thumbnails
Adding documents to doc store
indexing step took 0.11184024810791016
GraphRAG embedding dim 768
Indexing GraphRAG with LLM ChatOpenAI(api_key=ollama, base_url=http://localhos..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=granite3.1-dense, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=ollama, base_url=http://localhos..., context_length=None, dimensions=None, max_retries=None, max_retries_=2, model=nomic-embed-tex..., organization=None, timeout=None)...
Chunking documents: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.48doc/s]
Generating embeddings: 100%|███████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/batch]

Screenshots

![image](https://github.com/user-attachments/assets/19a684cd-fb7c-432c-a9d8-a86cd353cea3)


![image](https://github.com/user-attachments/assets/7dcd0ba2-4640-4ffe-a783-c2e468bbf80f)

Logs

No response

Browsers

No response

OS

No response

Additional information


gutama avatar Jan 01 '25 14:01 gutama

I cleaned up the whole ktem_app_data folder. It is running now, albeit for a very long time, and it didn't extract any relationships. Is there anything I missed?


Extracting entities from chunks: 100%|████████████████████████████████| 98/98 [3:12:57<00:00, 118.14s/chunk]
Inserting entities: 100%|████████████████████████████████████████████████| 4/4 [00:00<00:00, 210.52entity/s]
Inserting relationships: 0relationship [00:00, ?relationship/s]
2025-01-02T14:19:26.049446Z [warning ] Didn't extract any relationships asctime=2025-01-02 21:19:26,045 lineno=427 message=Didn't extract any relationships module=lightrag
Generating embeddings: 100%|███████████████████████████████████████████████| 1/1 [00:13<00:00, 13.96s/batch]
2025-01-02T14:19:40.017763Z [warning ] You insert an empty data to vector DB asctime=2025-01-02 21:19:40,017 lineno=85 message=You insert an empty data to vector DB module=lightrag

gutama avatar Jan 02 '25 15:01 gutama

I'm getting the same error; the upload info reports:

Indexing [1/1]: Overview.pdf
 => Converting Overview.pdf to text
 => Converted Overview.pdf to text
 => [Overview.pdf] Processed 8 chunks
 => Finished indexing Overview.pdf
[GraphRAG] Creating index... This can take a long time.
[GraphRAG] Indexed 0 / 4 documents.
Error: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'

Console logs:

Using reader <kotaemon.loaders.pdf_loader.PDFThumbnailReader object at 0x7fc5c85dcdf0>
Page numbers: 4
Got 4 page thumbnails
Adding documents to doc store
indexing step took 0.515655517578125
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
GraphRAG embedding dim 1536
Indexing GraphRAG with LLM ChatOpenAI(api_key=kserve, base_url=http://212.189...., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=llama3, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=sk-proj-mORxO60..., base_url=https://api.ope..., context_length=8191, dimensions=None, max_retries=None, max_retries_=2, model=text-embedding-..., organization=None, timeout=10)...
INFO:lightrag:Logger initialized for working directory: /workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input
INFO:lightrag:Load KV llm_response_cache with 0 data
INFO:lightrag:Load KV full_docs with 0 data
INFO:lightrag:Load KV text_chunks with 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '/workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input/vdb_entities.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '/workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input/vdb_relationships.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '/workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input/vdb_chunks.json'} 0 data
INFO:lightrag:Creating a new event loop in main thread.
INFO:lightrag:[New Docs] inserting 1 docs
Chunking documents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.01doc/s]
INFO:lightrag:[New Chunks] inserting 2 chunks
INFO:lightrag:Inserting 2 vectors to chunks
Generating embeddings:   0%|                                                                                                                                  | 0/1 [00:00<?, ?batch/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Generating embeddings: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.18batch/s]
INFO:lightrag:[Entity Extraction]...
INFO:lightrag:Writing graph with 0 nodes, 0 edges

Running Kotaemon 0.9.11 locally.

mginfn avatar Jan 16 '25 15:01 mginfn

Also reported here: https://github.com/Cinnamon/kotaemon/issues/583

mginfn avatar Jan 16 '25 15:01 mginfn

Hi @gutama, I upgraded LightRAG from 1.0.8 to 1.1.2 and the issue seems to be gone. Would you please give it a try? Thanks

```shell
pip install lightrag-hku==1.1.2
pip uninstall hnswlib chroma-hnswlib && pip install chroma-hnswlib  # fix issue 562
```

mginfn avatar Jan 17 '25 08:01 mginfn

I updated to LightRAG 1.1.2 and still have the issue:

GraphRAG embedding dim 3072
Indexing GraphRAG with LLM ChatOpenAI(api_key=sk-FKUCbQyYtEDR..., base_url=https://api.ope..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=gpt-4o-mini, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=20, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=sk-FKUCbQyYtEDR..., base_url=https://api.ope..., context_length=8191, dimensions=None, max_retries=None, max_retries_=2, model=text-embedding-..., organization=None, timeout=10)...
Generating embeddings: 100%|███████████████████████████████████████████████████████████| 4/4 [00:13<00:00, 3.29s/batch]
2025-01-19T22:22:09.163045Z [error ] Failed to process document doc-32435f31e42714fe72ace9541cc7f785: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'
Traceback (most recent call last):
  File "/home/ginanjar/.local/lib/python3.10/site-packages/lightrag/lightrag.py", line 463, in ainsert
    raise e
  File "/home/ginanjar/.local/lib/python3.10/site-packages/lightrag/lightrag.py", line 422, in ainsert
    maybe_new_kg = await extract_entities(
  File "/home/ginanjar/.local/lib/python3.10/site-packages/lightrag/operate.py", line 331, in extract_entities
    examples = examples.format(**example_context_base)
KeyError: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'
asctime=2025-01-20 05:22:09,162 lineno=469 message=Failed to process document doc-32435f31e42714fe72ace9541cc7f785: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n' module=lightrag
Processing batch 1: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.96s/it]

gutama avatar Jan 19 '25 22:01 gutama

This may come from the function that reads and customizes the LightRAG prompts. I fixed it in my fork by deleting that function: https://github.com/RoadToNowhereX/kotaemon/commit/bda904ce847c8172140a92087bbbaa462b832a6f

RoadToNowhereX avatar Jan 21 '25 07:01 RoadToNowhereX

The problem could be that the tuple_delimiter setting was not set, or that it differs between Windows and Linux. Where can we set this tuple_delimiter setting?

gutama avatar Jan 21 '25 07:01 gutama

> The problem could be that the tuple_delimiter setting was not set, or that it differs between Windows and Linux. Where can we set this tuple_delimiter setting?

For MS GraphRAG, it's in graphrag/index/operations/extract_entities/graph_extractor.py


For nano-GraphRAG, it's in nano_graphrag/_op.py and nano_graphrag/prompt.py


For LightRAG, it's in lightrag/prompt.py and lightrag/operate.py


RoadToNowhereX avatar Jan 21 '25 07:01 RoadToNowhereX

> The problem could be that the tuple_delimiter setting was not set, or that it differs between Windows and Linux. Where can we set this tuple_delimiter setting?

I think the point is the function that reads prompt.py. Ideally we should get the string '{tuple_delimiter}', but instead we get '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n', so I just skip the step of reading from prompt.py.
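
The mismatch can be reproduced in isolation with plain `str.format` (the delimiter value `<|>` below is just an illustrative stand-in, not the real setting): an intact `{tuple_delimiter}` placeholder resolves, while a field name with newlines inserted between every character raises exactly the `KeyError` seen in the tracebacks above.

```python
# str.format looks up each placeholder's field name in the supplied mapping.
context = {"tuple_delimiter": "<|>"}

# Intact template: the field name "tuple_delimiter" is found.
ok = "entities{tuple_delimiter}relations".format(**context)
print(ok)  # entities<|>relations

# Garbled template: a newline was inserted between every character,
# so the field name no longer matches any key in the mapping.
garbled = "{\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n}"
try:
    garbled.format(**context)
except KeyError as exc:
    print(repr(exc.args[0]))  # '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'
```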

RoadToNowhereX avatar Jan 21 '25 07:01 RoadToNowhereX

Thanks @RoadToNowhereX for your workaround, removing that function fixes the problem.

Just to give some insight: the following prompt from LightRAG is defined as a list

```python
PROMPTS["entity_extraction_examples"] = [
    """Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
...
]
```

but while debugging Kotaemon I noticed that it gets flattened to a string, and then the following line inserts "\n" between every character, producing the error we are discussing:

```python
examples = "\n".join(PROMPTS["entity_extraction_examples"])
```

Hope it can help.
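
The flattening is easy to demonstrate in isolation: `"\n".join` on a list joins whole elements, but on a string it iterates over characters, which is precisely how an intact `{tuple_delimiter}` placeholder turns into the garbled key (the example strings below are stand-ins, not the real LightRAG prompts).

```python
# On a list, join inserts "\n" between whole elements, as intended.
examples_as_list = ["Example 1: ...", "Example 2: ..."]
print("\n".join(examples_as_list))

# On a string, join iterates character by character instead, so an
# intact placeholder becomes a newline-separated mess whose format
# field name is exactly the KeyError value seen in this issue.
flattened = "{tuple_delimiter}"
joined = "\n".join(flattened)
print(repr(joined))  # '{\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n}'
```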

mginfn avatar Jan 21 '25 08:01 mginfn

> Thanks @RoadToNowhereX for your workaround, removing that function fixes the problem. Just to give some insight: the LightRAG prompt `PROMPTS["entity_extraction_examples"]` is defined as a list, but while debugging Kotaemon I noticed that it gets flattened to a string, and then `examples = "\n".join(PROMPTS["entity_extraction_examples"])` inserts "\n" between every character, producing the error we are discussing.

Thank you! So this "content" here should be a list instead of a string. But how do we assign the type of a variable in Python? I only know a little Python.
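
Python variables are not declared with a fixed type up front, so the usual defence is a runtime `isinstance` check (optionally with a type hint for readers and linters). A minimal sketch, assuming a `PROMPTS` dict shaped like LightRAG's; the guard itself is hypothetical, not actual LightRAG code:

```python
from typing import List

# Simulate the bug: the list has been flattened to a plain string.
PROMPTS = {"entity_extraction_examples": "Example 1: ..."}

value = PROMPTS["entity_extraction_examples"]
if isinstance(value, str):
    # Re-wrap a flattened string so join treats it as one element
    # instead of iterating over its characters.
    value = [value]

examples: List[str] = value
print("\n".join(examples))  # Example 1: ...
```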


RoadToNowhereX avatar Jan 21 '25 09:01 RoadToNowhereX

If I use gpt-4o, though, the error does not show up.

gutama avatar Jan 21 '25 13:01 gutama

My fix for LightRAG and nano-GraphRAG (https://github.com/Cinnamon/kotaemon/pull/643) works with LightRAG 1.1.3.

RoadToNowhereX avatar Jan 23 '25 04:01 RoadToNowhereX

Fixed in the new release, thanks @RoadToNowhereX

taprosoft avatar Feb 03 '25 01:02 taprosoft