
[BUG] Error: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'

Open gutama opened this issue 1 year ago • 1 comments

Description

I use Ollama for the LLM and embeddings in LightRAG, and all the connection tests pass. When I upload a text file it can do chunking and generate embeddings, but it cannot do entity and relationship extraction.

The error was:

[GraphRAG] Creating index... This can take a long time.
[GraphRAG] Indexed 0 / 1 documents.
Error: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'

I don't have any other error info to work from.


Reproduction steps

Adding documents to doc store
indexing step took 0.11500287055969238
GraphRAG embedding dim 1024
Indexing GraphRAG with LLM ChatOpenAI(api_key=ollama, base_url=http://localhos..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=granite3.1-dense, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=ollama, base_url=http://localhos..., context_length=None, dimensions=None, max_retries=None, max_retries_=2, model=bge-m3, organization=None, timeout=None)...
Chunking documents: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.77doc/s]
Generating embeddings: 100%|███████████████████████████████████████████████| 4/4 [00:15<00:00,  3.76s/batch]
use_quick_index_mode False
reader_mode default
Chunk size: None, chunk overlap: None
Using reader TxtReader()
Got 0 page thumbnails
Adding documents to doc store
indexing step took 0.10607624053955078
GraphRAG embedding dim 768
Indexing GraphRAG with LLM ChatOpenAI(api_key=ollama, base_url=http://localhos..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=granite3.1-dense, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=ollama, base_url=http://localhos..., context_length=None, dimensions=None, max_retries=None, max_retries_=2, model=nomic-embed-text, organization=None, timeout=None)...
Chunking documents: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.83doc/s]
Generating embeddings: 100%|███████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/batch]
use_quick_index_mode False
reader_mode default
Chunk size: None, chunk overlap: None
Using reader TxtReader()
Got 0 page thumbnails
Adding documents to doc store
indexing step took 0.11184024810791016
GraphRAG embedding dim 768
Indexing GraphRAG with LLM ChatOpenAI(api_key=ollama, base_url=http://localhos..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=granite3.1-dense, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=ollama, base_url=http://localhos..., context_length=None, dimensions=None, max_retries=None, max_retries_=2, model=nomic-embed-tex..., organization=None, timeout=None)...
Chunking documents: 100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.48doc/s]
Generating embeddings: 100%|███████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/batch]

Screenshots

![image](https://github.com/user-attachments/assets/19a684cd-fb7c-432c-a9d8-a86cd353cea3)


![image](https://github.com/user-attachments/assets/7dcd0ba2-4640-4ffe-a783-c2e468bbf80f)

Logs

No response

Browsers

No response

OS

No response

Additional information


gutama avatar Jan 01 '25 14:01 gutama

I cleaned up the whole ktem_app_data folder. It is running now, albeit for a very long time, and it didn't extract any relationships. Is there anything I missed?


Extracting entities from chunks: 100%|████████████████████████████████| 98/98 [3:12:57<00:00, 118.14s/chunk]
Inserting entities: 100%|████████████████████████████████████████████████| 4/4 [00:00<00:00, 210.52entity/s]
Inserting relationships: 0relationship [00:00, ?relationship/s]
2025-01-02T14:19:26.049446Z [warning ] Didn't extract any relationships asctime=2025-01-02 21:19:26,045 lineno=427 message=Didn't extract any relationships module=lightrag
Generating embeddings: 100%|███████████████████████████████████████████████| 1/1 [00:13<00:00, 13.96s/batch]
2025-01-02T14:19:40.017763Z [warning ] You insert an empty data to vector DB asctime=2025-01-02 21:19:40,017 lineno=85 message=You insert an empty data to vector DB module=lightrag

gutama avatar Jan 02 '25 15:01 gutama

I'm getting the same error; the upload info reports:

Indexing [1/1]: Overview.pdf
 => Converting Overview.pdf to text
 => Converted Overview.pdf to text
 => [Overview.pdf] Processed 8 chunks
 => Finished indexing Overview.pdf
[GraphRAG] Creating index... This can take a long time.
[GraphRAG] Indexed 0 / 4 documents.
Error: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'

Console logs:

Using reader <kotaemon.loaders.pdf_loader.PDFThumbnailReader object at 0x7fc5c85dcdf0>
Page numbers: 4
Got 4 page thumbnails
Adding documents to doc store
indexing step took 0.515655517578125
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
GraphRAG embedding dim 1536
Indexing GraphRAG with LLM ChatOpenAI(api_key=kserve, base_url=http://212.189...., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=llama3, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=None, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=sk-proj-mORxO60..., base_url=https://api.ope..., context_length=8191, dimensions=None, max_retries=None, max_retries_=2, model=text-embedding-..., organization=None, timeout=10)...
INFO:lightrag:Logger initialized for working directory: /workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input
INFO:lightrag:Load KV llm_response_cache with 0 data
INFO:lightrag:Load KV full_docs with 0 data
INFO:lightrag:Load KV text_chunks with 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '/workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input/vdb_entities.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '/workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input/vdb_relationships.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1536, 'metric': 'cosine', 'storage_file': '/workspaces/kotaemon/ktem_app_data/user_data/files/lightrag/acd0abb6-fe43-4388-ba80-80a1df124837/input/vdb_chunks.json'} 0 data
INFO:lightrag:Creating a new event loop in main thread.
INFO:lightrag:[New Docs] inserting 1 docs
Chunking documents: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.01doc/s]
INFO:lightrag:[New Chunks] inserting 2 chunks
INFO:lightrag:Inserting 2 vectors to chunks
Generating embeddings:   0%|                                                                                                                                  | 0/1 [00:00<?, ?batch/s]INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Generating embeddings: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.18batch/s]
INFO:lightrag:[Entity Extraction]...
INFO:lightrag:Writing graph with 0 nodes, 0 edges

Running Kotaemon 0.9.11 locally.

mginfn avatar Jan 16 '25 15:01 mginfn

Also reported here: https://github.com/Cinnamon/kotaemon/issues/583

mginfn avatar Jan 16 '25 15:01 mginfn

Hi @gutama, I upgraded LightRAG from 1.0.8 to 1.1.2 and the issue seems to be gone. Would you please give it a try? Thanks

```shell
pip install lightrag-hku==1.1.2
pip uninstall hnswlib chroma-hnswlib && pip install chroma-hnswlib  # fix issue 562
```

mginfn avatar Jan 17 '25 08:01 mginfn

I updated to LightRAG 1.1.2 and still have the issue:

GraphRAG embedding dim 3072
Indexing GraphRAG with LLM ChatOpenAI(api_key=sk-FKUCbQyYtEDR..., base_url=https://api.ope..., frequency_penalty=None, logit_bias=None, logprobs=None, max_retries=None, max_retries_=2, max_tokens=None, model=gpt-4o-mini, n=1, organization=None, presence_penalty=None, stop=None, temperature=None, timeout=20, tool_choice=None, tools=None, top_logprobs=None, top_p=None) and Embedding OpenAIEmbeddings(api_key=sk-FKUCbQyYtEDR..., base_url=https://api.ope..., context_length=8191, dimensions=None, max_retries=None, max_retries_=2, model=text-embedding-..., organization=None, timeout=10)...
Generating embeddings: 100%|███████████████████████████████████████████████████████████| 4/4 [00:13<00:00, 3.29s/batch]
2025-01-19T22:22:09.163045Z [error ] Failed to process document doc-32435f31e42714fe72ace9541cc7f785: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'
Traceback (most recent call last):
  File "/home/ginanjar/.local/lib/python3.10/site-packages/lightrag/lightrag.py", line 463, in ainsert
    raise e
  File "/home/ginanjar/.local/lib/python3.10/site-packages/lightrag/lightrag.py", line 422, in ainsert
    maybe_new_kg = await extract_entities(
  File "/home/ginanjar/.local/lib/python3.10/site-packages/lightrag/operate.py", line 331, in extract_entities
    examples = examples.format(**example_context_base)
KeyError: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'
asctime=2025-01-20 05:22:09,162 lineno=469 message=Failed to process document doc-32435f31e42714fe72ace9541cc7f785: '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n' module=lightrag
Processing batch 1: 100%|█████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.96s/it]

gutama avatar Jan 19 '25 22:01 gutama

This may come from the function that reads and customizes the LightRAG prompts. I fixed it in my fork by deleting that function: https://github.com/RoadToNowhereX/kotaemon/commit/bda904ce847c8172140a92087bbbaa462b832a6f

RoadToNowhereX avatar Jan 21 '25 07:01 RoadToNowhereX

The problem could be that the tuple_delimiter setting was not set, or that it differs between Windows and Linux. Where can we set this tuple_delimiter setting?

gutama avatar Jan 21 '25 07:01 gutama

> The problem could be that the tuple_delimiter setting was not set, or that it differs between Windows and Linux. Where can we set this tuple_delimiter setting?

For MS GraphRAG, it's in graphrag/index/operations/extract_entities/graph_extractor.py


For nano-GraphRAG, it's in nano_graphrag/_op.py and nano_graphrag/prompt.py


For LightRAG, it's in lightrag/prompt.py and lightrag/operate.py


RoadToNowhereX avatar Jan 21 '25 07:01 RoadToNowhereX

> The problem could be that the tuple_delimiter setting was not set, or that it differs between Windows and Linux. Where can we set this tuple_delimiter setting?

I think the point is the function that reads prompt.py. Ideally we should get the string '{tuple_delimiter}', but instead we get '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n', so I just skip the step of reading from prompt.py.
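
The mismatch can be reproduced in isolation with plain `str.format` (the delimiter value `<|>` below is just an illustrative stand-in, not the real setting): an intact `{tuple_delimiter}` placeholder resolves, while a field name with newlines inserted between every character raises exactly the `KeyError` seen in the tracebacks above.

```python
# str.format looks up each placeholder's field name in the supplied mapping.
context = {"tuple_delimiter": "<|>"}

# Intact template: the field name "tuple_delimiter" is found.
ok = "entities{tuple_delimiter}relations".format(**context)
print(ok)  # entities<|>relations

# Garbled template: a newline was inserted between every character,
# so the field name no longer matches any key in the mapping.
garbled = "{\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n}"
try:
    garbled.format(**context)
except KeyError as exc:
    print(repr(exc.args[0]))  # '\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n'
```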

RoadToNowhereX avatar Jan 21 '25 07:01 RoadToNowhereX

Thanks @RoadToNowhereX for your workaround, removing that function fixes the problem.

Just to give some insight: the following prompt from LightRAG is defined as a list

```python
PROMPTS["entity_extraction_examples"] = [
    """Example 1:

Entity_types: [person, technology, mission, organization, location]
Text:
...
]
```

but while debugging Kotaemon I noticed that it gets flattened to a string, and then the following line inserts "\n" between every character, producing the error we are discussing:

```python
examples = "\n".join(PROMPTS["entity_extraction_examples"])
```

Hope it can help.
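
The flattening is easy to demonstrate in isolation: `"\n".join` on a list joins whole elements, but on a string it iterates over characters, which is precisely how an intact `{tuple_delimiter}` placeholder turns into the garbled key (the example strings below are stand-ins, not the real LightRAG prompts).

```python
# On a list, join inserts "\n" between whole elements, as intended.
examples_as_list = ["Example 1: ...", "Example 2: ..."]
print("\n".join(examples_as_list))

# On a string, join iterates character by character instead, so an
# intact placeholder becomes a newline-separated mess whose format
# field name is exactly the KeyError value seen in this issue.
flattened = "{tuple_delimiter}"
joined = "\n".join(flattened)
print(repr(joined))  # '{\nt\nu\np\nl\ne\n_\nd\ne\nl\ni\nm\ni\nt\ne\nr\n}'
```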

mginfn avatar Jan 21 '25 08:01 mginfn

> Thanks @RoadToNowhereX for your workaround, removing that function fixes the problem. Just to give some insight: the LightRAG prompt `PROMPTS["entity_extraction_examples"]` is defined as a list, but while debugging Kotaemon I noticed that it gets flattened to a string, and then `examples = "\n".join(PROMPTS["entity_extraction_examples"])` inserts "\n" between every character, producing the error we are discussing.

Thank you! So this "content" here should be a list instead of a string. But how do we assign the type of a variable in Python? I only know a little Python.
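
Python variables are not declared with a fixed type up front, so the usual defence is a runtime `isinstance` check (optionally with a type hint for readers and linters). A minimal sketch, assuming a `PROMPTS` dict shaped like LightRAG's; the guard itself is hypothetical, not actual LightRAG code:

```python
from typing import List

# Simulate the bug: the list has been flattened to a plain string.
PROMPTS = {"entity_extraction_examples": "Example 1: ..."}

value = PROMPTS["entity_extraction_examples"]
if isinstance(value, str):
    # Re-wrap a flattened string so join treats it as one element
    # instead of iterating over its characters.
    value = [value]

examples: List[str] = value
print("\n".join(examples))  # Example 1: ...
```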


RoadToNowhereX avatar Jan 21 '25 09:01 RoadToNowhereX

If I use gpt-4o, though, the error does not show up.

gutama avatar Jan 21 '25 13:01 gutama

My fix for LightRAG and nano-GraphRAG (https://github.com/Cinnamon/kotaemon/pull/643) works with LightRAG 1.1.3.

RoadToNowhereX avatar Jan 23 '25 04:01 RoadToNowhereX

Fixed in the new release, thanks @RoadToNowhereX

taprosoft avatar Feb 03 '25 01:02 taprosoft