[Question]: I construct node relationships based on TextNode, but when I query the data of the second node, I can never find the content of the third paragraph
Question Validation
- [X] I have searched both the documentation and discord for an answer.
Question
The following is my node data structure. My first point is in node2, and my second and third points are in node3. When I use the CITATION query engine to query, I can only find the first point in node2; the points in node3 are missing. Please help me.

```json
[
  {
    "node_id": "98292fcf-be12-4198-ba26-260e64d06432_1",
    "text": "first trunk content",
    "relationships": {
      "NEXT": { "node_id": "98292fcf-be12-4198-ba26-260e64d06432_2" }
    }
  },
  {
    "node_id": "98292fcf-be12-4198-ba26-260e64d06432_2",
    "text": "second trunk content, 1.xxxxx",
    "relationships": {
      "PREVIOUS": { "node_id": "98292fcf-be12-4198-ba26-260e64d06432_1" },
      "NEXT": { "node_id": "98292fcf-be12-4198-ba26-260e64d06432_3" }
    }
  },
  {
    "node_id": "98292fcf-be12-4198-ba26-260e64d06432_3",
    "text": "third trunk content, 2.xxxxx, 3.xxxxx",
    "relationships": {
      "PREVIOUS": { "node_id": "98292fcf-be12-4198-ba26-260e64d06432_2" }
    }
  }
]
```
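For reference, here is a minimal sketch of how such NEXT/PREVIOUS relationships can be wired up in code, assuming a llama_index version where `TextNode`, `NodeRelationship`, and `RelatedNodeInfo` are importable from `llama_index.schema` (the node IDs are placeholders):

```python
from llama_index.schema import NodeRelationship, RelatedNodeInfo, TextNode

# Three chunks of one document, linked as a doubly linked list.
node1 = TextNode(text="first trunk content", id_="doc_1")
node2 = TextNode(text="second trunk content, 1.xxxxx", id_="doc_2")
node3 = TextNode(text="third trunk content, 2.xxxxx, 3.xxxxx", id_="doc_3")

node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=node2.node_id)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=node1.node_id)
node2.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=node3.node_id)
node3.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=node2.node_id)
```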
🤖
Hello @rzechen, good to see you back here. I hope you're having a great day! Let's dive into this issue of yours.
Based on the node data structure you provided, it seems like you have correctly set up the relationships between your nodes. However, the issue you're experiencing might be due to how the citation query engine is handling these relationships.
In the LlamaIndex framework, the citation query engine handles node relationships through a recursive, bottom-up approach. This is done by the `build_index_from_nodes` method in the `GPTTreeIndexBuilder` class. If the number of current node IDs is greater than the number of children per parent, it prepares the node and text chunks using the `_prepare_node_and_text_chunks` method. It then generates summaries for each text chunk and constructs parent nodes using the `_construct_parent_nodes` method, which inserts each new node into the index graph, saves it to the document store, and returns a dictionary of new node IDs. The builder then merges the new node IDs into the dictionary of all node IDs and sets the root nodes of the index graph to the new node IDs. If the number of new node IDs is less than or equal to the number of children, it returns the index graph; otherwise, it recursively calls itself with the updated parameters.
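To make that recursion concrete, here is a minimal, self-contained sketch of the bottom-up pattern described above. It is not LlamaIndex's actual implementation; `summarize` stands in for the LLM summary call, and all the names are hypothetical:

```python
from typing import List

def summarize(chunks: List[str]) -> str:
    """Stand-in for the LLM call that summarizes a group of child chunks."""
    return " | ".join(chunks)

def build_level(chunks: List[str], num_children: int) -> List[str]:
    """Recursively collapse chunks into parent summaries, bottom-up,
    until at most num_children root nodes remain."""
    if len(chunks) <= num_children:
        return chunks  # few enough: these become the root nodes
    parents = [
        summarize(chunks[i : i + num_children])
        for i in range(0, len(chunks), num_children)
    ]
    return build_level(parents, num_children)

# Example: five leaf chunks collapse through two levels into two roots.
roots = build_level(["c1", "c2", "c3", "c4", "c5"], num_children=2)
print(roots)
```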
Separately, on the parsing side: if your third paragraph is within the same HTML tag as the second one (for example, both are within `<p>` tags), the parser will treat them as a single node. Therefore, when you query the data of the second node, you're actually getting the content of both the second and third paragraphs.
To resolve this issue, you might need to modify the parsing logic to treat each paragraph as a separate node, even if they share the same HTML tag. Here's a possible modification to the `get_nodes_from_node` method:
```python
def get_nodes_from_node(self, node: BaseNode) -> List[TextNode]:
    """Get nodes from document."""
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        raise ImportError("bs4 is required to read HTML files.")

    text = node.get_content(metadata_mode=MetadataMode.NONE)
    soup = BeautifulSoup(text, "html.parser")
    html_nodes = []
    tags = soup.find_all(self.tags)

    # Build one node per tag found, instead of merging the text of
    # consecutive tags of the same type into a single node.
    for tag in tags:
        tag_text = self._extract_text_from_tag(tag)
        html_nodes.append(
            self._build_node_from_split(
                tag_text.strip(), node, {"tag": tag.name}
            )
        )

    return html_nodes
```
This modification will create a new node for each tag found, regardless of whether it's the same as the previous one. Please note that this is a simple modification and might not cover all use cases, so you might need to adjust it to fit your specific needs.
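As a quick sanity check, you can verify that consecutive `<p>` paragraphs now come back as separate nodes. This is a sketch assuming a llama_index layout where `HTMLNodeParser` is importable from `llama_index.node_parser` and `Document` from `llama_index.schema`:

```python
from llama_index.node_parser import HTMLNodeParser
from llama_index.schema import Document

html = (
    "<p>second trunk content, 1.xxxxx</p>"
    "<p>third trunk content, 2.xxxxx, 3.xxxxx</p>"
)

parser = HTMLNodeParser(tags=["p"])
nodes = parser.get_nodes_from_documents([Document(text=html)])

# With the modification above, this should print two separate nodes.
for n in nodes:
    print(repr(n.get_content()))
```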
You can find more details about this in the `llama_index/node_parser/file/html.py` and `llama_index/indices/common_tree/base.py` files.
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
-Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.