llama_index
doc_id missing from source_nodes in GPTQdrantIndex queries
If I query against GPTSimpleVectorIndex, the response has source nodes that can be looked at to determine the original document id. Very useful behavior.
If I query against GPTQdrantIndex, source nodes don't have this information.
Looking at the GPTQdrantIndex creation code, though, doc_id does seem to enter the index as payload.
The problem is probably here:
@dataclass
class SourceNode(DataClassJsonMixin):
    """Source node.

    User-facing class containing the source text and the corresponding document id.
    """

    source_text: str
    doc_id: Optional[str]
    extra_info: Optional[Dict[str, Any]] = None
    node_info: Optional[Dict[str, Any]] = None
    # distance score between node and query, if applicable
    similarity: Optional[float] = None

    @classmethod
    def from_node(cls, node: Node, similarity: Optional[float] = None) -> "SourceNode":
        """Create a SourceNode from a Node."""
        from IPython import embed; embed()
        return cls(
            source_text=node.get_text(),
            doc_id=node.ref_doc_id,
            extra_info=node.extra_info,
            node_info=node.node_info,
            similarity=similarity,
        )
SourceNode.doc_id is filled from node.ref_doc_id, not from node.doc_id.
Why does the behavior differ between indexes?
@kacperlukawski are you able to help with this by any chance?
Actually, I took a look at GPTQdrantIndexQuery, and I think it's a matter of swapping
node = Node(
    doc_id=payload.get("doc_id"),
    text=payload.get("text"),
)
for
node = Node(
    ref_doc_id=payload.get("doc_id"),
    text=payload.get("text"),
)
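To illustrate why the swap matters, here is a minimal sketch that uses stand-in classes rather than the real llama_index ones. The simplified Node and SourceNode below are assumptions that keep only the fields discussed in this thread, and the payload values are made up. The only behavior taken from the code quoted above is that from_node reads node.ref_doc_id.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for the library's Node (only the relevant fields)."""

    text: Optional[str] = None
    doc_id: Optional[str] = None      # assumed: the node's own id
    ref_doc_id: Optional[str] = None  # assumed: id of the originating document


@dataclass
class SourceNode:
    """Hypothetical stand-in that mirrors the from_node logic quoted above."""

    source_text: Optional[str]
    doc_id: Optional[str]

    @classmethod
    def from_node(cls, node: Node) -> "SourceNode":
        # As in the quoted code, doc_id is taken from node.ref_doc_id.
        return cls(source_text=node.text, doc_id=node.ref_doc_id)


# Made-up payload shaped like the one read in the query code.
payload: Dict[str, Any] = {"doc_id": "doc-123", "text": "hello"}

# Current mapping: the payload's doc_id lands in Node.doc_id.
before = SourceNode.from_node(
    Node(doc_id=payload.get("doc_id"), text=payload.get("text"))
)

# Proposed mapping: the payload's doc_id lands in Node.ref_doc_id.
after = SourceNode.from_node(
    Node(ref_doc_id=payload.get("doc_id"), text=payload.get("text"))
)

print(before.doc_id)  # None
print(after.doc_id)   # doc-123
```

With the first mapping the document id is present on the node but never reaches the source node, which matches the missing doc_id reported here.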
That seems to be it! I made a small PR :)
thanks @Mikkolehtimaki !!