llama_index
[Bug]: cannot get document node
Bug Description
The get_document function cannot get any document nodes.
The following code uses the node_collections, which does not include any document node.
https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/docstore/keyval_docstore.py#L276
Version
master
Steps to Reproduce
from llama_index.core import Document, VectorStoreIndex
mini_index = VectorStoreIndex.from_documents([Document(text='hello world')])
docstore = mini_index.storage_context.docstore
docstore._kvstore.get_all(collection=docstore._ref_doc_collection)  # entries are keyed by document ID
docstore.get_document('*****replace with above doc id******')  # the document ID is not found
Relevant Logs/Tracebacks
No response
Hey @yingfhu, great to see you back! Hope you're doing well. 😄
Thank you for reporting this issue. The get_document function is designed to retrieve document nodes from a specified collection within a key-value store. If it's unable to retrieve any document nodes when using the node_collections, it could be due to several reasons:
- Non-existent Document ID: The doc_id provided to the get_document function does not exist within the _node_collection. This could happen if the document was never added to the collection or if it was removed.
- Incorrect Namespace or Collection Suffix: The namespace or node collection suffix used to construct the _node_collection might be incorrect or not match the actual location where the documents were stored. This discrepancy would lead to the function looking in the wrong collection.
- Data Storage Issues: There could be underlying issues with the key-value store's ability to retrieve data. This could be due to connectivity issues, permissions, or the store's internal errors.
- Empty or Corrupted Data: The document associated with the doc_id might be empty or corrupted in the key-value store, leading to a failure in retrieving or correctly deserializing the document.
To resolve the issue, ensure that the document ID exists within the correct collection, the namespace and collection suffix are correctly configured, and the key-value store is functioning properly. You can refer to the keyval_docstore.py file for more details.
Regarding your question about the structure and data type of the node_collections, I wasn't able to find an answer within the repository. It might be available elsewhere or I could have missed it.
The doc_id in the LlamaIndex repository is generated based on the node_id of a BaseNode object. The doc_id is not explicitly generated within the provided code but is directly used from the node_id attribute of the BaseNode instances that are being stored. The format of the doc_id is not explicitly defined in the provided context, implying it inherits the format of the node_id of a BaseNode.
The get_all function in the RedisKVStore class retrieves all key-value pairs from a specified collection in the Redis store. It does not directly interact with the get_document function. The get function, however, is present and serves a different purpose: it retrieves a single value from the store based on a provided key and collection name.
The VectorStoreIndex.from_documents method in the LlamaIndex repository is designed to load a set of documents and build an index from them. It's plausible that a get_document function, if it exists, would retrieve documents or their information from the index created by VectorStoreIndex.from_documents. However, without explicit documentation or context about get_document, any detailed explanation of interaction would be speculative.
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
Sources
- llama-index-core/llama_index/core/storage/docstore/keyval_docstore.py
- llama-index-legacy/llama_index/legacy/storage/kvstore/redis_kvstore.py
- docs/module_guides/indexing/vector_store_index.md
I know it's confusing, but the docstore only stores nodes (as well as metadata about each parent document).
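To make that layout concrete, here is a toy, pure-Python model of it. This is a sketch for illustration only, not the library's implementation: the class ToyDocstore, its method names, and the collection names "nodes" and "ref_docs" are all hypothetical. It shows why a lookup by the parent document's ID finds nothing among the nodes, while the per-document bookkeeping still lists the IDs of the nodes that came from it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocstore:
    """Toy stand-in for a key-value docstore (hypothetical, for illustration)."""

    # collection name -> {key -> value}
    collections: dict = field(
        default_factory=lambda: {"nodes": {}, "ref_docs": {}}
    )

    def add_nodes(self, ref_doc_id: str, chunks: list[str]) -> list[str]:
        """Store each chunk as a node and record its ID under the parent document."""
        node_ids = []
        for i, text in enumerate(chunks):
            node_id = f"{ref_doc_id}-node-{i}"
            self.collections["nodes"][node_id] = {
                "text": text,
                "ref_doc_id": ref_doc_id,
            }
            node_ids.append(node_id)
        # Only bookkeeping about the parent is kept, never its full text.
        self.collections["ref_docs"][ref_doc_id] = {"node_ids": node_ids}
        return node_ids

    def get_node(self, node_id: str) -> dict:
        """Look up a node; parent document IDs are not keys in this collection."""
        try:
            return self.collections["nodes"][node_id]
        except KeyError:
            raise ValueError(f"{node_id} not found in the node collection") from None

    def get_ref_doc_info(self, ref_doc_id: str) -> dict:
        """Return the bookkeeping entry (node IDs) for a parent document."""
        return self.collections["ref_docs"][ref_doc_id]


store = ToyDocstore()
node_ids = store.add_nodes("doc-1", ["hello", "world"])

try:
    store.get_node("doc-1")  # the parent ID is not a node ID
    found_parent = True
except ValueError:
    found_parent = False

# The parent's bookkeeping entry still leads to its nodes.
info = store.get_ref_doc_info("doc-1")
node_texts = [store.get_node(nid)["text"] for nid in info["node_ids"]]
```

In this model, the only route from a document ID to any text goes through the bookkeeping entry and then through the individual nodes.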
Thanks for your reply. By the way, is there any interface I can use to get the full document for a single node?
Once it's broken into nodes, not really (mostly to avoid storing duplicates of the same text).
You can store the name of the file in the metadata, though, to map back to it.
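A minimal sketch of that metadata approach, in plain Python rather than the LlamaIndex API: the helper names split_with_source and group_by_file, the "file_name" and "start" metadata keys, and the in-memory files dictionary are all assumptions made for illustration. Each chunk carries the name of its source file, so any single node can be mapped back to the original text without the store keeping a second copy of it.

```python
from collections import defaultdict


def split_with_source(file_name: str, text: str, chunk_size: int = 20) -> list[dict]:
    """Split text into fixed-size chunks, tagging each with its source file."""
    return [
        {
            "text": text[start:start + chunk_size],
            "metadata": {"file_name": file_name, "start": start},
        }
        for start in range(0, len(text), chunk_size)
    ]


def group_by_file(nodes: list[dict]) -> dict[str, list[dict]]:
    """Index nodes by the file name stored in their metadata."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["metadata"]["file_name"]].append(node)
    return dict(grouped)


# Hypothetical in-memory "files" standing in for documents on disk.
files = {
    "a.txt": "hello world, this is file a",
    "b.txt": "a second, unrelated file",
}
nodes = [
    node
    for name, text in files.items()
    for node in split_with_source(name, text, chunk_size=10)
]

by_file = group_by_file(nodes)

# Starting from a single node, follow its metadata back to the source.
some_node = nodes[1]
source_name = some_node["metadata"]["file_name"]
full_text = files[source_name]
```

The full text lives only in the original file; the nodes hold a pointer to it in their metadata, which avoids the duplicate storage mentioned above.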