
[Bug]: cannot get document node

Open yingfhu opened this issue 11 months ago • 4 comments

Bug Description

The get_document function cannot retrieve any document nodes. The following code looks only in the node collection, which does not include any document nodes: https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/storage/docstore/keyval_docstore.py#L276

Version

master

Steps to Reproduce

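from llama_index.core import Document, VectorStoreIndex
mini_index = VectorStoreIndex.from_documents([Document(text='hello world')])
# List the ref-doc entries to find the document id
mini_index.storage_context.docstore._kvstore.get_all(collection=mini_index.storage_context.docstore._ref_doc_collection)
# Look up the document id found above
mini_index.storage_context.docstore.get_document('*****replace with above doc id******')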

Relevant Logs/Tracebacks

No response

yingfhu, Mar 22 '24 10:03

Hey @yingfhu, great to see you back! Hope you're doing well. 😄

Thank you for reporting this issue. The get_document function is designed to retrieve document nodes from a specified collection within a key-value store. If it's unable to retrieve any document nodes when using the node_collections, it could be due to several reasons:

  1. Non-existent Document ID: The doc_id provided to the get_document function does not exist within the _node_collection. This could happen if the document was never added to the collection or if it was removed.

  2. Incorrect Namespace or Collection Suffix: The namespace or node collection suffix used to construct the _node_collection might be incorrect or not match the actual location where the documents were stored. This discrepancy would lead to the function looking in the wrong collection.

  3. Data Storage Issues: There could be underlying issues with the key-value store's ability to retrieve data. This could be due to connectivity issues, permissions, or the store's internal errors.

  4. Empty or Corrupted Data: The document associated with the doc_id might be empty or corrupted in the key-value store, leading to a failure in retrieving or correctly deserializing the document.

To resolve the issue, ensure that the document ID exists within the correct collection, the namespace and collection suffix are correctly configured, and the key-value store is functioning properly. You can refer to the keyval_docstore.py file for more details.

Regarding your question about the structure and data type of the node_collections, I wasn't able to find an answer within the repository. It might be available elsewhere or I could have missed it.

The doc_id in the LlamaIndex repository is generated based on the node_id of a BaseNode object. The doc_id is not explicitly generated within the provided code but is directly used from the node_id attribute of the BaseNode instances that are being stored. The format of the doc_id is not explicitly defined in the provided context, implying it inherits the format of the node_id of a BaseNode.

The get_all function in the RedisKVStore class retrieves all key-value pairs from a specified collection in the Redis store. It does not directly interact with the get_document function. The get function is also present and serves a different purpose: it retrieves a single value from the store based on a provided key and collection name.

The VectorStoreIndex.from_documents method in the LlamaIndex repository is designed to load a set of documents and build an index from them. It's plausible that a get_document function, if it exists, would retrieve documents or their information from the index created by VectorStoreIndex.from_documents. However, without explicit documentation or context about get_document, any detailed explanation of interaction would be speculative.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.


dosubot[bot], Mar 22 '24 10:03

I know it's confusing, but the docstore only stores nodes (as well as metadata about each parent document).
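To make that layout concrete, here is a minimal toy model in plain Python. It is a sketch, not llama_index's implementation: the ToyDocstore class, its two dictionaries, and the get_nodes_for_ref_doc helper are hypothetical names chosen for illustration. It only shows why a lookup that searches the node collection misses when given a parent document id, while the parent-document bookkeeping can still lead back to the nodes.

```python
# Toy model of a key-value docstore. All names here are illustrative
# assumptions, not llama_index's real classes or methods.
from dataclasses import dataclass, field


@dataclass
class ToyDocstore:
    # node_id -> chunk text (the only place node lookups search)
    node_collection: dict = field(default_factory=dict)
    # parent doc id -> ids of the nodes derived from it (bookkeeping only)
    ref_doc_collection: dict = field(default_factory=dict)

    def add_nodes(self, ref_doc_id, chunks):
        """Store each chunk as a node and record which parent it came from."""
        node_ids = []
        for i, text in enumerate(chunks):
            node_id = f"{ref_doc_id}-node-{i}"
            self.node_collection[node_id] = text
            node_ids.append(node_id)
        self.ref_doc_collection[ref_doc_id] = node_ids

    def get_document(self, doc_id):
        """Search the node collection only, so a parent doc id finds nothing."""
        return self.node_collection.get(doc_id)

    def get_nodes_for_ref_doc(self, ref_doc_id):
        """Follow the parent-document bookkeeping to the stored nodes."""
        node_ids = self.ref_doc_collection.get(ref_doc_id, [])
        return [self.node_collection[n] for n in node_ids]


store = ToyDocstore()
store.add_nodes("doc-1", ["hello", "world"])

missing = store.get_document("doc-1")          # parent id: not a node key
chunks = store.get_nodes_for_ref_doc("doc-1")  # nodes reached via bookkeeping
print(missing, chunks)
```

In this model the parent document's text is never stored on its own, which mirrors the point above: only the nodes and a record of which parent they belong to are kept.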

logan-markewich, Mar 22 '24 13:03

Thanks for your reply. BTW, is there any interface I can use to get the full document for a single node?

yingfhu, Mar 25 '24 02:03

Once it's broken into nodes, not really (mostly to avoid storing duplicates of the same text).

You can store the name of the file in the metadata, though, to map back to it.
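A rough sketch of that idea in plain Python, assuming the source documents live on disk as text files. The split_with_source and full_document_for helpers and the dictionary-shaped nodes are hypothetical stand-ins, not llama_index objects; the point is only that each chunk carries a file_name entry in its metadata, so the full text can be re-read from the original file instead of being stored twice.

```python
import tempfile
from pathlib import Path


def split_with_source(path, chunk_size=20):
    """Split a file into fixed-size chunks, tagging each with its source file."""
    text = Path(path).read_text(encoding="utf-8")
    return [
        {"text": text[i:i + chunk_size], "metadata": {"file_name": str(path)}}
        for i in range(0, len(text), chunk_size)
    ]


def full_document_for(node):
    """Map a single node back to its full document via the stored file name."""
    return Path(node["metadata"]["file_name"]).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "notes.txt"
    source.write_text(
        "hello world, this is a longer document split into nodes",
        encoding="utf-8",
    )
    nodes = split_with_source(source)
    # Any node, here the last one, is enough to recover the whole document.
    recovered = full_document_for(nodes[-1])
```

This trades storage for a dependency on the original file staying available at the recorded path.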

logan-markewich, Mar 25 '24 03:03