Why do we need both docstore and index_struct? They seem very similar.

Open ymansurozer opened this issue 2 years ago • 4 comments

Hi there! I am sorry if this is a rookie question, but when I inspect the docstore and index_struct, they seem very similar and a bit duplicated. Index file size is a big concern for me, so I was wondering why we need both of them.

I tried deleting them one by one, but then querying does not work at all. Both seem to be required.

ymansurozer avatar Feb 04 '23 11:02 ymansurozer

@ymansurozer we need the docstore for storing multiple index structs (say, in the case of composability). For single indices we don't need the docstore. That said, the docstore should just store a reference to the same index struct, so there shouldn't be duplicates! (I think) Have you experienced issues here?
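One way a shared reference can still end up duplicated on disk is sketched below. This illustrates how Python's standard `json` module behaves in general. It is not llama_index's actual save code, and the dictionaries `index_struct` and `docstore` are hypothetical stand-ins rather than the library's classes.

```python
import json

# Hypothetical stand-in for an index struct; NOT the llama_index class.
index_struct = {
    "doc_id": "index-1",
    "nodes_dict": {"0": {"text": "Contains Document content"}},
}

# In memory, the docstore holds a reference to the very same object.
docstore = {"docs": {index_struct["doc_id"]: index_struct}}
assert docstore["docs"]["index-1"] is index_struct  # one object, two names

# json.dumps has no notion of shared references: every path to the object
# is written out in full, so the text lands in the file twice.
payload = json.dumps({"index_struct": index_struct, "docstore": docstore})
copies = payload.count("Contains Document content")
print(copies)
```

So a structure with no duplicates in memory can still serialize with duplicates, unless the writer stores an ID in place of the second copy.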

jerryjliu avatar Feb 06 '23 07:02 jerryjliu

@jerryjliu Ah, I see, and that makes perfect sense. But still, when you get embeddings for large bodies of text, there is some duplication (or am I doing something wrong?). Here is a Faiss index file (not core) I've just created from a single file containing the text 'Contains Document content':

{
    "index_struct": {
        "text": null,
        "doc_id": "c0fe2a1e-a69f-4e10-bd6f-07472df43685",
        "embedding": null,
        "extra_info": null,
        "nodes_dict": {
            "8407316112998839724": {
                "text": "Contains Document content",
                "doc_id": null,
                "embedding": null,
                "extra_info": null,
                "index": 0,
                "child_indices": [],
                "ref_doc_id": "e150af44-ea2f-4d62-93b0-98e97c9d9532",
                "node_info": { "start": 0, "end": 25 }
            }
        },
        "id_map": { "0": 8407316112998839724 }
    },
    "docstore": {
        "docs": {
            "e150af44-ea2f-4d62-93b0-98e97c9d9532": {
                "text": "Contains Document content",
                "doc_id": "e150af44-ea2f-4d62-93b0-98e97c9d9532",
                "embedding": null,
                "extra_info": null,
                "__type__": "Document"
            },
            "c0fe2a1e-a69f-4e10-bd6f-07472df43685": {
                "text": null,
                "doc_id": "c0fe2a1e-a69f-4e10-bd6f-07472df43685",
                "embedding": null,
                "extra_info": null,
                "nodes_dict": {
                    "8407316112998839724": {
                        "text": "Contains Document content",
                        "doc_id": null,
                        "embedding": null,
                        "extra_info": null,
                        "index": 0,
                        "child_indices": [],
                        "ref_doc_id": "e150af44-ea2f-4d62-93b0-98e97c9d9532",
                        "node_info": { "start": 0, "end": 25 }
                    }
                },
                "id_map": { "0": 8407316112998839724 },
                "__type__": "dict"
            }
        }
    }
}

You can see 'Contains Document content' is repeated three times. In SimpleVectorIndex, the embeddings are repeated three times, too. The ID map should be sufficient to keep references, so I thought there should be a way around this. But now that you say there shouldn't be duplicates, I'm wondering if I'm doing something wrong. :)
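A rough way to check the repetition count on a saved index file is to walk the parsed JSON and tally every "text" value. The helper `count_text_values` below is a hypothetical name, not part of llama_index, and the sample is an abridged copy of the structure shown above with most fields left out.

```python
import json
from collections import Counter


def count_text_values(obj, counts=None):
    """Recursively tally every string stored under a "text" key."""
    if counts is None:
        counts = Counter()
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "text" and isinstance(value, str):
                counts[value] += 1
            else:
                # Null "text" fields and all other keys are walked normally.
                count_text_values(value, counts)
    elif isinstance(obj, list):
        for item in obj:
            count_text_values(item, counts)
    return counts


# Abridged version of the index file from the comment above.
sample = json.loads("""
{
  "index_struct": {
    "text": null,
    "nodes_dict": {"8407316112998839724": {"text": "Contains Document content"}}
  },
  "docstore": {
    "docs": {
      "e150af44-ea2f-4d62-93b0-98e97c9d9532": {
        "text": "Contains Document content",
        "__type__": "Document"
      },
      "c0fe2a1e-a69f-4e10-bd6f-07472df43685": {
        "text": null,
        "nodes_dict": {"8407316112998839724": {"text": "Contains Document content"}},
        "__type__": "dict"
      }
    }
  }
}
""")

counts = count_text_values(sample)
for text, n in counts.items():
    print(f"{n}x {text!r}")
```

For a real file, swapping the inline sample for `json.load(open(path))` should give a quick estimate of how much of the file size comes from repeated text.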

ymansurozer avatar Feb 06 '23 09:02 ymansurozer

@jerryjliu Just wanted to follow up and ask if you have any ideas about this because removing this duplication would drastically reduce index file sizes.

ymansurozer avatar Feb 12 '23 21:02 ymansurozer

Apologies for the delayed response. Yes, you're right. This is something I'll try to look into tonight and tomorrow.

jerryjliu avatar Feb 17 '23 03:02 jerryjliu