annlite
Indexing of long text documents is tricky
Hello,
My use case is search in long text documents.
Documents are split into chunks (let's say sentences), and each chunk has its own embedding. The root document has no embedding.
I am not able to index the documents with the annlite indexer because the root document has no embedding; only the chunks can be indexed.
If I store the root documents directly in LMDB via self._index.doc_store(0).insert(root_docs),
the query flow throws this error when it loads:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.
Here 10 is the total number of documents (5 root docs plus 5 chunks, dummy data).
Can you please help me? Thanks.
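One plausible reading of the error above is that the loader tries to stack all 10 stored embeddings into a single matrix, and the 5 root docs without embeddings make the rows inhomogeneous. The sketch below is a library-independent way to check for that before inserting. It does not use the annlite or DocArray API: documents are modelled as plain dicts, and `find_unstackable` is a hypothetical helper name.

```python
from typing import List, Optional, Sequence


def find_unstackable(docs: Sequence[dict]) -> List[str]:
    """Return ids of docs whose embedding is missing or has a different length.

    Each doc is a plain dict {"id": str, "embedding": list or None}.
    This is an assumed stand-in for a real Document, not the DocArray type.
    """
    expected_dim: Optional[int] = None
    bad_ids: List[str] = []
    for doc in docs:
        emb = doc.get("embedding")
        if emb is None:
            # No embedding at all: cannot become a row of the matrix.
            bad_ids.append(doc["id"])
            continue
        if expected_dim is None:
            # The first embedding seen defines the expected dimension.
            expected_dim = len(emb)
        elif len(emb) != expected_dim:
            bad_ids.append(doc["id"])
    return bad_ids


# Dummy data mirroring the report: 5 root docs without embeddings, 5 chunks with.
roots = [{"id": f"root{i}", "embedding": None} for i in range(5)]
chunks = [{"id": f"chunk{i}", "embedding": [0.0, 1.0, 2.0]} for i in range(5)]

bad = find_unstackable(roots + chunks)
print(bad)
```

If the list is non-empty, those documents would have to be kept out of the store that the indexer reads embeddings from.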
We are working on a feature that will allow the user to have multiple indices and sub_indices around the same DocArray API. I think this could be useful for you?
I don't know yet what it will look like. But the Document's nested structure (chunks are sentences from the long text) is suitable for this case; the annlite indexer just doesn't allow documents without embeddings to be indexed (that is, simply stored).
In this case you would need your own version of AnnLiteIndexer that indexes different parts in different DocArrays. But yes, the current implementation does not support this.
Could you please explain how sub_indices would work? When do you plan to implement it?
Thanks
@tommykoctur The subindex has been released. https://docarray.jina.ai/fundamentals/documentarray/subindex/
Thank you, but I don't think that this would help me. I would probably add another LMDB to store root doc information to save some space.
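The idea of keeping root doc information in a separate store could look roughly like the sketch below. Everything in it is an assumption made for illustration: a plain dict stands in for the second LMDB, a brute-force cosine search (`ChunkIndex`, a hypothetical class) stands in for the annlite index, and each chunk carries the id of its root document so that a hit can be resolved back to its parent.

```python
import math
from typing import Dict, List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ChunkIndex:
    """Brute-force stand-in for an ANN index; holds only chunks, never roots."""

    def __init__(self) -> None:
        # Each item is (chunk_id, parent_id, embedding).
        self._items: List[Tuple[str, str, List[float]]] = []

    def add(self, chunk_id: str, parent_id: str, embedding: List[float]) -> None:
        self._items.append((chunk_id, parent_id, embedding))

    def search(self, query: List[float], limit: int = 3) -> List[Tuple[float, str, str]]:
        """Return up to `limit` (score, chunk_id, parent_id), best score first."""
        scored = [(_cosine(query, emb), cid, pid) for cid, pid, emb in self._items]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


# Root documents live in their own key-value store (a dict here, assumed to be
# replaced by a separate LMDB environment in practice). No embeddings needed.
root_store: Dict[str, dict] = {
    "doc1": {"title": "Long report"},
    "doc2": {"title": "Short memo"},
}

index = ChunkIndex()
index.add("doc1#0", "doc1", [1.0, 0.0])
index.add("doc1#1", "doc1", [0.9, 0.1])
index.add("doc2#0", "doc2", [0.0, 1.0])

# Search over chunks only, then resolve the best hit to its root document.
score, chunk_id, parent_id = index.search([1.0, 0.0], limit=1)[0]
root = root_store[parent_id]
print(chunk_id, root["title"])
```

With this split the index only ever sees documents that have embeddings, and the root store holds whatever metadata is needed per long document.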