Indexing of long text documents is tricky

Open tommykoctur opened this issue 2 years ago • 6 comments

Hello,

My use case is search over long text documents. Each document is split into chunks (let's say sentences), and each chunk has its own embedding; the root document has no embedding. I cannot index these documents with the annlite indexer because the root document's embedding is missing, so only the chunks can be indexed. If I store the root documents directly in LMDB via self._index.doc_store(0).insert(root_docs), loading the query flow throws this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.

The 10 is the total number of documents: 5 root docs plus 5 chunks (dummy data).

Can you please help me? Thanks.

tommykoctur avatar Jul 19 '22 12:07 tommykoctur

We are working on a feature that will allow the user to have multiple indices and sub_indices around the same DocArray API. I think this could be useful for you.

JoanFM avatar Jul 19 '22 12:07 JoanFM

We are working on a feature that will allow the user to have multiple indices and sub_indices around the same DocArray API. I think this could be useful for you.

I don't know yet what it will look like. But the Document's nested structure (chunks are sentences from a long text) is suitable for this case; the annlite indexer just doesn't accept documents without embeddings, even when they only need to be stored and not indexed.

tommykoctur avatar Jul 19 '22 12:07 tommykoctur

In this case you would need your own version of AnnLiteIndexer that indexes the different parts in different DocArrays. But yes, the current implementation does not work for this.

JoanFM avatar Jul 19 '22 12:07 JoanFM

In this case you would need your own version of AnnLiteIndexer that indexes the different parts in different DocArrays. But yes, the current implementation does not work for this.

Could you please explain how sub_indices would work? When do you plan to implement it?

Thanks

tommykoctur avatar Jul 19 '22 12:07 tommykoctur

@tommykoctur The subindex has been released. https://docarray.jina.ai/fundamentals/documentarray/subindex/

numb3r3 avatar Aug 23 '22 10:08 numb3r3

Thank you, but I don't think that this would help me. I would probably add another LMDB to store root doc information to save some space.
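
For anyone landing here with the same problem, a rough sketch of that idea: keep only the chunks (which have embeddings) in the vector index, and keep root doc information in a separate key-value store keyed by the root id. Everything below is hypothetical and not annlite or DocArray API: the `Chunk` and `ChunkSearchWithRootStore` names are made up, a plain dict stands in for the second LMDB, and a brute-force cosine scan stands in for the ANN index.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    """One sentence-level chunk; every chunk carries an embedding."""
    chunk_id: str
    parent_id: str  # id of the root document this chunk came from
    text: str
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ChunkSearchWithRootStore:
    """Hypothetical helper: chunks are searchable, root docs are only stored."""

    def __init__(self) -> None:
        # Stand-in for a second LMDB: root id -> root doc information.
        self.root_store: Dict[str, dict] = {}
        # Stand-in for the vector index: only documents WITH embeddings go here.
        self.chunks: List[Chunk] = []

    def add_root(self, root_id: str, info: dict) -> None:
        """Store a root doc without an embedding; it is never indexed."""
        self.root_store[root_id] = info

    def add_chunk(self, chunk: Chunk) -> None:
        """Index a chunk; its root doc must already be stored."""
        if chunk.parent_id not in self.root_store:
            raise KeyError(f"unknown root doc: {chunk.parent_id}")
        self.chunks.append(chunk)

    def search(self, query: List[float], limit: int = 3) -> List[Tuple[Chunk, dict]]:
        """Return the best-matching chunks together with their root doc info."""
        ranked = sorted(
            self.chunks,
            key=lambda c: cosine(query, c.embedding),
            reverse=True,
        )
        return [(c, self.root_store[c.parent_id]) for c in ranked[:limit]]
```

With this split the index never sees a document without an embedding, and each hit can still be mapped back to its root doc through `parent_id`.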

tommykoctur avatar Aug 23 '22 11:08 tommykoctur