
feat: combine embeddings

JohannesMessner opened this issue 3 years ago · 3 comments

Right now it is painful to combine embeddings from different nesting levels and set the result at the top level, especially until https://github.com/jina-ai/docarray/issues/461 is solved.

This method should do the following: it takes an access path (or a list of access paths?), e.g. `@.[image, main_text]`, and combines the embeddings of the Documents at those paths using one of `'sum'`, `'mean'`, `'concat'`, or a user-provided model.
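For illustration, here is a minimal numpy sketch of what the predefined combiners would compute, with the proposed call shown as a comment (the method name `combine_embeddings` and its parameters are placeholders taken from this issue, not an existing docarray API):

```python
import numpy as np

# What the predefined combiners would compute, given the chunk
# embeddings selected by the access path (here two 4-dim vectors):
image_emb = np.array([1.0, 2.0, 3.0, 4.0])
text_emb = np.array([5.0, 6.0, 7.0, 8.0])
embs = [image_emb, text_emb]

summed = np.sum(embs, axis=0)           # -> [6., 8., 10., 12.]
mean = np.mean(embs, axis=0)            # -> [3., 4., 5., 6.]
concat = np.concatenate(embs, axis=-1)  # -> 8-dim vector

# Hypothetical call on a DocumentArray `da` whose docs have `image`
# and `main_text` fields (names as in the issue; the method name is
# still under discussion, see the rename ToDo below):
# da.combine_embeddings('@.[image, main_text]', combiner='mean')
#
# A user-provided callable (or model) should work the same way:
# da.combine_embeddings('@.[image, main_text]',
#                       combiner=lambda e: 0.7 * e[0] + 0.3 * e[1])
```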

It should handle the following cases:

  • [x] numpy embedding / predefined combiner
    • [x] test
  • [x] torch embedding / predefined combiner
    • [x] test
  • [ ] tf embedding / predefined combiner
    • [ ] test
  • [ ] paddle embedding / predefined combiner
    • [ ] test

  • [x] torch embedding model (see the sketch after this list)
    • [ ] test
  • [ ] tf embedding model
    • [ ] test
  • [ ] paddle embedding model
    • [ ] test
  • [ ] onnx embedding model
    • [ ] test

  • [x] numpy embedding / callable
    • [ ] test
  • [x] torch embedding / callable
    • [ ] test
  • [x] tf embedding / callable
    • [ ] test
  • [x] paddle embedding / callable
    • [ ] test

Other ToDos:

  • [ ] docs
  • [ ] rename the method: fuse_embeddings()?
  • [ ] refactor: the current code is quite ugly
  • [ ] refactor examples in the docs

Possible follow-up PRs:

  • [ ] enable a `uniform_nesting` flag which tells us that every doc in the da has the same number of relevant chunks; this would allow us to vectorize the combine operation (see the sketch after this list)
  • [ ] implement to_numpy
  • [ ] implement flag to discard chunk embeddings after root embedding has been set
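A minimal numpy sketch of why uniform nesting enables vectorization (the shapes are assumptions for illustration):

```python
import numpy as np

n_docs, n_chunks, dim = 10, 2, 128  # assumed shapes

# Without uniform nesting, each doc may have a different number of
# relevant chunks, forcing a Python-level loop. With uniform nesting,
# all chunk embeddings stack into one (n_docs, n_chunks, dim) tensor...
chunk_embs = np.random.rand(n_docs, n_chunks, dim)

# ...and each predefined combiner becomes a single vectorized reduction:
mean_embs = chunk_embs.mean(axis=1)           # (n_docs, dim)
sum_embs = chunk_embs.sum(axis=1)             # (n_docs, dim)
concat_embs = chunk_embs.reshape(n_docs, -1)  # (n_docs, n_chunks * dim)
```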

Closes #512

JohannesMessner · Sep 08 '22 07:09