
Late interaction model dimensions

ahmedtalbi opened this issue 10 months ago · 3 comments

Hello,

First of all, nice work!

I have been trying to understand the output shapes of the ColBERT models. As far as I have seen, colbert-ir/colbertv2.0 has a dimension of (n_of_tokens - 1) x 768, and the Jina model jinaai/jina-colbert-v2 has a dimension of (n_of_tokens - 1) x 1024.

In the fastembed library, by comparison, the dimension for both models is n_of_tokens x 128.

The example in the provided notebook is about ColPali, and there too the embedding dimension is 128. Is there a compression step missing, or am I missing something else?
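
For reference, here is roughly where the 768 vs. 128 gap comes from, if I understand ColBERT correctly (a minimal sketch: loading the checkpoint's actual projection weights is omitted, and the layer name differs between repos):

import torch
from transformers import AutoModel, AutoTokenizer

# Plain transformers only runs the backbone, so you get 768-dim hidden states:
tokenizer = AutoTokenizer.from_pretrained("colbert-ir/colbertv2.0")
model = AutoModel.from_pretrained("colbert-ir/colbertv2.0")

inputs = tokenizer("hello world", return_tensors="pt")
hidden = model(**inputs).last_hidden_state  # (1, n_of_tokens, 768)

# ColBERT-style checkpoints additionally carry a linear projection that maps
# each token vector down to the 128-dim late-interaction space. (Randomly
# initialized here; in practice its weights come from the checkpoint.)
projection = torch.nn.Linear(768, 128, bias=False)
token_embeddings = projection(hidden)  # (1, n_of_tokens, 128)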

thanks!

ahmedtalbi · Feb 21 '25 20:02

Update: As far as I understand, the current implementation using the optimum engine does not support the late-interaction mechanism.

Looking at the implementation in fastembed, in post-processing each embedding vector of the output is normalized along the single-embedding dimension -> axis=2:

def _post_process_onnx_output(
    self, output: OnnxOutputContext, is_doc: bool = True
) -> Iterable[np.ndarray]:
    if not is_doc:
        # Query embeddings are returned as-is.
        return output.model_output.astype(np.float32)

    if output.input_ids is None or output.attention_mask is None:
        raise ValueError(
            "input_ids and attention_mask must be provided for document post-processing"
        )

    # Mask out skip-list tokens (e.g. punctuation) and padding.
    for i, token_sequence in enumerate(output.input_ids):
        for j, token_id in enumerate(token_sequence):
            if token_id in self.skip_list or token_id == self.pad_token_id:
                output.attention_mask[i, j] = 0

    output.model_output *= np.expand_dims(output.attention_mask, 2).astype(np.float32)
    # L2-normalize every token vector along the embedding dimension (axis=2).
    norm = np.linalg.norm(output.model_output, ord=2, axis=2, keepdims=True)
    norm_clamped = np.maximum(norm, 1e-12)
    output.model_output /= norm_clamped
    return output.model_output.astype(np.float32)

In the current optimum engine implementation, a CLS or mean pooling is applied first, and the pooled vectors are then normalized -> axis=1:

@quant_embedding_decorator()
def encode_post(self, embedding: dict) -> EmbeddingReturnType:
    # Pooling collapses the token dimension -> one vector per input.
    embedding = self.pooling(  # type: ignore
        embedding["token_embeddings"], embedding["attention_mask"]
    )
    return normalize(embedding).astype(np.float32)
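
To make the difference concrete, here is a small numpy sketch of the two paths (hypothetical shapes, not infinity's or fastembed's actual code):

import numpy as np

token_embeddings = np.random.rand(2, 5, 128).astype(np.float32)  # (batch, n_of_tokens, dim)

# Dense path (current optimum engine): pool over tokens first, then
# normalize each pooled vector -> one vector per document.
pooled = token_embeddings.mean(axis=1)                          # (batch, dim)
dense = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)  # (batch, dim)

# Late-interaction path (fastembed): keep every token vector and
# L2-normalize each one along the embedding dimension -> axis=2.
norms = np.linalg.norm(token_embeddings, axis=2, keepdims=True)  # (batch, n_of_tokens, 1)
multi = token_embeddings / np.maximum(norms, 1e-12)              # (batch, n_of_tokens, dim)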

The solution I found for now is to add an encode_post_late_interaction function and to pass an option for late-interaction models when starting the server.
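
Roughly what I have in mind (a sketch only; encode_post_late_interaction and the option name are placeholders I am proposing, not existing infinity APIs):

def encode_post_late_interaction(self, embedding: dict) -> EmbeddingReturnType:
    # No pooling: keep one vector per token.
    token_embeddings = embedding["token_embeddings"]
    attention_mask = embedding["attention_mask"]

    # Zero out padding tokens, then L2-normalize each token vector
    # along the embedding dimension (axis=2), mirroring fastembed.
    token_embeddings = token_embeddings * np.expand_dims(attention_mask, 2).astype(np.float32)
    norms = np.linalg.norm(token_embeddings, ord=2, axis=2, keepdims=True)
    return (token_embeddings / np.maximum(norms, 1e-12)).astype(np.float32)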

Let me know if I am missing something here. If I am right, I can also create a PR to add those changes.

ahmedtalbi · Feb 26 '25 12:02

I have the same confusion

https://github.com/michaelfeil/infinity/issues/559

irelance · Mar 17 '25 16:03

@michaelfeil any feedback?

ahmedtalbi · Mar 20 '25 12:03