Michael Feil

125 comments by Michael Feil

Hey @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any roadblocks with regard to encoder-only architectures? Will the GEMM kernels work for non-causal masked...

@gchhablani I am _relatively_ confident the following quantization code should do the trick.

```python
class WeightOnlyInt8Linear(Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor
    bias: Tensor
    scales: Tensor
    ...
```
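Since the quoted class is cut off, here is a minimal sketch of the general idea behind weight-only int8 quantization, written in plain Python rather than with tensors. It is not the code from the comment: the function names `quantize_weight_int8` and `int8_linear` are hypothetical, and symmetric per-output-channel scaling is an assumption about how the `scales` field might be used.

```python
# Illustrative sketch only: symmetric, per-output-channel int8 weight quantization.
# Names and the scaling scheme are assumptions, not taken from the quoted comment.

def quantize_weight_int8(weight):
    """Quantize a 2D weight (one row per output feature) to int8 values plus scales."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # One scale per output feature; fall back to 1.0 for an all-zero row.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(x, q_weight, scales, bias=None):
    """Compute y = x @ W^T + b, applying each row's scale after the dot product."""
    out = []
    for j, (q_row, scale) in enumerate(zip(q_weight, scales)):
        acc = sum(xi * qi for xi, qi in zip(x, q_row))
        y = acc * scale
        if bias is not None:
            y += bias[j]
        out.append(y)
    return out


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
bias = [0.0, 1.0]
x = [1.0, 2.0, 3.0]

q_weight, scales = quantize_weight_int8(weight)
y_quant = int8_linear(x, q_weight, scales, bias)
# Full-precision reference for comparison.
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weight, bias)]
```

Only the weights are stored in int8 here; activations stay in floating point, which is what "weight-only" usually refers to. The quantized output should stay close to the reference, with an error bounded by the rounding step of each row.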

Yeah, you need an onnx model. https://huggingface.co/Xenova/all-MiniLM-L6-v2

Does this work, @netw0rkf10w?

In this case, you're starting and stopping the engine, which is likely what takes the most time. Instead of using `async with`, you can also call `engine.astart()` and `engine.astop()`.
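A rough sketch of the difference, using only `asyncio`. `DummyEngine` is a hypothetical stand-in and not the infinity engine; only the method names `astart()` and `astop()` come from the comment, and everything else (including `embed`) is assumed for illustration.

```python
import asyncio


class DummyEngine:
    """Hypothetical stand-in for an async engine; not the infinity API."""

    def __init__(self):
        self.starts = 0
        self.running = False

    async def astart(self):
        await asyncio.sleep(0.01)  # simulate an expensive startup
        self.starts += 1
        self.running = True

    async def astop(self):
        await asyncio.sleep(0)
        self.running = False

    async def __aenter__(self):
        await self.astart()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.astop()

    async def embed(self, sentences):
        assert self.running, "engine must be started first"
        return [[float(len(s))] for s in sentences]  # fake embeddings


async def restart_per_request(engine, requests):
    """Pays the startup and shutdown cost on every request."""
    results = []
    for sentences in requests:
        async with engine:
            results.append(await engine.embed(sentences))
    return results


async def start_once(engine, requests):
    """Starts the engine once, serves all requests, then stops it."""
    await engine.astart()
    try:
        return [await engine.embed(sentences) for sentences in requests]
    finally:
        await engine.astop()


requests = [["a"], ["bb", "ccc"], ["dddd"]]
slow_engine, fast_engine = DummyEngine(), DummyEngine()
slow_results = asyncio.run(restart_per_request(slow_engine, requests))
fast_results = asyncio.run(start_once(fast_engine, requests))
```

Both variants return the same results, but the first starts the engine once per request while the second starts it a single time.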

Updated the docs and the readme, @netw0rkf10w! Note that it should not be significantly faster for 1 embedding with 1 short sentence. Expect significant speedups for large batches /...

Assuming this can be closed, as there is no further activity?

@stephenleo There are two things to check here. 1. infinity normalizes your encodings. For a good reason: the magnitude of the embeddings is mostly irrelevant, and will likely lead to...
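To illustrate the normalization point, here is a small sketch in plain Python. The helper names are hypothetical, and it only shows the general property that L2-normalized vectors discard magnitude so that a dot product equals cosine similarity; it does not reproduce infinity's internals.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, ten times the magnitude
c = [4.0, 3.0]

a_n, b_n, c_n = l2_normalize(a), l2_normalize(b), l2_normalize(c)

# After normalization, a and b are (numerically) the same vector,
# and the plain dot product matches the cosine similarity of the raw vectors.
same_direction = all(abs(x - y) < 1e-12 for x, y in zip(a_n, b_n))
dot_of_normalized = dot(a_n, c_n)
cosine_of_raw = cosine_similarity(a, c)
```

If two libraries return different raw values for the same model, comparing the normalized vectors is one way to check whether they differ only in scale.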

You typically do data-parallel style inference on sentence-transformers. TP is used when one GPU can't handle the desired batch size or the model at all. Unless there are some compelling...
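A minimal sketch of what data-parallel inference means here, assuming each replica holds a full copy of the model and handles a slice of the batch. `fake_encode` is a hypothetical placeholder for a real model call, and threads stand in for separate devices.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_encode(replica_id, sentences):
    """Hypothetical stand-in for one full model replica running on one device."""
    return [(replica_id, len(s)) for s in sentences]


def split_contiguous(items, num_parts):
    """Split items into num_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def data_parallel_encode(sentences, num_replicas):
    """Each replica encodes its own chunk; results are joined in input order."""
    chunks = split_contiguous(sentences, num_replicas)
    with ThreadPoolExecutor(max_workers=num_replicas) as pool:
        parts = list(pool.map(fake_encode, range(num_replicas), chunks))
    return [emb for part in parts for emb in part]


sentences = ["a", "bb", "ccc", "dddd", "eeeee"]
embeddings = data_parallel_encode(sentences, num_replicas=2)
```

Tensor parallelism would instead split a single model's weights across devices, which adds communication on every layer; that is why it is mainly worth it when one device cannot hold the model or the desired batch.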