Michael Feil
Hey @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any roadblockers with regard to encoder-only architectures? Will the GEMM kernels work for non-causal masked...
@gchhablani I am _relatively_ confident the following quantization code should do the trick.

```python
class WeightOnlyInt8Linear(Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor
    bias: Tensor
    scales: Tensor
    ...
```
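For context on what the `scales` attribute is doing there: the core of weight-only int8 quantization is symmetric per-row scaling — store int8 weights plus one float scale per row and dequantize on the fly. A minimal pure-Python sketch of that idea (illustrative only, not the torch module above):

```python
def quantize_row(weights):
    """Symmetric int8 quantization of one weight row: q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 127.0  # map the largest |w| to 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the stored scale."""
    return [x * scale for x in q]

row = [0.5, -1.27, 0.02]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)

# int8 values stay in [-127, 127]; dequantization approximates the originals
# to within one quantization step.
assert all(-127 <= x <= 127 for x in q)
assert all(abs(a - b) < scale for a, b in zip(row, restored))
```

The real module keeps one scale per output feature, which is why `scales` is a tensor rather than a single float.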
Yeah, you need an ONNX model. https://huggingface.co/Xenova/all-MiniLM-L6-v2
Does this work @netw0rkf10w ?
In this case, you're starting / stopping the engine on every call. Instead of `async with`, you can also call `engine.astart()` and `engine.astop()` yourself. The start/stop is what should take the most time.
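To sketch the difference between the two patterns (using a hypothetical `DummyEngine` stand-in, not infinity's actual engine class — only the `astart()`/`astop()` names come from the comment above):

```python
import asyncio

class DummyEngine:
    """Hypothetical stand-in for an engine with astart/astop methods."""
    def __init__(self):
        self.started = False

    async def astart(self):
        self.started = True  # expensive model loading would happen here

    async def astop(self):
        self.started = False

    async def __aenter__(self):
        await self.astart()
        return self

    async def __aexit__(self, *exc):
        await self.astop()

async def main():
    # Pattern 1: the context manager starts and stops the engine
    # around every block -- costly if done per request.
    async with DummyEngine() as engine:
        assert engine.started

    # Pattern 2: start once, serve many requests, stop at shutdown.
    engine = DummyEngine()
    await engine.astart()
    # ... handle many requests here without paying startup cost each time ...
    await engine.astop()

asyncio.run(main())
```

The point is to keep the engine alive across requests so the expensive startup happens only once.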
Updated the docs and the readme, @netw0rkf10w! Note that it should not be significantly faster for 1 embedding of 1 short sentence. Expect significant speedups for large batches /...
Assuming this can be closed, as there is no further activity?
@stephenleo There are two things to check here. 1. infinity normalizes your encodings. For a good reason: the magnitude of the embeddings is mostly irrelevant, and will likely lead to...
You typically do data-parallel style inference on sentence-transformers. TP is used when one GPU can't handle the desired batch size or the model at all. Unless there are some compelling...
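Data-parallel here just means sharding the input batch across full replicas of the model, one per GPU; each replica encodes its shard independently. A minimal sketch of the splitting logic (the function name and chunking are illustrative, not sentence-transformers' API):

```python
def shard_batch(sentences, n_devices):
    """Split a batch into n_devices roughly equal chunks, one per model replica."""
    chunks = [[] for _ in range(n_devices)]
    for i, s in enumerate(sentences):
        chunks[i % n_devices].append(s)  # round-robin assignment
    return chunks

batch = [f"sentence {i}" for i in range(10)]
shards = shard_batch(batch, 4)

# Each of the 4 replicas encodes its own shard with a full copy of the model;
# with tensor parallelism, by contrast, every GPU holds only a slice of the model.
assert sum(len(c) for c in shards) == len(batch)
```

This is why data parallelism is the default for embedding models: they fit on one GPU, so replicating them and splitting the batch scales throughput with no cross-GPU communication.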