Michael Feil
Hey @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any roadblockers with regard to encoder-only architectures? Will the GEMM kernels work for non-causal masked...
@gchhablani I am _relatively_ confident the following quantization code should do the trick.

```python
class WeightOnlyInt8Linear(Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor
    bias: Tensor
    scales: Tensor
    ...
```
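For context on what the `scales` attribute is doing there: the core of weight-only int8 quantization is symmetric per-row scaling — store int8 weights plus one float scale per row and dequantize on the fly. A minimal pure-Python sketch of that idea (illustrative only, not the torch module above):

```python
def quantize_row(weights):
    """Symmetric int8 quantization of one weight row: q = round(w / scale)."""
    scale = max(abs(w) for w in weights) / 127.0  # map the largest |w| to 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and the stored scale."""
    return [x * scale for x in q]

row = [0.5, -1.27, 0.02]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)

# int8 values stay in [-127, 127]; dequantization approximates the originals
# to within one quantization step.
assert all(-127 <= x <= 127 for x in q)
assert all(abs(a - b) < scale for a, b in zip(row, restored))
```

The real module keeps one scale per output feature, which is why `scales` is a tensor rather than a single float.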
Yeah, you need an ONNX model. https://huggingface.co/Xenova/all-MiniLM-L6-v2
Does this work @netw0rkf10w ?
In this case, you're starting / stopping the engine on every call. Instead of `async with`, you can also call `engine.astart()` and `engine.astop()` yourself. The start/stop is what should take the most time.
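To sketch the difference between the two patterns (using a hypothetical `DummyEngine` stand-in, not infinity's actual engine class — only the `astart()`/`astop()` names come from the comment above):

```python
import asyncio

class DummyEngine:
    """Hypothetical stand-in for an engine with astart/astop methods."""
    def __init__(self):
        self.started = False

    async def astart(self):
        self.started = True  # expensive model loading would happen here

    async def astop(self):
        self.started = False

    async def __aenter__(self):
        await self.astart()
        return self

    async def __aexit__(self, *exc):
        await self.astop()

async def main():
    # Pattern 1: the context manager starts and stops the engine
    # around every block -- costly if done per request.
    async with DummyEngine() as engine:
        assert engine.started

    # Pattern 2: start once, serve many requests, stop at shutdown.
    engine = DummyEngine()
    await engine.astart()
    # ... handle many requests here without paying startup cost each time ...
    await engine.astop()

asyncio.run(main())
```

The point is to keep the engine alive across requests so the expensive startup happens only once.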
Updated the docs and the readme, @netw0rkf10w! Note that it should not be significantly faster for 1 embedding of 1 short sentence. Expect significant speedups for large batches /...
Assuming this can be closed, as there is no further activity?
@stephenleo There are two things to check here. 1. infinity normalizes your encodings. For a good reason: the magnitude of the embeddings is mostly irrelevant, and will likely lead to...
You typically do data-parallel style inference on sentence-transformers. TP is used when one GPU can't handle the desired batch size or the model at all. Unless there are some compelling...
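Data-parallel here just means sharding the input batch across full replicas of the model, one per GPU; each replica encodes its shard independently. A minimal sketch of the splitting logic (the function name and chunking are illustrative, not sentence-transformers' API):

```python
def shard_batch(sentences, n_devices):
    """Split a batch into n_devices roughly equal chunks, one per model replica."""
    chunks = [[] for _ in range(n_devices)]
    for i, s in enumerate(sentences):
        chunks[i % n_devices].append(s)  # round-robin assignment
    return chunks

batch = [f"sentence {i}" for i in range(10)]
shards = shard_batch(batch, 4)

# Each of the 4 replicas encodes its own shard with a full copy of the model;
# with tensor parallelism, by contrast, every GPU holds only a slice of the model.
assert sum(len(c) for c in shards) == len(batch)
```

This is why data parallelism is the default for embedding models: they fit on one GPU, so replicating them and splitting the batch scales throughput with no cross-GPU communication.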