Embeddings/MLX sentence transformers

Open andersonbcdefg opened this issue 1 year ago • 2 comments

Create a sentence-transformers-like BERT model that can run MTEB with any BERT-based text embedding model. Direct integration with Hugging Face, with no on-disk conversion step. Tested on bge-small to confirm roughly similar scores to running with SentenceTransformers. Works with quantization.

Doesn't seem to be numerically identical to the HF model, but MTEB scores are similar so I think it's working.
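To make the "no on-disk conversion step" point concrete, here is a minimal sketch of pulling Hugging Face BERT weights into MLX arrays in memory. The function name and model id are illustrative, not the PR's actual API:

```python
import mlx.core as mx
from transformers import AutoModel

def load_hf_weights(model_name: str) -> dict:
    # Download the PyTorch checkpoint from the Hub and convert each tensor
    # to an MLX array in memory, without writing a converted file to disk.
    hf_model = AutoModel.from_pretrained(model_name)
    return {k: mx.array(v.numpy()) for k, v in hf_model.state_dict().items()}

# bge-small is the model used for the MTEB comparison in this thread.
weights = load_hf_weights("BAAI/bge-small-en")
```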

andersonbcdefg avatar Jan 09 '24 08:01 andersonbcdefg

I might have messed this up GitHub-wise because it's showing all the changes you made to my previous speculative decoding example. Let me know if this is a problem and I can retry this PR. The only things that should have changed are my two new files, plus the fix to the BERT pooler in the BERT example, because that was wrong.

andersonbcdefg avatar Jan 09 '24 08:01 andersonbcdefg

Yep, happy to address all the points you've raised! For renaming the modules so that keys match, how would you suggest handling cases where the Transformers BERT model has more modular/nested modules, e.g. separate BertIntermediate and BertOutput layers? I don't think I can just stick periods into the names of modules, so it would be a choice between creating more modules than strictly needed vs. having some names that don't map over. I'll keep thinking about this too; just curious about your thoughts.

andersonbcdefg avatar Jan 10 '24 01:01 andersonbcdefg

@andersonbcdefg sorry for the delayed reply, I didn't notice your question.

You are right, you can't just put periods in the names. If avoiding the name mapping means a proliferation of silly little classes, then I would just do the name mapping; I wouldn't bend over backwards to avoid remapping names.
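For illustration, the name-mapping approach might look something like the sketch below; the substitutions are hypothetical examples, not the PR's actual mapping, and `weights` refers to the raw Hugging Face key/array dict:

```python
def remap_key(hf_key: str) -> str:
    # Rename Hugging Face BERT parameter keys to match a flatter MLX module
    # tree. The substitutions here are illustrative only; order matters so
    # the attention output is handled before the generic "output.dense".
    replacements = [
        ("attention.output.dense", "attention.out_proj"),  # BertSelfOutput
        ("intermediate.dense", "mlp.linear1"),              # BertIntermediate
        ("output.dense", "mlp.linear2"),                    # BertOutput
        ("LayerNorm", "layer_norm"),
    ]
    for old, new in replacements:
        hf_key = hf_key.replace(old, new)
    return hf_key

mlx_weights = {remap_key(k): v for k, v in weights.items()}
```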

Other than that, how is this PR coming along? Let me know if I can help at all, I'm excited to get it in!!

awni avatar Jan 13 '24 16:01 awni

Hey, just made some updates responding to the comments here. I changed the setup of BERT so that fewer keys have to be swapped (still some, but fewer). :) Also changed the names of things as requested.

andersonbcdefg avatar Jan 14 '24 05:01 andersonbcdefg

Great! We also need a README. I can help with that; just let me know your plan and when I should review.

awni avatar Jan 14 '24 06:01 awni

Ok, ready for review!

andersonbcdefg avatar Jan 14 '24 08:01 andersonbcdefg

Hey @awni when do you think you'll get around to this one?

andersonbcdefg avatar Jan 18 '24 04:01 andersonbcdefg

Hey! I will take a look shortly (next 1-2 days), sorry for the delay!

awni avatar Jan 18 '24 05:01 awni

@andersonbcdefg sorry for the delay. I rebased this and ran the formatting. I'm doing a little work on it now.

Just curious, what were the results you were getting? For the MLX model in 32-bit I'm seeing:

{'Banking77Classification': 0.807987012987013, 'STS12': 0.7597192919985193}

Did you try lower precision? How were the results? If they're not good, I'm not sure it makes sense to support quantization. Maybe we can make float16 work instead.
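For reference, a bare-bones MTEB harness along these lines might look like the following sketch; `mlx_embed` is a stand-in for whatever function produces the pooled MLX embeddings, so this is not the script actually used here:

```python
import numpy as np
from mteb import MTEB

class MLXEmbedder:
    """Minimal wrapper: MTEB only needs an `encode` method that maps a list
    of sentences to one embedding vector per sentence."""

    def __init__(self, embed_fn, batch_size=32):
        self.embed_fn = embed_fn  # hypothetical MLX forward + pooling
        self.batch_size = batch_size

    def encode(self, sentences, **kwargs):
        out = []
        for i in range(0, len(sentences), self.batch_size):
            out.append(np.array(self.embed_fn(sentences[i : i + self.batch_size])))
        return np.concatenate(out, axis=0)

evaluation = MTEB(tasks=["Banking77Classification", "STS12"])
evaluation.run(MLXEmbedder(mlx_embed))
```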

awni avatar Jan 25 '24 04:01 awni

Lower precision results were decent/usable I thought! But it's not super important to support given how small these models are; the overhead of quantizing/dequantizing probably isn't worth it in most scenarios. I'll let you make the call!
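If float16 ends up being the route, a minimal sketch of what casting an already-loaded MLX model (here just called `model`) to half precision could look like:

```python
import mlx.core as mx
from mlx.utils import tree_map

# Cast every parameter of the loaded MLX BERT model to float16 in place.
model.update(tree_map(lambda p: p.astype(mx.float16), model.parameters()))
```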

I am also updating the MTEB test to use normalize=False, because BGE-small is expected to get 0.85 on Banking77Classification, so 0.807 is a not-insignificant penalty. I got 0.83 with normalize=False, so I'll leave that setting.

andersonbcdefg avatar Jan 25 '24 21:01 andersonbcdefg

Cool. What about F32? It's about 1% worse than the torch version. Did you see the same? I can spend a little time investigating, but I also want to make sure I'm not doing anything wrong on my end.

awni avatar Jan 25 '24 21:01 awni

See if it looks better on your end with normalize=False. It doesn't affect metrics based on cosine similarity, but it does seem to make a difference for e.g. Banking77Classification. If there is a bug, it's probably in the pooling or normalization. There are so many different pooling configurations (use pooler_output from BertPooler, use the CLS token from the hidden state, mean pool, etc.) that it's easy to mess up. It seems like using the pooler_output is the right thing to do for BGE-small, though.
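To illustrate the configurations being compared (names here are generic, not the PR's API), a small pooling helper in mlx.core might look like:

```python
import mlx.core as mx

def pool(last_hidden, attention_mask, strategy="mean", normalize=False):
    # last_hidden: (batch, seq_len, dim), attention_mask: (batch, seq_len).
    # Note: the pooler_output route (BertPooler's dense + tanh over the CLS
    # token) needs the extra pooler weights and is not shown here.
    if strategy == "mean":
        # Average the token embeddings, ignoring padding positions.
        mask = mx.expand_dims(attention_mask, -1).astype(last_hidden.dtype)
        emb = (last_hidden * mask).sum(axis=1) / mask.sum(axis=1)
    elif strategy == "cls":
        # Take the first (CLS) token of the last hidden state.
        emb = last_hidden[:, 0, :]
    else:
        raise ValueError(f"unknown pooling strategy: {strategy}")
    if normalize:
        # Optional L2 normalization of the sentence embedding.
        emb = emb / mx.sqrt((emb * emb).sum(axis=-1, keepdims=True))
    return emb
```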

andersonbcdefg avatar Jan 25 '24 21:01 andersonbcdefg

I think it helped, now I see:

{'Banking77Classification': 0.8325974025974027, 'STS12': 0.7584972019004673}

The STS12 score is still a bit worse than the torch model's, but the banking classification is better.

awni avatar Jan 26 '24 00:01 awni

Yup, that's what I was getting too. Not sure what explains the remaining discrepancy.

andersonbcdefg avatar Jan 26 '24 06:01 andersonbcdefg

@andersonbcdefg could you comment a bit on what this example adds beyond the original MLX Bert example? Is it mostly the MTEB evaluation? If so, maybe the right call is to modify the MLX Bert example and add the sample evaluation script? What do you think?

awni avatar Jan 26 '24 16:01 awni

Yeah, my thought is that it makes sentence embeddings usable in MLX, which requires the pooling, handling batches, truncation, etc. and also makes it easy to load the models. This would be ok as a BERT wrapper + evaluation script too.

andersonbcdefg avatar Jan 26 '24 22:01 andersonbcdefg

I support embeddings being a separate example. sentence-transformers models are hosted on the Hugging Face Hub, but the info about the pooling layers is not stored in the .bin file.

Ideally, instead of specifying the pooling strategy in the constructor, the pooling layers should be constructed by reading config.json and the layer folders (1_Pooling, 2_Dense, 3_Normalize, etc.).
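A hedged sketch of that idea, assuming the usual sentence-transformers repo layout (modules.json pointing at folders such as 1_Pooling with their own config.json); the key names below are the common ones, not guaranteed for every model:

```python
import json
from pathlib import Path

def read_pooling_config(repo_dir: str) -> dict:
    # Find the Pooling module listed in modules.json and load its config.
    repo = Path(repo_dir)
    modules = json.loads((repo / "modules.json").read_text())
    for module in modules:
        if module["type"].endswith("Pooling"):
            return json.loads((repo / module["path"] / "config.json").read_text())
    return {}

cfg = read_pooling_config("path/to/local/bge-small-en")  # hypothetical local snapshot
use_mean_pooling = cfg.get("pooling_mode_mean_tokens", False)
use_cls_pooling = cfg.get("pooling_mode_cls_token", False)
```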

sciencecw avatar Jan 27 '24 03:01 sciencecw

The PyTorch tokenizer has poor performance and uses too much memory. Replacing it with MLX would be good news. 😄

sweetcard avatar Feb 01 '24 02:02 sweetcard

Well, while this is stuck I'm working on a separate repo for embeddings in MLX.

andersonbcdefg avatar Feb 10 '24 20:02 andersonbcdefg

@andersonbcdefg sorry, I got kind of stuck on this myself w.r.t. how it should integrate with our BERT example.

Maybe the answer is that it shouldn't... but then it doesn't add a ton of MLX-specific functionality beyond that. The pooling into a single embedding and the evals on MTEB are very nicely done, and I can very much see how they are useful, but I'm not sure it's unique enough to merit a standalone example here.

Maybe it makes more sense as a separate community-built repo? That would allow you to move faster adding new features and decouple it from the constraints of MLX Examples. If you go that route, we should include it in our featured projects, and I'd be happy to point people to it!

Let me know what you think!

awni avatar Feb 10 '24 21:02 awni

works for me! i'll ping you when it's ready to share

andersonbcdefg avatar Feb 10 '24 23:02 andersonbcdefg

Awesome can't wait!

awni avatar Feb 10 '24 23:02 awni