
Support all-mpnet-base-v2

Open diptanu opened this issue 2 years ago • 6 comments

I am looking into adding support for sentence-transformers/all-mpnet-base-v2. I have successfully extracted the Rust weights, and the model files are here: https://huggingface.co/diptanuc/all-mpnet-base-v2

However, the SentenceEmbeddingBuilder doesn't understand the mpnet architecture. Any thoughts on how new architectures can be added to the library?

diptanu avatar Apr 24 '23 00:04 diptanu

Hello, the MPNet architecture would have to be added as a supported model before it can be used for sentence embeddings. The steps are as follows:

  1. Create an MPNet tokenizer in https://github.com/guillaume-be/rust-tokenizers. It seems MPNet is mostly based on a BERT tokenizer, so it may be possible to re-use most of the tokenization code and just define an MPNetVocab, or possibly even load the MPNet tokenizer/vocab files directly in a BertTokenizer. This would have to be tested for equivalence.
  2. Create an MPNet architecture, similar to the other model files. The model architecture looks fairly simple and should be straightforward to port to Rust.
  3. Register the new MPNet architecture for the supported classes (sequence classification, MLM, token classification, and sentence embeddings).
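
As a rough illustration of the "tested for equivalence" part of step 1, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: `load_vocab`, `encode` and `first_mismatch` are hypothetical helpers and not part of rust-tokenizers, the vocabulary and special tokens are toy placeholders, and the whitespace encoder only stands in for real WordPiece tokenization. In practice the two closures would wrap the reference tokenizer output and the candidate Rust tokenizer.

```rust
use std::collections::HashMap;

/// Builds a token -> id map from vocab-file contents (one token per line,
/// where the line index is the id). Hypothetical helper, not a library API.
fn load_vocab(contents: &str) -> HashMap<String, i64> {
    contents
        .lines()
        .enumerate()
        .map(|(id, token)| (token.trim_end().to_string(), id as i64))
        .collect()
}

/// Toy encoder: lowercases, splits on whitespace, and wraps the ids in
/// start/end tokens. It only exists to drive the equivalence check below.
fn encode(vocab: &HashMap<String, i64>, text: &str, start: &str, end: &str, unk: &str) -> Vec<i64> {
    // Fall back to -1 if the unknown token itself is missing from the vocab.
    let unk_id = vocab.get(unk).copied().unwrap_or(-1);
    let id_of = |token: &str| vocab.get(token).copied().unwrap_or(unk_id);
    let mut ids = vec![id_of(start)];
    ids.extend(text.to_lowercase().split_whitespace().map(|t| id_of(t)));
    ids.push(id_of(end));
    ids
}

/// Returns the first input on which the two tokenizers produce different ids.
fn first_mismatch<'a, A, B>(inputs: &[&'a str], reference: A, candidate: B) -> Option<&'a str>
where
    A: Fn(&str) -> Vec<i64>,
    B: Fn(&str) -> Vec<i64>,
{
    inputs
        .iter()
        .copied()
        .find(|text| reference(*text) != candidate(*text))
}

fn main() {
    // Toy vocabulary; a real check would read the model's vocab file instead.
    let vocab = load_vocab("<s>\n<pad>\n</s>\n[UNK]\nhello\nworld");
    let inputs = ["Hello world", "hello unseen"];

    // Both sides use the same toy encoder here, so they are expected to agree.
    let reference = |t: &str| encode(&vocab, t, "<s>", "</s>", "[UNK]");
    let candidate = |t: &str| encode(&vocab, t, "<s>", "</s>", "[UNK]");

    match first_mismatch(&inputs, reference, candidate) {
        Some(text) => println!("tokenizers disagree on: {text}"),
        None => println!("tokenizers agree on all {} inputs", inputs.len()),
    }
}
```

Running the same check over a larger and more varied set of sentences (punctuation, accents, long words that split into sub-tokens) would give more confidence before deciding whether a dedicated vocab type is needed.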

guillaume-be avatar Apr 24 '23 17:04 guillaume-be

@guillaume-be Thanks for your feedback! I will fork the repo, make the changes and send you a PR :)

diptanu avatar Apr 24 '23 20:04 diptanu

Ah WONDERFUL @diptanu even I was looking for this. WAITING for your results :zap: Thanks!

AJV009 avatar Apr 28 '23 00:04 AJV009

:grimacing: Is anyone working on this? :see_no_evil:

AJV009 avatar Jul 07 '23 21:07 AJV009

@AJV009 I am not sure - would you like to start working on it?

guillaume-be avatar Jul 11 '23 17:07 guillaume-be

I do have the time, but I would need more guidance, as I am just a Rust beginner. :grin:

If you could explain the points you mentioned here in a little more detail, @guillaume-be:

> Hello, the MPNet architecture would have to be added as a supported model before it can be used for sentence embeddings. The steps are as follows:
>
>   1. Create an MPNet tokenizer in https://github.com/guillaume-be/rust-tokenizers. It seems MPNet is mostly based on a BERT tokenizer, so it may be possible to re-use most of the tokenization code and just define an MPNetVocab, or possibly even load the MPNet tokenizer/vocab files directly in a BertTokenizer. This would have to be tested for equivalence.
>   2. Create an MPNet architecture, similar to the other model files. The model architecture looks fairly simple and should be straightforward to port to Rust.
>   3. Register the new MPNet architecture for the supported classes (sequence classification, MLM, token classification, and sentence embeddings).

Then I can at least give it a try :) Also, I just wanted to mention that this whole rust-bert project is a great initiative. I tried using sentence embeddings for a real-time search application, and the embeddings were generated in less than 60 ms :exploding_head: :zap: (Sure, I know things would change under traffic with multiple concurrent requests, but it is still impressive so far.)

AJV009 avatar Jul 11 '23 17:07 AJV009