OpenNIR
Is it possible to export models?
Is it possible to export models so that we can use it outside of your ranking pipeline? For example BERT models fine tuned on MSMARCO.
Yeah, all the model weights and so on are stored in ~/data/onir/models/default/{ranker}/{vocab}/{trainer}/{train_dataset}/weights/ -- there you'll find 3 files: one with the initial weights, one for the optimal epoch on the validation set, and one from the last epoch validated (so the pipeline can continue training from there if needed).
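To make that directory layout concrete, here's a tiny path helper. This is just an illustrative sketch of the layout described above; the argument values depend entirely on your own pipeline config, and the helper itself is not part of OpenNIR.

```python
import os

# Hypothetical helper mirroring the OpenNIR weights layout quoted above.
# None of the default values here are prescribed by OpenNIR itself.
def weights_dir(ranker, vocab, trainer, train_dataset,
                base=os.path.expanduser("~/data/onir/models/default")):
    return os.path.join(base, ranker, vocab, trainer, train_dataset, "weights")
```

Listing the resulting directory should show the three weight files described above.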
Is there a particular format you'd need for the export?
Yes, I would like to export the SLEDGE models. But just to be clear, the objective would be to predict with them from Python.
Similar to what we do with sentence-transformers library. Example:
embedder = SentenceTransformer('bert-base-nli-mean-tokens')
corpus_embeddings = embedder.encode(["this is a sentence"])
Not sure if that is possible. Haven't found any doc.
Ah, sorry I misunderstood what you meant by export.
OpenNIR was designed primarily with a CLI in mind. If you have other queries you want to run on the same dataset, I have a quick+dirty suggestion here. There's also the flex dataset if you want an alternative collection of documents as well.
If you're not running an IR experiment (or you'd prefer not to use the CLI), it's possible to create the underlying objects. You'll want to use VanillaTransformer and BertVocab for the SLEDGE models. The CLI does a lot of the heavy lifting regarding configuration and so on (e.g., you'll need to manually set the vocab's bert_base to scibert, and so forth).
I was looking for a way to load the model as a Hugging Face model. These need a config.json file and a model.bin file. I was wondering what format these files are in, and how to convert them to something that can be opened in HF. I tried this
from transformers import BertModel
BertModel.from_pretrained('/content/sledge-med.p')
And got
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The sledge-med.p file (and all weight files from OpenNIR, for that matter) is just from torch.save -- a pickle-encoded dict of PyTorch tensors, if I recall properly. To load it into the transformers library, you'll need to rename some of the parameters, because the VanillaTransformer ranker uses BERT as a sub-module. The transformers config should be the same as the SciBERT config file.
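To sketch what that renaming might look like: the snippet below strips a sub-module prefix from the state-dict keys so they match what transformers' BertModel expects. The "encoder.bert." prefix is an assumption -- inspect your checkpoint's keys (state_dict.keys()) to find the prefix your OpenNIR version actually uses.

```python
def rename_bert_keys(state_dict, prefix="encoder.bert."):
    """Strip an assumed sub-module prefix so keys match BertModel's names.

    Parameters outside the BERT sub-module (e.g. a ranking head) are
    dropped, since BertModel has no slot for them.
    """
    return {k[len(prefix):]: v for k, v in state_dict.items()
            if k.startswith(prefix)}

# Sketch of the full round trip (assumes torch + transformers are installed;
# the SciBERT model name below is one plausible choice, not a given):
# import torch
# from transformers import BertConfig, BertModel
# sd = torch.load("sledge-med.p", map_location="cpu")
# config = BertConfig.from_pretrained("allenai/scibert_scivocab_uncased")
# model = BertModel(config)
# model.load_state_dict(rename_bert_keys(sd), strict=False)
```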
Hope this helps!
Has anyone been able to create a Python object of the model? I made an attempt here https://colab.research.google.com/drive/15Gak_LmwEWPbJo3w_EVG8FyfLMWPwoGh?usp=sharing but wasn't able to create the model successfully.
You're trying to load up a transformers version of it, right? If so, this should do the trick! https://colab.research.google.com/drive/1t5UdW2Jebue1php888ldDll6yG5jQXQQ?usp=sharing (based off the starting point from the link above). I have not tested it on an actual ranking task, but it loads properly, and it worked with a quick toy example.
Does this meet your needs too, @thigm85?
Thanks Sean, very much appreciated!
Thanks, I am using that as well. Could you clarify what the scores shown in that example mean? I understand these are the logits of the classification head on the CLS token. Specifically, you show the score of the 0th class -- does this correspond to the relevant or the non-relevant score?
Hi @timbmg,
Yeah, the 0th class corresponds to the relevance score (using the convention from Nogueira et al.).
- sean
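Concretely, under that convention the relevance score for each query-document pair is just column 0 of the classifier logits. A minimal sketch (list-based here purely for illustration; in practice logits would be a tensor):

```python
# Hypothetical sketch of the convention mentioned above: the logit for
# class 0 is read off as the relevance score. `logits` is assumed to be
# shaped (batch, 2), one row per query-document pair.
def relevance_scores(logits):
    return [row[0] for row in logits]
```

Documents would then be ranked by sorting on these scores in descending order.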