OpenNIR
the effectiveness of the trained sledge-med.p model
Dear Sean MacAvaney,
I loaded the trained model sledge-med.p following the instructions in [https://colab.research.google.com/drive/1t5UdW2Jebue1php888ldDll6yG5jQXQQ?usp=sharing] and tried to reproduce the results on the trec-covid round 1 dataset. However, the output does not seem to outperform the classic BM25 results. Could you please verify whether the uploaded sledge-med.p is effective and whether the instructions in the shared Google doc are correct? Thank you so much!
In your example ( https://colab.research.google.com/drive/1t5UdW2Jebue1php888ldDll6yG5jQXQQ?usp=sharing ), why do you ignore the pooler layer? Also, it seems that in your toy example the logits come from a single token rather than from the whole sentence. Thanks, and I really appreciate your reply.
However, the output does not seem to outperform the classic BM25 results. Could you please verify whether the uploaded sledge-med.p is effective and whether the instructions in the shared Google doc are correct
Thanks for reporting, I'm looking into this and will get back to you.
why do you ignore the pooler layer
That was just a relatively arbitrary design decision -- essentially, whether or not to initialize the ranking score from the representation trained for the NSP task. I'm not sure whether anybody has studied which way is more effective.
And it seems that in your toy example the logits come from a single token rather than from the whole sentence.
I'm not sure what you mean here. It takes the representation from the first [CLS]
token, which is the conventional way to represent the whole sequence. See Figure 3(a) here.
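To make the two options concrete, here is a minimal shape-level sketch (plain NumPy with random stand-ins for BERT's outputs -- not the actual OpenNIR or SLEDGE code). It contrasts scoring from the raw [CLS] hidden state against first passing [CLS] through a pooler (a tanh dense layer, as pretrained for NSP); the weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for BERT's last hidden states: [batch, seq_len, hidden]
hidden = rng.normal(size=(2, 8, 4))

# The [CLS] token sits at position 0 and conventionally represents
# the whole sequence.
cls = hidden[:, 0, :]                 # [batch, hidden]

# Option A (as in the Colab example): score directly from the raw
# [CLS] hidden state with a linear ranking head (hypothetical weights).
w_rank = rng.normal(size=(4,))
score_a = cls @ w_rank                # one relevance score per pair

# Option B: run [CLS] through the pooler (dense + tanh, the layer
# pretrained for the NSP task), then apply the same ranking head.
w_pool = rng.normal(size=(4, 4))
b_pool = np.zeros(4)
pooled = np.tanh(cls @ w_pool + b_pool)
score_b = pooled @ w_rank

print(score_a.shape, score_b.shape)   # (2,) (2,)
```

Either way each query/document pair ends up with a single scalar score; the options differ only in whether the NSP-pretrained pooler weights are reused to initialize the scoring path.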
Thanks for getting back so quickly. I tried some arbitrary text segments and calculated their relevance scores against the query 'Is Hydroxycholoroquine effective?', and the results are weird. For example, the relevance score of 'dog cat' is -1.3087, which is even higher than the scores of the two sentences in the toy example.
Weird -- maybe there's some problem with the Colab example I put together. But I also suspect that the model isn't robust to adversarial text like "dog cat" -- it was only trained on in-domain text.
The easiest way to reproduce the results is to run the following pipeline in OpenNIR:
bash scripts/pipeline.sh config/sledge/ pipeline.test=True
It takes about an hour (probably a bit longer if data needs to be downloaded), but I get:
SLEDGE:
judged@5=0.9800 ndcg@10=0.6917 p@5=0.7867 p_rel-2@5=0.6400
BM25:
judged@5=0.9200 ndcg@10=0.5156 p@5=0.6133 p_rel-2@5=0.4667
(Actually a bit better in terms of nDCG@10 than what we reported here.)
So that's the easiest way to start with reproduction.
I'm not totally sure why the transformers demo isn't working. If you like, I could try to provide an example using the PyTerrier integration, which would let you both use the model within Notebooks/Colab and use the OpenNIR internals directly. Let me know.
Dear Sean, thanks for your reply and your patience. Yes, I would like to try the example using the PyTerrier integration.
Actually, I am not studying typical information retrieval; I was inspired by your SLEDGE-Z paper and plan to apply a similar zero-shot learning idea to data in my own (medical-related) domain. So I wanted to see whether sledge-med.p works well on other medical-related data.
This should do the trick then! Here's a colab link: https://colab.research.google.com/drive/12EdgWMKbMJxmR8XrLUr74PbASsfI8g6N?usp=sharing
And the code:
import pandas as pd
import pyterrier as pt
if not pt.started():
pt.init()
import onir_pt
sledgez = onir_pt.reranker.from_checkpoint('https://macavaney.us/files/pt-sledgez.tar.gz')
# Pass in the query/text pairs like so:
sledgez(pd.DataFrame([
{'qid': '0', 'query': 'covid symptoms', 'text': 'SARC-COV2 symptoms include a b and c'},
{'qid': '0', 'query': 'covid symptoms', 'text': 'dog cat'}
]))
# qid query text score
# 0 covid symptoms SARC-COV2 symptoms include a b and c 2.172534
# 0 covid symptoms dog cat -3.007984
Let me know if this works for you.