OpenNIR
the effectiveness of the trained sledge-med.p model
Dear Sean MacAvaney,
I loaded the trained model sledge-med.p following the instructions in [https://colab.research.google.com/drive/1t5UdW2Jebue1php888ldDll6yG5jQXQQ?usp=sharing] and tried to reproduce the results on the trec-covid round 1 dataset. However, the output does not seem to outperform the classic BM25 results. Could you please verify whether the uploaded sledge-med.p is effective and whether the instructions in the shared Google doc are correct? Thank you so much!
In your example ( https://colab.research.google.com/drive/1t5UdW2Jebue1php888ldDll6yG5jQXQQ?usp=sharing ), why do you ignore the pooler layer? Also, it seems that in your toy example the logits come from a single token rather than from the whole sentence. Thanks, and I really appreciate your reply.
However, the output does not seem to outperform the classic BM25 results. Could you please verify whether the uploaded sledge-med.p is effective and whether the instructions in the shared Google doc are correct
Thanks for reporting, I'm looking into this and will get back to you.
why do you ignore the pooler layer
That was just a relatively arbitrary design decision -- essentially, whether or not to initialize the ranking score from the representation trained for the NSP task. I'm not sure whether anybody has studied which way is more effective.
And it seems that in your toy example the logits come from a single token rather than from the whole sentence.
I'm not sure what you mean here. It takes the representation from the first [CLS]
token, which is the conventional way to represent the whole sequence. See Figure 3(a) here.
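To make the two options concrete, here is a minimal shape-level sketch (plain NumPy with random stand-ins for BERT's outputs -- not the actual OpenNIR or SLEDGE code). It contrasts scoring from the raw [CLS] hidden state against first passing [CLS] through a pooler (a tanh dense layer, as pretrained for NSP); the weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for BERT's last hidden states: [batch, seq_len, hidden]
hidden = rng.normal(size=(2, 8, 4))

# The [CLS] token sits at position 0 and conventionally represents
# the whole sequence.
cls = hidden[:, 0, :]                 # [batch, hidden]

# Option A (as in the Colab example): score directly from the raw
# [CLS] hidden state with a linear ranking head (hypothetical weights).
w_rank = rng.normal(size=(4,))
score_a = cls @ w_rank                # one relevance score per pair

# Option B: run [CLS] through the pooler (dense + tanh, the layer
# pretrained for the NSP task), then apply the same ranking head.
w_pool = rng.normal(size=(4, 4))
b_pool = np.zeros(4)
pooled = np.tanh(cls @ w_pool + b_pool)
score_b = pooled @ w_rank

print(score_a.shape, score_b.shape)   # (2,) (2,)
```

Either way each query/document pair ends up with a single scalar score; the options differ only in whether the NSP-pretrained pooler weights are reused to initialize the scoring path.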
Thanks for getting back so quickly. I tried some arbitrary text segments and calculated their relevance scores against the query 'Is Hydroxycholoroquine effective?', and the results are weird. For example, the relevance score of 'dog cat' is -1.3087, which is even higher than the scores of the two sentences in the toy example.
Weird -- maybe there's some problem with the Colab example I put together. But I also suspect that the model isn't robust to adversarial text like "dog cat" -- it was only trained on in-domain text.
The easiest way to reproduce the results is to run the following pipeline in OpenNIR:
bash scripts/pipeline.sh config/sledge/ pipeline.test=True
It takes about an hour (probably a bit longer if data needs to be downloaded), but I get:
SLEDGE:
judged@5=0.9800 ndcg@10=0.6917 p@5=0.7867 p_rel-2@5=0.6400
BM25:
judged@5=0.9200 ndcg@10=0.5156 p@5=0.6133 p_rel-2@5=0.4667
(Actually a bit better in terms of nDCG@10 than what we reported here.)
So that's the easiest way to start with reproduction.
I'm not totally sure why the transformers demo isn't working. If you like, I could try to provide an example using the PyTerrier integration, which would let you both use the model within Notebooks/Colab and use the OpenNIR internals directly. Let me know.
Dear Sean, thanks for your reply and your patience. Yes, I would like to try the example using the PyTerrier integration.
Actually, I am not studying typical information retrieval; I was inspired by your SLEDGE-Z paper and plan to apply a similar zero-shot learning idea to data in my own (medical-related) domain. So I wanted to see whether sledge-med.p works well on other medical-related data.
This should do the trick then! Here's a colab link: https://colab.research.google.com/drive/12EdgWMKbMJxmR8XrLUr74PbASsfI8g6N?usp=sharing
And the code:
import pandas as pd
import pyterrier as pt
if not pt.started():
pt.init()
import onir_pt
sledgez = onir_pt.reranker.from_checkpoint('https://macavaney.us/files/pt-sledgez.tar.gz')
# Pass in the query/text pairs like so:
sledgez(pd.DataFrame([
{'qid': '0', 'query': 'covid symptoms', 'text': 'SARC-COV2 symptoms include a b and c'},
{'qid': '0', 'query': 'covid symptoms', 'text': 'dog cat'}
]))
# qid query text score
# 0 covid symptoms SARC-COV2 symptoms include a b and c 2.172534
# 0 covid symptoms dog cat -3.007984
Let me know if this works for you.