
Dataset Preparation

Open alishiba14 opened this issue 5 years ago • 9 comments

Hi, I have a dataset of 16000 docs and some queries. Each query can have more than one relevant document. Can you tell me how to prepare my data and also how to run the evaluation?

alishiba14 avatar May 13 '19 09:05 alishiba14

Hi,

Are your labels binary (relevant / non-relevant)?

If so, use a baseline ranker, e.g. BM25, to retrieve the top 100 documents for each query. A training instance is then (query, a relevant doc, a non-relevant doc from the top 100).

You may want to randomly sample the non-relevant documents from the top 100 instead of using all of them.
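This pairing scheme can be sketched as follows (a minimal sketch; the function and document-id names are hypothetical, not part of the K-NRM code):

```python
import random

def make_pairs(query, top100, relevant_ids, n_neg=10, seed=0):
    """Pair each relevant doc with randomly sampled non-relevant docs
    drawn from the BM25 top-100 list for the same query."""
    rng = random.Random(seed)
    negatives = [d for d in top100 if d not in relevant_ids]
    sampled = rng.sample(negatives, min(n_neg, len(negatives)))
    # One training instance per (relevant, sampled non-relevant) pair.
    return [(query, pos, neg)
            for pos in relevant_ids if pos in top100
            for neg in sampled]

# Toy example: doc "d2" is relevant, the rest of the top list is not.
pairs = make_pairs("Apple", top100=["d1", "d2", "d3", "d4"],
                   relevant_ids={"d2"}, n_neg=2)
```

Each element of `pairs` is one (query, relevant doc, non-relevant doc) training instance; sampling only a few negatives per positive keeps the training set balanced.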


AdeDZY avatar May 13 '19 13:05 AdeDZY

Let's say for a query 'Apple' the relevant docs are 100, 120, 400 and all the rest are non-relevant. Then is 'Apple' \t 100,120,400 \t remaining docs \t score the correct representation of the training instance?

alishiba14 avatar May 13 '19 14:05 alishiba14

@alishiba14 I think that's right, but how can we represent the remaining docs? And for the score, I think we can get it from BM25, right?

giangnguyen2412 avatar May 21 '19 01:05 giangnguyen2412

Let's say a query is 'Apple', and its relevant documents are:

Very relevant doc (score=2): 'iPhone X - apple.com'
Somewhat relevant doc (score=1): 'apple inc - wikipedia'

and there are 10 other non-relevant docs retrieved by BM25:

Non-rel doc 1 (score=0): 'apple juice is healthy'
Non-rel doc 2 (score=0): 'apple is red'
...

The training instances are:

Apple \t iPhone X apple com \t apple juice is healthy \t 2
Apple \t iPhone X apple com \t apple is red \t 2
Apple \t apple inc - wikipedia \t apple juice is healthy \t 1
Apple \t apple inc - wikipedia \t apple is red \t 1

Then we map the words to word ids.
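The word-to-id mapping step could look roughly like this (a generic sketch with hypothetical helper names; check the K-NRM repo's README for the exact input format it expects):

```python
def build_vocab(texts, min_id=1):
    """Assign an integer id to each unique word.
    Id 0 is left free, as it is commonly reserved for padding/OOV."""
    vocab = {}
    for text in texts:
        for w in text.lower().split():
            if w not in vocab:
                vocab[w] = len(vocab) + min_id
    return vocab

def to_ids(text, vocab, oov=0):
    """Replace each word with its id; unknown words map to `oov`."""
    return [vocab.get(w, oov) for w in text.lower().split()]

# Convert one training instance (query \t pos doc \t neg doc \t score)
# from words to comma-separated word ids.
vocab = build_vocab(["Apple", "iPhone X apple com", "apple juice is healthy"])
line = "{}\t{}\t{}\t{}".format(
    ",".join(map(str, to_ids("Apple", vocab))),
    ",".join(map(str, to_ids("iPhone X apple com", vocab))),
    ",".join(map(str, to_ids("apple juice is healthy", vocab))),
    2,
)
```

The same vocabulary must of course be used for training and test data, and it should line up with the word embeddings the model loads.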


-- Zhuyun Dai Language Technologies Institute School of Computer Science 5000 Forbes Avenue Pittsburgh, PA 15213

AdeDZY avatar May 21 '19 01:05 AdeDZY

Could you please tell me where I can get the training data? Should I pull it from an available dataset, or do I need to pass my data through a traditional IR system (BM25) and take the BM25 results as my training data? Thank you.

giangnguyen2412 avatar May 21 '19 02:05 giangnguyen2412

Which dataset are you using? Is it your own dataset? I guess you need to pass it through a traditional IR system and take the results.
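For a small corpus like the 16000 docs mentioned above, the traditional-IR pass can be a plain Okapi BM25 scorer. A minimal sketch (my own implementation, not code from this repo; documents are pre-tokenized word lists):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every doc against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                 # term frequency within this doc
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Rank a toy corpus and keep the ordering as the candidate list.
docs = [["apple", "juice"], ["apple", "phone", "apple"], ["banana"]]
scores = bm25_scores(["apple"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Running this over each query and keeping the top 100 ranked docs gives the candidate pool from which the relevant/non-relevant training pairs are drawn.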



AdeDZY avatar May 21 '19 06:05 AdeDZY

I guess it is clear now. I will try it and get back to you. Thanks a lot for your reply. :)

alishiba14 avatar May 21 '19 06:05 alishiba14

: )


AdeDZY avatar May 21 '19 06:05 AdeDZY


That's also what I thought, thanks for the reply.

giangnguyen2412 avatar May 21 '19 07:05 giangnguyen2412