How to run a pretrained model on unlabeled data?

Open serenalotreck opened this issue 2 years ago • 5 comments

Hi,

I'm looking to apply your pretrained models to a new, unlabeled dataset. I have my dataset in DyGIE format. Looking at the script, it's unclear to me how to do this, because there are only two blocks of code in the script. The first is if args.do_train:, where the model is trained, and the second is if args.do_eval:, where the model is evaluated.

I don't want to train, since I'm using a pre-trained model, but I also don't want to evaluate, since my data don't have labels, which makes my use case different than the example of applying the pretrained scibert models to the scierc dataset.

Wondering if you have pointers on how to do this?

Thanks!

serenalotreck avatar Sep 13 '21 14:09 serenalotreck

Hi! I guess the easiest way for you to do this is to still create the "ner" and "relations" fields in your unlabeled dataset, but set them to be empty for each sentence. For example, if a document contains 4 sentences, you can set "ner" and "relations" as {..., "ner": [[], [], [], []], "relations": [[], [], [], []], ...}. After that, you can use --do_eval to generate the prediction file (and ignore the evaluation results in that case).

Thanks for pointing this out! I plan to add a --do_predict feature soon. For now, I think this could be an easy way to do only prediction.
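
For anyone following along, the workaround above could be sketched roughly as follows. This is only an illustration, not code from the repository: it assumes the data file holds one JSON document per line and that each document has a "sentences" field containing a list of token lists. The function names add_empty_labels and convert_file are made up for this example.

```python
import json


def add_empty_labels(doc):
    """Return a copy of a document with empty "ner" and "relations"
    lists, one empty list per sentence.

    Assumes doc["sentences"] is a list of token lists (hypothetical
    helper, not part of the repository).
    """
    n_sentences = len(doc["sentences"])
    labeled = dict(doc)  # shallow copy so the input dict is not modified
    labeled["ner"] = [[] for _ in range(n_sentences)]
    labeled["relations"] = [[] for _ in range(n_sentences)]
    return labeled


def convert_file(in_path, out_path):
    """Read one JSON document per line and write the padded documents."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                doc = add_empty_labels(json.loads(line))
                dst.write(json.dumps(doc) + "\n")
```

Calling convert_file on the unlabeled file and pointing --do_eval at the result should then produce a prediction file; the evaluation scores can be ignored because the gold labels are empty.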

a3616001 avatar Sep 13 '21 21:09 a3616001

I just wanted to check in to see if you thought the --do_predict feature would be available soon!

serenalotreck avatar Dec 06 '21 16:12 serenalotreck

Just wanted to leave an update for anyone trying this -- your data file should be in a file called dev.json -- I originally had mine in test.json & couldn't get it to work, but it ran once I changed it to dev.json!

Edit: I had typed test.dev, but it should be test.json

serenalotreck avatar Jan 17 '22 22:01 serenalotreck

Hi, allow me to ask a simple question. What is doc_key? According to 'please make sure doc_key can be used to identify a certain document', should I find any document in the sciERC processed data?

Hubotcoder avatar Jan 19 '23 03:01 Hubotcoder

I would like to know whether the feature for running prediction on unlabeled datasets has been added, and where I can find the relevant code. Thank you very much!

Shike-Cheng avatar May 09 '23 13:05 Shike-Cheng