
Training the system with different data

Open j6mes opened this issue 5 years ago • 5 comments

Hi, is it possible to re-train the system with different data? What scripts do I need to run to do this? There seem to be a lot of Python files and I'm not sure which ones to call.

j6mes avatar Apr 16 '19 13:04 j6mes

@j6mes Have you ever figured out how to retrain the system? I'm trying to get it working on Czech wiki, but it's very unclear how to move forward.

MichalPitr avatar Jun 25 '20 11:06 MichalPitr

No, I never needed to go through and retrain the entire system. For me, I got the best value out of just putting it into a Docker image and calling it as a black box.

Perhaps @easonnie could advise on how to retrain the system?

j6mes avatar Jun 25 '20 11:06 j6mes

@j6mes Thanks for the reply, I have played around with your fork quite a lot, so thanks for the cleaned up version. Hopefully @easonnie finds the time to advise on retraining.

MichalPitr avatar Jun 25 '20 11:06 MichalPitr

@MichalPitr I had previously experimented with training their sentence retrieval and verification models. I do not have a compact version of the training code at the moment. I will just give you some quick steps and I think it is somewhat easy to figure out the rest.

  1. Make sure you can properly run inference by following their README, since this ensures that you have all the required installations in place.
  2. Use auto_pipeline.py to get the output of the document retrieval step for both the training and dev datasets by setting the proper values for the default_steps variable. The steps to execute run from s1.tokenizing to s2.2.1doc_nn_retri. (Use the output files for the rest of the training steps.)
  3. For sentence retrieval training, use the method train_fever_v1 from the file src/sentence_retrieval/simple_nnmodel.py.
  4. For claim verification training, use the method train_fever_v1_advsample from the file src/nli/mesim_wn_simi_v1_2.py.
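
As a rough illustration of step 2 (not code from the repository), picking the steps to run amounts to taking an inclusive slice of the pipeline's ordered step list. In the sketch below, `select_steps` and `ALL_STEPS` are hypothetical names, and the entries in angle brackets are placeholders; only `s1.tokenizing` and `s2.2.1doc_nn_retri` are real step names mentioned here. Check auto_pipeline.py for the actual list and for how default_steps is expected to be set.

```python
from typing import List


def select_steps(all_steps: List[str], first: str, last: str) -> List[str]:
    """Return the inclusive run of steps from `first` to `last`."""
    start = all_steps.index(first)  # raises ValueError if the name is unknown
    end = all_steps.index(last)
    if end < start:
        raise ValueError(f"{last!r} comes before {first!r} in the step list")
    return all_steps[start:end + 1]


# Hypothetical step list: only the two named steps come from this thread.
# The bracketed entries are placeholders, not real step names.
ALL_STEPS = [
    "s1.tokenizing",
    "<intermediate document retrieval steps>",
    "s2.2.1doc_nn_retri",
    "<later sentence retrieval and verification steps>",
]

# Keep everything up to and including document retrieval.
default_steps = select_steps(ALL_STEPS, "s1.tokenizing", "s2.2.1doc_nn_retri")
print(default_steps)
```

The same selection would be done once for the training set and once for the dev set, so that both have document retrieval output before steps 3 and 4.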

Let me know if you get stuck somewhere!

ShyamSubramanian avatar Jun 26 '20 13:06 ShyamSubramanian

@ShyamSubramanian Thanks, that's really useful. I am especially interested in getting the document retrieval working on my Czech wiki database, but auto_pipeline.py uses a file, id_dict.json, that I haven't figured out how to generate from the code.

MichalPitr avatar Jun 26 '20 14:06 MichalPitr