[AdaptLLM] How to evaluate the models' performance?
Dear authors, I'm reading your paper "Adapting Large Language Models via Reading Comprehension" and have a couple of questions.
Could you please explain how to evaluate your biomedicine/finance/law AdaptLLM models? I understand that the PubMedQA benchmark can probably be evaluated with lm-evaluation-harness, but how should the other datasets, such as ChemProt and ConvFinQA, be evaluated?
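For reference, this is roughly what I had in mind for PubMedQA. It is only a minimal sketch assuming the Python API of a recent lm-evaluation-harness (v0.4+), the built-in `pubmedqa` task, and the `AdaptLLM/medicine-LLM` checkpoint on the Hugging Face Hub; please correct me if any of these names are wrong or if you evaluated differently:

```python
# Sketch of evaluating the biomedicine model on PubMedQA with lm-evaluation-harness.
# The model repo id and task name are my assumptions, not confirmed by the paper.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # HuggingFace causal-LM backend
    model_args="pretrained=AdaptLLM/medicine-LLM",   # assumed HF repo id
    tasks=["pubmedqa"],                              # built-in harness task
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```

But I do not see corresponding harness tasks for ChemProt, ConvFinQA, etc., which is why I am asking how those were scored.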
My other question is that there appear to be some repetitions in the datasets. For example, the first three items in the ChemProt test set look almost the same, although they are not strictly identical. Are they repetitions? Do we need to remove them before evaluation?
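To check whether these items are really near-duplicates, I used a quick similarity probe like the one below. Note that the dataset repo id `AdaptLLM/ChemProt` and the column name `input` are only placeholders for wherever the ChemProt test split is actually hosted; I could not find the official location:

```python
# Quick sanity check for near-duplicate items in the ChemProt test set.
# The dataset repo id and the column name below are my guesses, not confirmed.
from difflib import SequenceMatcher
from datasets import load_dataset

ds = load_dataset("AdaptLLM/ChemProt", split="test")   # assumed repo id
texts = [ex["input"] for ex in ds.select(range(3))]    # assumed column name

# Compare the first three items pairwise and report their similarity ratios.
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        ratio = SequenceMatcher(None, texts[i], texts[j]).ratio()
        print(f"item {i} vs item {j}: similarity = {ratio:.2f}")
```

If these high-similarity items are intentional (e.g., the same abstract queried for different relations), it would be helpful to know, so that we do not deduplicate them by mistake.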