ChatDoctor
Could I know how you evaluate your model performance?
See title. What's the dataset? Did you run any evaluation steps?
We performed a blind evaluation of ChatDoctor and ChatGPT against each other to fairly assess their medical capabilities. In the comparison of recommending medications based on diseases, our ChatDoctor achieved 91.25% accuracy compared to ChatGPT’s 87.5%.
This is what is described in the paper. I assume the authors manually rated the correctness of random samples (a multiple of 80) and reported the results as described above.
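For what it's worth, both reported figures divide evenly into 80, which is what makes me guess at that sample size. The denominator of 80 is my assumption, not something stated in the paper; a quick sanity check:

```python
# Hypothetical sanity check: do the reported accuracies correspond to
# whole-number correct counts out of 80 samples? (80 is an assumption.)
for accuracy in (0.9125, 0.875):
    correct = accuracy * 80
    print(f"{accuracy:.2%} of 80 = {correct:g} correct answers")
# 91.25% of 80 = 73 correct answers
# 87.50% of 80 = 70 correct answers
```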
@mahlernim Thanks for the reply. The tricky thing is that they share some samples, but those examples don't seem usable for calculating the accuracy (91.25% vs 87.5%). Normally we'd use Exact Match or multiple-choice scoring to compute accuracy, and as far as I understand, these questions are hard to evaluate that way. Do you have any clues?
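To be concrete about why I don't think it applies here, this is the kind of exact-match computation I had in mind (the data below is made up for illustration, not from the paper):

```python
# Minimal exact-match accuracy sketch (illustrative data only).
# This works for short canonical answers; free-text medical advice
# like the shared ChatDoctor samples can't be scored this way.
def exact_match_accuracy(predictions, references):
    matches = [p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

preds = ["amoxicillin", "ibuprofen", "metformin"]
refs  = ["Amoxicillin", "acetaminophen", "metformin"]
print(exact_match_accuracy(preds, refs))  # 0.666...
```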
@Jeffwan No clues actually. My educated guess is they asked blinded human experts to grade the "correctness" in a binary fashion.
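Something like the following sketch of a blinded-grading harness would be consistent with what the paper describes; to be clear, this is entirely my assumption about their protocol, not anything the authors have published. Answers from both models are shown to the grader in shuffled order with no model labels, and each is marked correct/incorrect:

```python
import random

# Hypothetical blinded binary-grading harness (my assumption, not the
# authors' published protocol). `grade_fn` stands in for a human expert
# returning True (correct) or False (incorrect) for each answer.
def blind_grade(samples, grade_fn):
    scores = {"chatdoctor": [], "chatgpt": []}
    for question, answers in samples:  # answers: {model_name: answer_text}
        items = list(answers.items())
        random.shuffle(items)  # hide which model produced which answer
        for model, answer in items:
            scores[model].append(grade_fn(question, answer))
    return {m: sum(s) / len(s) for m, s in scores.items()}

# Demo with a trivial stand-in "expert" that checks for a keyword.
samples = [("Drug for type 2 diabetes?",
            {"chatdoctor": "Metformin is commonly used.",
             "chatgpt": "Consider lifestyle changes."})]
print(blind_grade(samples, lambda q, a: "metformin" in a.lower()))
# {'chatdoctor': 1.0, 'chatgpt': 0.0}
```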