ChatDoctor
Could I know how you evaluate your model performance?
See title. What's the dataset? Did you run any evaluation steps?
We performed a blind evaluation of ChatDoctor and ChatGPT against each other to fairly assess their medical capabilities. In the comparison of recommending medications based on diseases, our ChatDoctor achieved 91.25% accuracy compared to ChatGPT’s 87.5%.
This is what is described in the paper. I assume the authors manually rated the correctness of random samples (a multiple of 80) and reported the results as described above.
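For what it's worth, both reported figures divide evenly into 80, which is what makes me guess at that sample size. The denominator of 80 is my assumption, not something stated in the paper; a quick sanity check:

```python
# Hypothetical sanity check: do the reported accuracies correspond to
# whole-number correct counts out of 80 samples? (80 is an assumption.)
for accuracy in (0.9125, 0.875):
    correct = accuracy * 80
    print(f"{accuracy:.2%} of 80 = {correct:g} correct answers")
# 91.25% of 80 = 73 correct answers
# 87.50% of 80 = 70 correct answers
```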
@mahlernim Thanks for the reply. The tricky thing is that they share some samples, but those examples don't seem usable for calculating the accuracy (91.25% vs 87.5%). Normally we'd use Exact Match or multiple-choice scoring to compute accuracy, and as far as I understand, these questions are hard to evaluate that way. Do you have any clues?
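To be concrete about why I don't think it applies here, this is the kind of exact-match computation I had in mind (the data below is made up for illustration, not from the paper):

```python
# Minimal exact-match accuracy sketch (illustrative data only).
# This works for short canonical answers; free-text medical advice
# like the shared ChatDoctor samples can't be scored this way.
def exact_match_accuracy(predictions, references):
    matches = [p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

preds = ["amoxicillin", "ibuprofen", "metformin"]
refs  = ["Amoxicillin", "acetaminophen", "metformin"]
print(exact_match_accuracy(preds, refs))  # 0.666...
```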
@Jeffwan No clues actually. My educated guess is they asked blinded human experts to grade the "correctness" in a binary fashion.
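Something like the following sketch of a blinded-grading harness would be consistent with what the paper describes; to be clear, this is entirely my assumption about their protocol, not anything the authors have published. Answers from both models are shown to the grader in shuffled order with no model labels, and each is marked correct/incorrect:

```python
import random

# Hypothetical blinded binary-grading harness (my assumption, not the
# authors' published protocol). `grade_fn` stands in for a human expert
# returning True (correct) or False (incorrect) for each answer.
def blind_grade(samples, grade_fn):
    scores = {"chatdoctor": [], "chatgpt": []}
    for question, answers in samples:  # answers: {model_name: answer_text}
        items = list(answers.items())
        random.shuffle(items)  # hide which model produced which answer
        for model, answer in items:
            scores[model].append(grade_fn(question, answer))
    return {m: sum(s) / len(s) for m, s in scores.items()}

# Demo with a trivial stand-in "expert" that checks for a keyword.
samples = [("Drug for type 2 diabetes?",
            {"chatdoctor": "Metformin is commonly used.",
             "chatgpt": "Consider lifestyle changes."})]
print(blind_grade(samples, lambda q, a: "metformin" in a.lower()))
# {'chatdoctor': 1.0, 'chatgpt': 0.0}
```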