Non-deterministic scores
Hey, apologies for creating multiple issue threads so rapidly, but I like this library and hope it gains more traction!
Some metrics, for example Faithfulness and AnswerCorrectness, are giving different results each time they are run, even with temperature set to 0.
I believe this might be due to using LangChain ChatOpenAI instead of OpenAI, where the latter appears more deterministic with temperature set to 0, but I'm not positive about this.
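To show what I mean, here's a minimal sketch of how I've been checking the variation, just repeating one prompt at temperature 0 and counting distinct completions (assumes the `langchain_openai` package; older LangChain versions expose the same class as `langchain.chat_models.ChatOpenAI`, and the prompt is only a placeholder):

```python
# Repeat the same prompt a few times at temperature=0 and count how many
# distinct completions come back. More than one key => non-deterministic.
from collections import Counter
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = "Rate the faithfulness of this answer on a scale of 1 to 5: ..."  # placeholder

outputs = [llm.invoke(prompt).content for _ in range(5)]
print(Counter(outputs))
```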
I'm seeing the same issue: every time I run the evaluation, it gives slightly or significantly different scores for some questions. Any advice @shahules786 ?
Is this normal? If so, what is the reason behind it, and to what extent can the difference be tolerated? Shouldn't it give the same score each time we run the evaluation?
I guess the new seed feature would be helpful here
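For reference, a minimal sketch of what the seed feature looks like with the raw `openai>=1.x` client (this bypasses ragas entirely, it's just to show the parameter):

```python
# OpenAI's `seed` gives best-effort reproducibility: same seed + same inputs
# should usually return the same completion.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    seed=42,
    messages=[{"role": "user", "content": "Score this answer for faithfulness ..."}],  # placeholder
)
print(resp.choices[0].message.content)
print(resp.system_fingerprint)  # if this changes between runs, outputs may differ too
```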
Hi @austinmw @ihgumilar , thanks for sharing your thoughts and concerns. The scores can vary due to the non-deterministic nature of LLMs; I have seen this happen more with closed-source services. The new seed feature from OpenAI might be useful, but I have yet to experiment with it. That said, I would want something that generalizes beyond OpenAI.
To cut down the noise, you can try aggregating results across more data points @ihgumilar
Self-consistency checks can be a good way to improve reproducibility.
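A rough sketch of what a self-consistency style aggregation could look like on top of any metric (the `score_once` callable is a placeholder, not a ragas API):

```python
# Hypothetical self-consistency wrapper: score the same sample several times
# and report the median to dampen run-to-run noise.
import statistics
from typing import Callable

def self_consistent_score(score_once: Callable[[], float], n_runs: int = 5) -> float:
    """Run a (possibly non-deterministic) scoring call n_runs times and take the median."""
    scores = [score_once() for _ in range(n_runs)]
    return statistics.median(scores)

# Usage (placeholder metric): final = self_consistent_score(lambda: my_metric(sample), n_runs=5)
```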
Thanks a lot for the suggestions @shahules786 @austinmw . Any suggestions on what we can do? @austinmw's suggestion could be implemented easily, I believe; we could just add a seed parameter. But I'm not sure about other models.
Cheers
I'm experiencing the same issue. I looked into this, and in my case it seems that the passed temperature value is not respected; it's either set to 0.2 or 1e-8: https://github.com/explodinggradients/ragas/blob/v0.0.22/src/ragas/llms/langchain.py#L200
1e-8 is very close to 0 but not zero. Is this a bug or an intended feature?
There also seems to be an inconsistency between the async and non-async implementations: for async, the temperature is always set to 0: https://github.com/explodinggradients/ragas/blob/v0.0.22/src/ragas/llms/langchain.py#L146
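Just to illustrate what I'd expect instead, something along these lines would keep both code paths consistent and respect whatever the caller configured (a hypothetical sketch, not the actual ragas code):

```python
# Hypothetical helper shared by the sync and async generate paths:
# use the caller's temperature if given, otherwise one common default.
DEFAULT_TEMPERATURE = 1e-8  # or 0.0; the point is that both paths agree

def resolve_temperature(user_temperature: float | None) -> float:
    return DEFAULT_TEMPERATURE if user_temperature is None else user_temperature
```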
I'm also experiencing score fluctuations between runs. Is the seed feature implemented yet?
@shahules786 FYI I set a fixed seed, but it didn't work...