Non-deterministic scores
Hey, apologies for creating multiple issue threads so rapidly, but I like this library and hope it gains more traction!
Some metrics, for example Faithfulness and AnswerCorrectness, are giving different results each time they are run, even with temperature set to 0.
I believe this might be due to using LangChain ChatOpenAI instead of OpenAI, where the latter appears more deterministic with temperature set to 0, but I'm not positive about this.
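To show what I mean, here's a minimal sketch of how I've been checking the variation, just repeating one prompt at temperature 0 and counting distinct completions (assumes the `langchain_openai` package; older LangChain versions expose the same class as `langchain.chat_models.ChatOpenAI`, and the prompt is only a placeholder):

```python
# Repeat the same prompt a few times at temperature=0 and count how many
# distinct completions come back. More than one key => non-deterministic.
from collections import Counter
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = "Rate the faithfulness of this answer on a scale of 1 to 5: ..."  # placeholder

outputs = [llm.invoke(prompt).content for _ in range(5)]
print(Counter(outputs))
```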
I'm seeing the same issue: every time I run the evaluation, it gives slightly or significantly different scores for some questions. Any advice @shahules786 ?
Is this normal? If so, what is the reason behind it, and to what extent can the difference be tolerated? Shouldn't it give the same score each time we run the evaluation?
I guess the new seed feature would be helpful here
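For reference, a minimal sketch of what the seed feature looks like with the raw `openai>=1.x` client (this bypasses ragas entirely, it's just to show the parameter):

```python
# OpenAI's `seed` gives best-effort reproducibility: same seed + same inputs
# should usually return the same completion.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    seed=42,
    messages=[{"role": "user", "content": "Score this answer for faithfulness ..."}],  # placeholder
)
print(resp.choices[0].message.content)
print(resp.system_fingerprint)  # if this changes between runs, outputs may differ too
```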
Hi @austinmw @ihgumilar , thanks for sharing your thoughts and concerns. The scores can vary due to the non-deterministic nature of LLMs; I have seen this happen more with closed-source services. The new seed feature from OpenAI might be useful, but I have yet to experiment with it. That said, I would want something that generalizes beyond OpenAI.
To cut down the noise, you can try aggregating results across more data points @ihgumilar
Self-consistency checks can be a good way to improve reproducibility.
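A rough sketch of what a self-consistency style aggregation could look like on top of any metric (the `score_once` callable is a placeholder, not a ragas API):

```python
# Hypothetical self-consistency wrapper: score the same sample several times
# and report the median to dampen run-to-run noise.
import statistics
from typing import Callable

def self_consistent_score(score_once: Callable[[], float], n_runs: int = 5) -> float:
    """Run a (possibly non-deterministic) scoring call n_runs times and take the median."""
    scores = [score_once() for _ in range(n_runs)]
    return statistics.median(scores)

# Usage (placeholder metric): final = self_consistent_score(lambda: my_metric(sample), n_runs=5)
```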
Thanks a lot for the suggestions @shahules786 @austinmw . Any suggestions on what we can do? @austinmw's suggestion could be implemented easily, I believe; we could just add a seed parameter. But I'm not sure about other models.
Cheers
I'm experiencing the same issue. I looked into this, and in my case it seems that the passed temperature value is not respected; it's either set to 0.2 or 1e-8: https://github.com/explodinggradients/ragas/blob/v0.0.22/src/ragas/llms/langchain.py#L200
1e-8 is very close to 0 but not zero. Is this a bug or an intended feature?
There also seems to be an inconsistency between the async and non-async implementations: for async, the temperature is always set to 0: https://github.com/explodinggradients/ragas/blob/v0.0.22/src/ragas/llms/langchain.py#L146
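Just to illustrate what I'd expect instead, something along these lines would keep both code paths consistent and respect whatever the caller configured (a hypothetical sketch, not the actual ragas code):

```python
# Hypothetical helper shared by the sync and async generate paths:
# use the caller's temperature if given, otherwise one common default.
DEFAULT_TEMPERATURE = 1e-8  # or 0.0; the point is that both paths agree

def resolve_temperature(user_temperature: float | None) -> float:
    return DEFAULT_TEMPERATURE if user_temperature is None else user_temperature
```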
I'm also experiencing score fluctuations between runs. Is the seed feature implemented yet?
@shahules786 FYI I set a fixed seed, but it didn't work...