Answer relevancy returns either zero or average cosine similarity
[ x ] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question I've observed that the current method for calculating answer relevancy produces an all-or-nothing outcome: the score is either 0 or the average cosine similarity. The score drops to 0 as soon as a single generated question is deemed non-committal, regardless of the actual cosine similarity between the generated questions and the original question. This leads to notable inconsistencies and edge cases, because the metric relies too heavily on the language model's ability to distinguish committal from non-committal responses. Moreover, the committal scores for the generated questions often fluctuate significantly from run to run (refer to the picture).
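To make the behaviour concrete, here is a minimal sketch of the scoring rule as I understand it from the outputs. This is not the library's actual code; the function name `all_or_nothing_relevancy` and its two parameters are hypothetical and only illustrate the effect described above.

```python
from statistics import mean


def all_or_nothing_relevancy(similarities, noncommittal_flags):
    """Hypothetical sketch: return 0 if ANY generated question is flagged
    non-committal, otherwise the average cosine similarity."""
    if len(similarities) != len(noncommittal_flags):
        raise ValueError("expected one flag per generated question")
    if any(noncommittal_flags):
        return 0.0
    return mean(similarities)


# Made-up similarity values, for illustration only.
sims = [0.91, 0.88, 0.90]
print(all_or_nothing_relevancy(sims, [False, False, False]))  # average similarity
print(all_or_nothing_relevancy(sims, [False, True, False]))   # 0.0
```

With identical similarities, a single flipped flag moves the score from roughly 0.9 to 0, which is why unstable committal judgments make the metric unstable.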
Steps to Reproduce: To observe the described issue, run the metric multiple times on the following data:
Question: 'Why did the intraday reversal occur?'
Answer: 'Traders were unclear of the reason, but some noted it could be the softer wage number in the jobs report.'
Contexts: ["Stocks rallied Friday even after the release of stronger-than-expected U.S. jobs data and a major increase in Treasury yields. The Dow Jones Industrial Average gained 195.12 points, or 0.76%, to close at 31,419.58. The S&P 500 added 1.59% at 4,008.50. The tech-heavy Nasdaq Composite rose 1.35%, closing at 12,299.68. The U.S. economy added 438,000 jobs in August, the Labor Department said. Economists polled by Dow Jones expected 273,000 jobs. However, wages rose less than expected last month. Stocks posted a stunning turnaround on Friday, after initially falling on the stronger-than-expected jobs report. At its session low, the Dow had fallen as much as 198 points; it surged by more than 500 points at the height of the rally. The Nasdaq and the S&P 500 slid by 0.8% during their lowest points in the day. Traders were unclear of the reason for the intraday reversal. Some noted it could be the softer wage number in the jobs report that made investors rethink their earlier bearish stance. Others noted the pullback in yields from the day’s highs. Part of the rally may just be to do a market that had gotten extremely oversold with the S&P 500 at one point this week down more than 9% from its high earlier this year. Yields initially surged after the report, with the 10-year Treasury rate trading near its highest level in 14 years. The benchmark rate later eased from those levels, but was still up around 6 basis points at 4.58%. 'We’re seeing a little bit of a give back in yields from where we were around 4.8%. [With] them pulling back a bit, I think that’s helping the stock market,' said Margaret Jones, chief investment officer at Vibrant Industries Capital Advisors. 'We’ve had a lot of weakness in the market in recent weeks, and potentially some oversold conditions.'"]
Suggested approach: Instead, answer relevancy could be calculated as the average cosine similarity weighted by the committal question ratio, that is, the share of generated questions judged committal. This approach aims to reflect both the depth of the language model's understanding and the actual similarity between the generated and actual questions.
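A rough sketch of what this could look like follows. It assumes that combining the two quantities means multiplying the average cosine similarity by the fraction of committal questions; that reading, the function name `ratio_weighted_relevancy`, and its parameters are my own assumptions for illustration, not existing library API.

```python
from statistics import mean


def ratio_weighted_relevancy(similarities, noncommittal_flags):
    """Hypothetical sketch: average cosine similarity multiplied by the
    share of generated questions judged committal."""
    if len(similarities) != len(noncommittal_flags):
        raise ValueError("expected one flag per generated question")
    if not similarities:
        return 0.0
    # Fraction of generated questions that were NOT flagged non-committal.
    committal_ratio = 1 - sum(noncommittal_flags) / len(noncommittal_flags)
    return mean(similarities) * committal_ratio


# Made-up similarity values, for illustration only.
sims = [0.9, 0.9, 0.9]
print(ratio_weighted_relevancy(sims, [False, False, False]))  # full average
print(ratio_weighted_relevancy(sims, [False, True, False]))   # about two thirds of it
print(ratio_weighted_relevancy(sims, [True, True, True]))     # 0.0
```

Under this sketch, one non-committal question out of three lowers the score by a third instead of zeroing it, while a fully non-committal set still scores 0.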
Additional context I am prepared to contribute further by creating a pull request (PR) for the suggested approach.
love to know more about this!