
Any website where I can share evaluation results?

Open pocca2048 opened this issue 1 year ago • 7 comments

Describe the feature or improvement you're requesting

Hi.

I was wondering if there is any website where I can share and see others' evaluation results. Do I have to run every eval locally by myself to see the accuracy of evals made by other people? Maybe I am missing something, but I think it would be good if there were a website like llm_leaderboard for evals too.

Thanks

Additional context

If someone wants to share their eval results, do something like:

oaieval gpt-3.5-turbo test-match --submit=True
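
A minimal sketch of what such a submission flow could look like, assuming oaieval can write its JSONL record to a chosen path via --record_path and that a hypothetical sharing endpoint accepts the final report; the endpoint URL and the --submit flag are imagined, not part of the current CLI:

# Hypothetical sketch: run an eval locally, then upload the final report to an
# imaginary sharing service. Neither --submit nor the endpoint exist today.
import json
import subprocess

import requests

RECORD_PATH = "/tmp/evallogs/test-match.jsonl"
EVAL_SHARE_URL = "https://example.com/api/submissions"  # placeholder endpoint

# Run the eval as usual, asking oaieval to write its JSONL record to a known path.
subprocess.run(
    ["oaieval", "gpt-3.5-turbo", "test-match", "--record_path", RECORD_PATH],
    check=True,
)

# The record file is JSONL; pull out the final report entry that carries the
# aggregate metrics (adjust to the record format of your evals version).
with open(RECORD_PATH) as f:
    events = [json.loads(line) for line in f if line.strip()]
final_report = next(e for e in reversed(events) if e.get("final_report"))

# Upload the report so others can see the result without re-running the eval.
requests.post(EVAL_SHARE_URL, json=final_report, timeout=30)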

pocca2048 avatar Jun 15 '23 04:06 pocca2048

That’s a great idea. I have been thinking of starting a website for sharing and comparing evaluation results, but I haven’t found anyone who is interested in collaborating with me. Do you want to join me in this project? I think it would be very useful for the OpenAI community. The website could be called openaievals.com and it could be a valuable resource for managers and other stakeholders who want to use AI to replace humans and need to understand the limitations of AI. We could also provide feedback and annotations for the eval failures to help understand the models.

jjyuhub avatar Jun 21 '23 14:06 jjyuhub

I like the idea as well. It would be good to have prior results accessible publicly and not to require people to spend money again on reproducing them.

A few things to consider when designing such a feature, which came spontaneously to my mind (a sketch of such a submission record follows this list):

  • Models change over time: Some model names are aliases, e.g. gpt-4 currently points to gpt-4-0314 and will soon point to gpt-4-0613. Results will differ across versions. Submitted results should always contain the exact model version.
  • Datasets can change over time: It would be good to include a sha, commit id, or other versioning identifier for the datasets used in the submitted evaluation.
  • The evaluation code changes over time: It would be good to include the package version / commit id of the running environment.
  • Evaluations can fail: Only those without errors should get submitted.
  • Evaluations are not deterministic: Running the same evaluation repeatedly might give different results. (Even setting temperature to 0 does not guarantee the same output.) Thus such a result page should be able to show many (hundreds of) evaluations for the same set and aggregate them in a helpful way: avg., best, etc.
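
A minimal sketch of a submission record covering these points, together with a simple aggregation over repeated runs; all names are illustrative and assume nothing about the eventual service:

# Illustrative only: a submission record covering the versioning concerns above,
# plus simple aggregation over repeated runs of the same eval.
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalSubmission:
    eval_name: str          # e.g. "test-match"
    model: str              # exact snapshot, e.g. "gpt-4-0314", never just "gpt-4"
    dataset_revision: str   # sha / commit id of the dataset used
    evals_version: str      # package version or commit id of the evals code
    completed_without_errors: bool
    accuracy: float


def aggregate(submissions: list[EvalSubmission]) -> dict:
    """Summarize many runs of the same eval/model/dataset combination."""
    # Only error-free runs should count toward the shared results.
    scores = [s.accuracy for s in submissions if s.completed_without_errors]
    return {
        "runs": len(scores),
        "avg": mean(scores) if scores else None,
        "best": max(scores, default=None),
        "worst": min(scores, default=None),
    }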

peldszus avatar Jun 22 '23 14:06 peldszus

Thank you for your reply. I liked and agreed with most of the suggestions, except for the last one. I think average/best/worst might not be the optimal metrics for Chain-of-Thought prompting results. As I once discussed with some people from Weights&Biases, explainable AI failures can contain interesting insights with distinguishing features that could be clustered and annotated for this form of “non-deterministic artifact evaluation”.


jjyuhub avatar Jun 22 '23 15:06 jjyuhub

Wanted to share (owners feel free to delete): you can run some OpenAI evals and share results using this project we’ve worked on: https://github.com/ianarawjo/ChainForge. On the web version https://chainforge.ai/play/, just click Share and it’ll generate a unique link. There are OpenAI evals you can start from as example evaluations.

ianarawjo avatar Jul 06 '23 17:07 ianarawjo

@jjyuhub I believe this to be an excellent proposal. My primary interest lies in the analysis and identification of the root causes behind these evaluation results. I'm confident that this website will significantly contribute to our community, stimulating substantial discussion and engagement among others.

yayachenyi avatar Jul 07 '23 09:07 yayachenyi

> @jjyuhub I believe this to be an excellent proposal. My primary interest lies in the analysis and identification of the root causes behind these evaluation results. I'm confident that this website will significantly contribute to our community, stimulating substantial discussion and engagement among others.

I’m happy you like the idea, and I’d be glad to discuss it with you further when you have time. How about we schedule a meeting one day?

I’m currently thinking about the technology stack:

For the backend:

Python: Because the OpenAI Evals tool is written in Python, because there is a large number of Python-based data science libraries, and because OpenAI’s Codex and GPT-4 were trained primarily with Python in mind and seem to be more reliable at fixing bugs in Python than in other programming languages.

Django over Flask, unlike ChainForge: Because I prefer Django’s built-in admin panel. It allows me to easily manage the data and the users of the website without having to install additional packages. I also think that GPT-4 would learn better from the 307K Django Stack Overflow questions than the 53K Flask ones, as they cover more topics and scenarios related to web development.

As a side note, I would also like to focus more on the root causes of different results for the same prompt, rather than on the differences between prompts, and to put more emphasis on proposals for advancing the state of the art towards AGI.

PostgreSQL: Because I like its wide range of supported data types, such as JSON for storing the evals and the native array type for storing their category tags. A sketch of what such models could look like follows below.
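
A minimal sketch of Django models on PostgreSQL along those lines; the model and field names are made up for illustration and assume a project configured with the postgres database backend:

# Illustrative Django models for the proposed site (names are hypothetical).
from django.contrib.postgres.fields import ArrayField
from django.db import models


class Eval(models.Model):
    name = models.CharField(max_length=200, unique=True)  # e.g. "test-match"
    # Native PostgreSQL array for the category tags mentioned above.
    category_tags = ArrayField(models.CharField(max_length=50), default=list)


class EvalResult(models.Model):
    eval = models.ForeignKey(Eval, on_delete=models.CASCADE, related_name="results")
    model_version = models.CharField(max_length=100)    # exact snapshot, e.g. "gpt-4-0314"
    dataset_revision = models.CharField(max_length=64)  # sha / commit id of the dataset
    evals_version = models.CharField(max_length=64)     # version of the evals package
    # Raw report stored as JSON so the schema can evolve without migrations.
    report = models.JSONField()
    submitted_at = models.DateTimeField(auto_now_add=True)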

For the frontend: JavaScript (codex.js 😉). For the visualisations, unlike ChainForge, I think I’ll let GPT-4 randomly generate visualisations using a wide range of JavaScript libraries such as d3.js, chart.js, React and so on, and then let testers rank them from worst to best for optimal results. All of this with a mobile-first design in mind, unlike ChainForge, which seems to block mobile browsers for now; that is annoying, as many users are on mobile.

Please let me know if you have any questions or suggestions. I appreciate your feedback and support.

jjyuhub avatar Jul 07 '23 11:07 jjyuhub

Hello Everyone! :wave:

I'm thrilled to introduce a new initiative that I believe will be beneficial for our community.

I have been working on the oaievals-collector project, a tool designed to streamline the process of evaluations. The project has been designed with Kafka, InfluxDB, Loki, and TimescaleDB (a time-series extension of PostgreSQL) in mind to reduce the barrier to entry.

To enhance its functionality, I've added a PR which enables exporting to an HTTP endpoint. You can check out the PR here: PR Link.

This initiative aims to compile and share evaluation results with the wider community in a user-friendly manner. By offering an easy way to contribute and view evaluation data, we can better understand and improve our models. :rocket: :bar_chart:
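
As a rough illustration of the time-series side of this (not the collector's actual API; the URL, token, bucket, and measurement names below are placeholders), a single eval accuracy could be written to InfluxDB roughly like this:

# Hypothetical sketch: store one eval result as a time-series point in InfluxDB.
# The connection details and field names are placeholders, not the
# oaievals-collector's actual configuration.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

point = (
    Point("eval_result")
    .tag("eval", "test-match")
    .tag("model", "gpt-3.5-turbo-0613")  # exact model snapshot as a tag
    .field("accuracy", 0.92)
)
write_api.write(bucket="evals", record=point)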

I encourage everyone to explore the oaievals-collector repository, leave comments on the PR, and try out the tool. Your thoughts, feedback, and questions are more than welcome!

CC: @jjyuhub, @ianarawjo

Looking forward to hearing from you!

Cheers, :beers:

nstankov-bg avatar Jul 11 '23 20:07 nstankov-bg