Assist with dot evals

Open evanameyer1 opened this issue 8 months ago • 4 comments

What is it?

Help Carl define an exhaustive list of evals on dot.

I'll pull inspiration from some of the eval datasets mentioned in the papers we've read (BIRD, Spider).

evanameyer1 avatar Apr 30 '25 22:04 evanameyer1

@ccerv1 Let me know when you can meet so we can sync. I'm currently putting together some basic evals based on what I see you already have in dot and on some of my own research, but I want to make sure we don't overwrite each other by accident.

evanameyer1 avatar May 01 '25 22:05 evanameyer1

I went ahead and created a spreadsheet where we can store our evals for now: here

I'm just using a pretty simple category system so that we can keep track of where our evals are focused and ensure we build an exhaustive set. I added yours from dot into the spreadsheet, and then added some of my own. I can easily add them into dot as well once we give the "ok" on any of them. I just didn't want to add them in before talking to you!

It was inspired by:

  • https://bird-bench.github.io/
  • https://spider2-sql.github.io/

Happy to move it to another medium as well; I figured this would be good enough in the interim.

evanameyer1 avatar May 02 '25 02:05 evanameyer1

Nice! I like the categorization. Is there any way we can just put this into Phoenix so that there's only 1 source of truth? The spreadsheet will become out of date quickly. Let's discuss this week. In the meantime, can you move the sheet into the Team shared drive?

ryscheng avatar May 04 '25 19:05 ryscheng