Assist with dot evals
What is it?
Help Carl define an exhaustive list of evals on dot.
I'll pull inspiration from some of the eval datasets mentioned in the papers we've read (BIRD, Spider).
@ccerv1 Let me know when you can meet so we can sync. I'm currently putting together some basic evals based on what I see you already have in dot and on some of my own research, but I want to make sure we don't overwrite each other by accident.
I went ahead and created a spreadsheet where we can store our evals for now: here
I'm using a simple category system so that we can keep track of where our evals are focused and ensure we build an exhaustive set. I added yours from dot into the spreadsheet, and then added some of my own. I can easily add them into dot as well once we give the "ok" on any of them. I just didn't want to add them before talking to you!
It was inspired by:
- https://bird-bench.github.io/
- https://spider2-sql.github.io/
Happy to move it to another medium as well, I figured this would be good enough in the interim.
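To make the category idea concrete, here is a minimal sketch of how per-category coverage could be checked once the evals live somewhere scriptable. The category names, the `Eval` record, and the placeholder questions are all made up for illustration; they are not the ones in the spreadsheet or in dot.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical categories, for illustration only (not the spreadsheet's real ones).
CATEGORIES = ["simple_lookup", "aggregation", "join", "time_series"]


@dataclass
class Eval:
    """One eval entry: the question asked and the category it is filed under."""
    question: str
    category: str


def coverage(evals, categories=CATEGORIES):
    """Count how many evals fall into each known category (zero if none)."""
    counts = Counter(e.category for e in evals)
    return {c: counts.get(c, 0) for c in categories}


def missing_categories(evals, categories=CATEGORIES):
    """Return the categories that have no evals yet, in declaration order."""
    return [c for c, n in coverage(evals, categories).items() if n == 0]


# Placeholder entries; real questions would come from the sheet or from dot.
evals = [
    Eval("placeholder question A", "simple_lookup"),
    Eval("placeholder question B", "aggregation"),
    Eval("placeholder question C", "aggregation"),
]

print(coverage(evals))
print(missing_categories(evals))
```

Something like this would flag empty categories at a glance, which is the part that is hard to keep track of by eye in a sheet.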
Nice! I like the categorization. Is there any way we can just put this into Phoenix so that there's only one source of truth? The sheet will become out of date quickly. Let's discuss this week. In the meantime, can you move the sheet into the Team shared drive?