Julien Cornebise
Julien Cornebise
TLDR sanity check "evaluation": - Valid HTML - Has 5 paragraphs - Has a title - Each paragraph has its own individual eval, which is in the prompt itself a...
Examples of stability accross calls to Claude 3.5 Sonnet on the [Bowling Green report](https://pol.is/report/r2xcn2cdbmrzjmmuuytdk):     and on New Zealand report 
Example of Gemini Advanced output not meeting the expectation (with the caveat that the prompt was developed against Claude 3.5 Sonnet, so not entirely shocking it doesn't port):  And...
Feedback from @DZNarayanan on the stability of evaluations in the Bowling Green example: while it appears stable to a general eye, for a specialist who knows that conversation quite well...
Great, thanks! On Sat, Jun 29, 2024, 22:51 sepro ***@***.***> wrote: > Thank you for your PR. I have added the fix to the general clean-up PR, > which will...
For context: I was (thinking I was) using this functionality when tracing computations as part of #1893, to make sure I was using only the voting matrix regardless of whatever...
You absolutely do, they're really fun to read, and show the passion that went into the code! It's a treat :) Thanks for all the info and the quick reply!...
Thanks @michielbakker ! Agreed. I've added a simple test for the shapes with some specific values, and will finish later today adding the actual check that the p-values match up.
Yeaaah, so, I'm adding the p-values check, and I'm getting some "p-values" > 1 😆 That's because , in the notebook code, the actual p-values get multiplied by $R_v(g,c)$ before...
As I mentioned, I do not vouch for the math nor the implementation. But it's _a_ number. It computes _something_, that is technically shippable if we put a Docker container...