Julien Cornebise

133 comments by Julien Cornebise

TL;DR sanity-check "evaluation":

- Valid HTML
- Has 5 paragraphs
- Has a title
- Each paragraph has its own individual eval, which is in the prompt itself a...
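
The first three checks above are mechanical enough to script. What follows is a minimal sketch using only Python's standard-library `html.parser`, under stated assumptions: the output is an HTML fragment, "has a title" means an `<h1>` or `<title>` element is present, and "valid HTML" is approximated as "every non-void tag is explicitly closed in the right order". That is far weaker than a real validator, and stricter in one respect, since HTML allows some end tags such as `</p>` to be omitted. The names `StructureParser`, `sanity_check` and `paragraph_count_ok` are hypothetical, and the per-paragraph evals embedded in the prompt are not covered.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureParser(HTMLParser):
    """Tracks tag nesting and counts paragraphs and title elements."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open non-void tags
        self.balanced = True   # becomes False on a mismatched end tag
        self.paragraphs = 0
        self.has_title = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == "p":
            self.paragraphs += 1
        if tag in ("h1", "title"):
            self.has_title = True
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False


def sanity_check(html, expected_paragraphs=5):
    """Run the structural checks and return a dict of booleans."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return {
        "valid_html": parser.balanced and not parser.stack,
        "paragraph_count_ok": parser.paragraphs == expected_paragraphs,
        "has_title": parser.has_title,
    }


report = "<h1>Summary</h1>" + "".join(f"<p>Point {i}</p>" for i in range(5))
print(sanity_check(report))
# → {'valid_html': True, 'paragraph_count_ok': True, 'has_title': True}
```

Returning a dict of named booleans rather than a single pass/fail makes it easy to see which check broke when comparing several calls to the same model.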

Examples of stability across calls to Claude 3.5 Sonnet on the [Bowling Green report](https://pol.is/report/r2xcn2cdbmrzjmmuuytdk):
![image](https://github.com/user-attachments/assets/75bc82c3-6ae0-4a9e-969b-3198800864f1)
![image](https://github.com/user-attachments/assets/67995d1d-85a1-47a7-bce2-037cfcf2b6e2)
![image](https://github.com/user-attachments/assets/64fc2e53-c21d-43b9-857d-d17561ae5337)
![image](https://github.com/user-attachments/assets/c4b381d6-a021-4cff-b67b-c795f8a47af9)
and on the New Zealand report:
![image](https://github.com/user-attachments/assets/ced14fae-fac8-4735-97e4-3528ec483329)

Example of Gemini Advanced output not meeting expectations (with the caveat that the prompt was developed against Claude 3.5 Sonnet, so it is not entirely shocking that it doesn't port): ![image](https://github.com/user-attachments/assets/bd9c0a4b-aa1b-4693-ab52-326b9dcee073) And...

Feedback from @DZNarayanan on the stability of evaluations in the Bowling Green example: while it appears stable to a general eye, for a specialist who knows that conversation quite well...

Great, thanks!

On Sat, Jun 29, 2024, 22:51 sepro wrote:

> Thank you for your PR. I have added the fix to the general clean-up PR, which will...

For context: I was (thinking I was) using this functionality when tracing computations as part of #1893, to make sure I was using only the voting matrix regardless of whatever...

You absolutely do! They're really fun to read and show the passion that went into the code! It's a treat :) Thanks for all the info and the quick reply!...

Thanks @michielbakker! Agreed. I've added a simple test for the shapes with some specific values, and will finish adding the actual check that the p-values match up later today.
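
A hedged sketch of what such a test could check, using only the standard library: it compares a computed table of p-values against reference values, first checking the shape, then that each entry lies in $[0, 1]$, then closeness within a tolerance. The helper name `assert_pvalues_match`, the list-of-rows layout (assumed here to be one row per group, one column per comment) and the default tolerances are illustrative assumptions, not the project's actual test code.

```python
import math


def assert_pvalues_match(computed, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Compare a computed p-value table to reference values.

    Both arguments are lists of rows (hypothetically one row per group,
    one column per comment). Raises AssertionError on the first mismatch.
    """
    # 1. Shape: same number of rows, and each row the same length.
    if len(computed) != len(expected):
        raise AssertionError(f"{len(computed)} rows, expected {len(expected)}")
    for i, (row_c, row_e) in enumerate(zip(computed, expected)):
        if len(row_c) != len(row_e):
            raise AssertionError(
                f"row {i}: {len(row_c)} columns, expected {len(row_e)}")
        for j, (c, e) in enumerate(zip(row_c, row_e)):
            # 2. Range: a genuine p-value is a probability.
            if not 0.0 <= c <= 1.0:
                raise AssertionError(f"p[{i}][{j}] = {c} is outside [0, 1]")
            # 3. Values: match the reference within tolerance.
            if not math.isclose(c, e, rel_tol=rel_tol, abs_tol=abs_tol):
                raise AssertionError(f"p[{i}][{j}] = {c}, expected {e}")


# Passes silently on matching tables; raises on shape, range, or value errors.
assert_pvalues_match([[0.01, 0.50], [0.20, 0.90]], [[0.01, 0.50], [0.20, 0.90]])
```

Checking the range before closeness means an out-of-range entry is reported as such even if the reference table happens to contain the same invalid number.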

Yeaaah, so, I'm adding the p-values check, and I'm getting some "p-values" > 1 😆 That's because, in the notebook code, the actual p-values get multiplied by $R_v(g,c)$ before...
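
Why that breaks a range check: if $p \in [0, 1]$ is multiplied by a factor $R_v(g,c) > 1$, the product exceeds 1 whenever $p > 1 / R_v(g,c)$. The small sketch below makes that concrete. The numbers are made up purely for illustration (they are not taken from any Polis report), and the function name `out_of_range_after_scaling` is hypothetical; it just flags which scaled entries would fail a `0 <= p <= 1` check.

```python
def out_of_range_after_scaling(pvalues, ratios):
    """Return indices where p * R is no longer a valid probability."""
    scaled = [p * r for p, r in zip(pvalues, ratios)]
    return [i for i, s in enumerate(scaled) if not 0.0 <= s <= 1.0]


# Illustrative values only: valid p-values, scaled by factors above 1.
pvalues = [0.04, 0.30, 0.60]
ratios = [1.5, 2.0, 2.0]
print(out_of_range_after_scaling(pvalues, ratios))  # → [2]
```

Here only the last entry fails: $0.60 \times 2.0 = 1.2$, whereas $0.30 \times 2.0 = 0.6$ stays in range, so small p-values can survive the scaling while larger ones cannot.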

As I mentioned, I vouch for neither the math nor the implementation. But it's _a_ number. It computes _something_ that is technically shippable if we put a Docker container...