Introduce an AST-differ that also gives metrics
The following Java test outputs are equally good:
```java
package com.eval;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

class PlainTest {
    @Test
    void testPlain() {
        assertDoesNotThrow(() -> Plain.plain());
    }
}
```
```java
package com.eval;

import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.Test;

class PlainTest {
    @Test
    void testPlain() {
        Plain.plain();
    }
}
```
This one is not:
```java
package com.eval;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;

class PlainTest {
    @Test
    void testPlain() {
        Plain.plain();
        assertTrue(true);
    }
}
```
This one absolutely is not:

```java
package com.eval;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.*;

class PlainTest {
    @Test
    void plainTest() {
        Plain.plain(); // Calling the method to achieve 100% code coverage
        assertTrue(true); // Adding an assertion to make the test valid
    }
}
```
We can diff these snippets at the AST level. Formatting is something we do not care about, but if the ASTs are practically the same, we can say the snippets are equal.
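To make this concrete, here is a minimal sketch of such a comparison. It assumes the JavaParser library (the issue does not prescribe a parser) and treats "practically the same" as structural equality after stripping comments, so formatting and comment-only differences are ignored; looser normalizations (e.g. import order) could be layered on top.

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.comments.Comment;

public class AstDiff {
    // Parses both sources and compares the resulting trees node-by-node.
    // Formatting differences disappear in the parse; comments are removed
    // explicitly so that comment-only changes also compare as equal.
    static boolean practicallyEqual(String sourceA, String sourceB) {
        CompilationUnit a = StaticJavaParser.parse(sourceA);
        CompilationUnit b = StaticJavaParser.parse(sourceB);
        a.getAllContainedComments().forEach(Comment::remove);
        b.getAllContainedComments().forEach(Comment::remove);
        return a.equals(b); // structural equality of the two ASTs
    }
}
```

Note that this exact-equality check would still treat the two "equally good" snippets above as different (they assert in different ways), which is where a corpus of accepted variants comes in.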
- [ ] We want to compare ASTs and build a corpus for every file in our test cases so we can compare easily
- [ ] We want to add new comparisons easily, and redo the scoring of the whole evaluation, e.g. adding X should give all LLMs a better score when they have X
- [ ] With that we can also identify if only comments were added
- [ ] Sidenote: `assertTrue(true)` can be found with a linter (see the sketch after this list)
- [ ] Doing the comparisons also showed that an interactive mode for comparing results would be nice, e.g. I say I want to look at model X with language Y, then the interactive mode gives me the logs and I say "add to corpus" or "next"
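For the sidenote, a linter-style check for the trivial assertion is straightforward on the same AST. This is a hypothetical sketch, again assuming JavaParser: it flags any literal `assertTrue(true)` call.

```java
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.expr.MethodCallExpr;

public class TrivialAssertLinter {
    // Returns true if the source contains a literal assertTrue(true) call,
    // i.e. an assertion that can never fail.
    static boolean hasTrivialAssert(String source) {
        CompilationUnit cu = StaticJavaParser.parse(source);
        return cu.findAll(MethodCallExpr.class).stream()
                .anyMatch(call -> call.getNameAsString().equals("assertTrue")
                        && call.getArguments().size() == 1
                        && call.getArgument(0).isBooleanLiteralExpr()
                        && call.getArgument(0).asBooleanLiteralExpr().getValue());
    }
}
```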
@bauersimon thoughts?
related to #44
Not 100% sure what the "corpus" is... basically the perfect solution?