eval-dev-quality

Introduce an AST-differ that also gives metrics

Open zimmski opened this issue 1 year ago • 3 comments

The following two Java test outputs are equally good:

```java
package com.eval;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

class PlainTest {

    @Test
    void testPlain() {
        assertDoesNotThrow(() -> Plain.plain());
    }
}
```

```java
package com.eval;

import static org.junit.jupiter.api.Assertions.*;

import org.junit.jupiter.api.Test;

class PlainTest {

    @Test
    void testPlain() {
        Plain.plain();
    }
}
```

This one is not:

```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class PlainTest {

    @Test
    void testPlain() {
        Plain.plain();
        assertTrue(true);
    }
}
```

This one is absolutely not:

```java
package com.eval;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class PlainTest {

    @Test
    void plainTest() {
        Plain.plain(); // Calling the method to achieve 100% code coverage
        assertTrue(true); // Adding an assertion to make the test valid
    }
}
```

We can diff these outputs at the AST level. We don't care about formatting: if two ASTs are practically the same, we can treat the outputs as equal.
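A rough sketch of what such a comparison could look like, assuming we are fine with parsing via the JDK's own compiler tree API (`com.sun.source`, needs a JDK, not just a JRE). The names `AstFingerprint`, `fingerprint` and `equivalent` are made up for illustration and are not part of eval-dev-quality. It only ignores formatting, comments and import order. It would still treat the first two examples above as different (because of the `assertDoesNotThrow` wrapper), so cases like that need extra normalization rules or a corpus entry.

```java
import com.sun.source.tree.*;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: reduces Java source to a formatting-independent AST fingerprint. */
final class AstFingerprint {

    /** Collects one label per AST node; imports go into a sorted set so their order is ignored. */
    private static final class Collector extends TreeScanner<Void, Void> {
        final List<String> nodes = new ArrayList<>();
        final TreeSet<String> imports = new TreeSet<>();

        @Override
        public Void scan(Tree node, Void unused) {
            if (node == null) {
                return null;
            }
            if (node instanceof ImportTree) {
                ImportTree imp = (ImportTree) node;
                // Relies on javac printing the qualified identifier as dotted source text.
                imports.add((imp.isStatic() ? "static " : "") + imp.getQualifiedIdentifier());
                return null; // imports are handled here, do not descend
            }
            nodes.add("(" + label(node)); // parentheses preserve the nesting of the tree
            super.scan(node, unused);
            nodes.add(")");
            return null;
        }
    }

    /** Node kind plus the name or value that matters for that kind of node. */
    private static String label(Tree node) {
        String kind = node.getKind().name();
        if (node instanceof IdentifierTree) return kind + ":" + ((IdentifierTree) node).getName();
        if (node instanceof MemberSelectTree) return kind + ":" + ((MemberSelectTree) node).getIdentifier();
        if (node instanceof LiteralTree) return kind + ":" + ((LiteralTree) node).getValue();
        if (node instanceof ClassTree) return kind + ":" + ((ClassTree) node).getSimpleName();
        if (node instanceof MethodTree) return kind + ":" + ((MethodTree) node).getName();
        if (node instanceof VariableTree) return kind + ":" + ((VariableTree) node).getName();
        if (node instanceof ModifiersTree) return kind + ":" + ((ModifiersTree) node).getFlags();
        if (node instanceof PrimitiveTypeTree) return kind + ":" + ((PrimitiveTypeTree) node).getPrimitiveTypeKind();
        return kind;
    }

    /** Parses the source (no type resolution) and returns sorted imports plus the node sequence. */
    static String fingerprint(final String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available, a JDK is required");
        }
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///Snippet.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, Collections.singletonList(file));
        Collector collector = new Collector();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                collector.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.imports + " " + collector.nodes;
    }

    static boolean equivalent(String a, String b) {
        return fingerprint(a).equals(fingerprint(b));
    }

    public static void main(String[] args) {
        String plain = "package com.eval;\nimport org.junit.jupiter.api.Test;\n"
                + "class PlainTest {\n    @Test\n    void testPlain() {\n        Plain.plain();\n    }\n}\n";
        // Same code, but reformatted and with a comment added.
        String commented = "package com.eval;\nimport org.junit.jupiter.api.Test;\n"
                + "class PlainTest { @Test void testPlain() { Plain.plain(); // for coverage\n } }\n";
        System.out.println(equivalent(plain, commented));
    }
}
```

Since comments never reach the fingerprint, this would also cover the "only comments got added" case: the raw text differs while the fingerprints match.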

  • [ ] We want to compare ASTs and build a corpus for every file in our test cases so we can compare results easily.
  • [ ] We want to add new comparisons easily and then rescore the whole evaluation, e.g. adding X should give all LLMs a better score when they have X.
  • [ ] With that, we can also identify whether only comments were added.
  • [ ] Side note: assertTrue(true) can be found with a linter.
  • [ ] Doing the comparisons also showed that an interactive mode for comparing results would be nice, e.g. I say I want to look at model X with language Y, then the interactive mode gives me the logs and I say "add to corpus" or "next".
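
To make the corpus and rescoring idea a bit more concrete, here is a minimal sketch under the assumption that every result is reduced to some fingerprint string (for example a normalized AST) and that the corpus maps fingerprints to a verdict. `CorpusRescoring`, `Verdict`, `record`, `addToCorpus` and `rescore` are hypothetical names, not existing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a corpus of known fingerprints that is used to rescore all recorded results. */
final class CorpusRescoring {

    /** Assumed verdicts; a real evaluation would more likely use numeric scores. */
    enum Verdict { GOOD, BAD, UNKNOWN }

    // fingerprint -> verdict, filled by hand or by an interactive "add to corpus" step
    private final Map<String, Verdict> corpus = new LinkedHashMap<>();
    // run identifier (e.g. "model/language/task") -> fingerprint of the generated file
    private final Map<String, String> results = new LinkedHashMap<>();

    void record(String run, String fingerprint) {
        results.put(run, fingerprint);
    }

    void addToCorpus(String fingerprint, Verdict verdict) {
        corpus.put(fingerprint, verdict);
    }

    /** Re-evaluates every recorded run against the current corpus. */
    Map<String, Verdict> rescore() {
        Map<String, Verdict> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : results.entrySet()) {
            scores.put(entry.getKey(), corpus.getOrDefault(entry.getValue(), Verdict.UNKNOWN));
        }
        return scores;
    }

    public static void main(String[] args) {
        CorpusRescoring evaluation = new CorpusRescoring();
        evaluation.record("model-a/java/plain", "fp-1");
        evaluation.record("model-b/java/plain", "fp-1");
        evaluation.record("model-c/java/plain", "fp-2");
        System.out.println(evaluation.rescore()); // nothing in the corpus yet

        // One new corpus entry updates every run that produced the same fingerprint.
        evaluation.addToCorpus("fp-1", Verdict.GOOD);
        System.out.println(evaluation.rescore());
    }
}
```

The point of keeping results and corpus separate is that adding one entry updates every model that produced an equivalent file, without rerunning any model.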

zimmski · Apr 28 '24 16:04

@bauersimon thoughts?

zimmski · Apr 28 '24 16:04

Related to #44.

bauersimon · Apr 29 '24 11:04

Not 100% sure what the "corpus" is... basically the perfect solution?

bauersimon · Apr 29 '24 11:04