P.D. Reiter
Based on @ChrisTimperley's feedback (issue #300).
- Enables a new CLI command (`darjeeling evaluate`)
- Tested, but seeing some issues with test timeouts (currently `ResourceUsageTracker(limits=None)`, so I may need...
- Tested with and without heldout-test content
- IMPACT: Changes the `TestOutcome` such that two cases, (without heldout tests) and (failing heldout tests), are indistinguishable
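To make the IMPACT note concrete, here is a minimal sketch of how the two cases could end up indistinguishable. It does not use Darjeeling's real classes: `OutcomeSketch` and `collapse` are hypothetical names, and the mapping of a missing heldout suite to a failed outcome is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class OutcomeSketch:
    """Hypothetical stand-in; NOT Darjeeling's actual TestOutcome."""
    successful: bool


def collapse(heldout_results: Optional[Sequence[bool]]) -> OutcomeSketch:
    """Reduce heldout results to a single flag (illustrative only)."""
    # Assumption: a missing heldout suite is reported the same way as a failure.
    if heldout_results is None:
        return OutcomeSketch(successful=False)
    # Otherwise the outcome succeeds only if every heldout test passed.
    return OutcomeSketch(successful=all(heldout_results))


no_heldout = collapse(None)                 # no heldout tests supplied
failing_heldout = collapse([True, False])   # one heldout test fails

# Both cases yield equal outcomes, so a consumer cannot tell them apart.
print(no_heldout == failing_heldout)
```

Under these assumptions, a single boolean cannot carry the difference between "not run" and "run and failed"; a separate field or a third state would be needed to tell them apart.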
The evaluation infrastructure needs to be extended to support heldout tests during assessment.