Serbian LLM Benchmark Task

Open DeanChugall opened this issue 4 months ago • 5 comments

Serbian LLM Benchmark Task Configuration and Prompt Functions

Summary:

This pull request introduces task configurations and prompt functions for evaluating LLMs on Serbian datasets. The module includes tasks for:

ARC (Easy and Challenge), BoolQ, Hellaswag, OpenBookQA, PIQA, Winogrande, and the custom OZ Eval dataset.

The tasks are defined using the LightevalTaskConfig class, and prompt generation is streamlined through a reusable serbian_eval_prompt function.
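
A minimal sketch of what a reusable multiple-choice prompt function of this kind could look like. It is not the code from this pull request: `PromptDoc` is a hypothetical stand-in for the document object that a lighteval prompt function returns, the row fields `query`, `choices`, and `answer` are assumed, and the Serbian instruction wording is illustrative.

```python
from dataclasses import dataclass
from string import ascii_uppercase
from typing import List


@dataclass
class PromptDoc:
    """Hypothetical stand-in for the object a prompt function returns."""
    task_name: str
    query: str
    choices: List[str]
    gold_index: int


def serbian_eval_prompt(line: dict, task_name: str = "") -> PromptDoc:
    """Build a multiple-choice prompt in Serbian from one dataset row.

    Assumes each row carries `query`, `choices`, and `answer`
    (the index of the correct choice); these field names are assumptions.
    """
    choices = line["choices"]
    letters = ascii_uppercase[: len(choices)]

    # Instruction wording is illustrative, not taken from the pull request.
    prompt = "Izaberite tačan odgovor na sledeće pitanje.\n\n"
    prompt += f"Pitanje: {line['query']}\n\n"
    for letter, choice in zip(letters, choices):
        prompt += f"{letter}. {choice}\n"
    prompt += "\nOdgovor:"

    return PromptDoc(
        task_name=task_name,
        query=prompt,
        # The model is scored on the answer letter, one per choice.
        choices=[f" {letter}" for letter in letters],
        gold_index=int(line["answer"]),
    )


row = {"query": "Koji je glavni grad Srbije?", "choices": ["Beograd", "Niš"], "answer": 0}
doc = serbian_eval_prompt(row, task_name="example_task")
print(doc.query)
```

Because the function only depends on the row and the task name, the same function can be shared by every task whose dataset follows this row layout.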

Changes:

  1. Task Configurations:

    • Configurations for ARC (Easy and Challenge), BoolQ, Hellaswag, OpenBookQA, PIQA, Winogrande, and OZ Eval tasks using LightevalTaskConfig.
    • Enum class HFSubsets added for dataset subset management, improving code maintainability and clarity.
    • The create_task_config function builds tasks dynamically: the dataset, prompt function, and metrics are passed in as arguments (dependency injection) rather than hard-coded, so each can be swapped independently.
  2. Prompt Functions:

    • The serbian_eval_prompt function creates a structured multiple-choice prompt in Serbian.
    • The query and answer choices are generated dynamically, and the function can be reused across the configured tasks.
  3. Logging:

    • A hello_message banner is printed upon task initialization, listing all available tasks.
    • Task names are dynamically generated and printed using hlog_warn.
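
The pattern described above (an Enum of subsets, a factory that receives its dependencies, and a listing of task names) can be sketched as follows. This is an illustration under stated assumptions, not the pull request's implementation: `TaskConfig` is a hypothetical stand-in for `LightevalTaskConfig` reduced to a few fields, the subset values, repository name, and metric name are placeholders, and `print` stands in for `hlog_warn`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence, Tuple


class HFSubsets(Enum):
    """Dataset subset names; the values are illustrative placeholders."""
    ARC_EASY = "arc_easy"
    ARC_CHALLENGE = "arc_challenge"
    BOOLQ = "boolq"


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical stand-in for LightevalTaskConfig (reduced field set)."""
    name: str
    hf_repo: str
    hf_subset: str
    prompt_function: Callable
    metrics: Tuple[str, ...]


def create_task_config(
    task_name: str,
    prompt_function: Callable,
    hf_repo: str,
    hf_subset: HFSubsets,
    metrics: Sequence[str],
) -> TaskConfig:
    """Build one task; dataset, prompt, and metrics are all injected."""
    return TaskConfig(
        name=task_name,
        hf_repo=hf_repo,
        hf_subset=hf_subset.value,
        prompt_function=prompt_function,
        metrics=tuple(metrics),
    )


def placeholder_prompt(line: dict, task_name: str = "") -> dict:
    """Placeholder prompt function so the sketch is self-contained."""
    return line


# One task per subset; repository and metric names are hypothetical.
TASKS = [
    create_task_config(
        task_name=f"serbian_evals:{subset.value}",
        prompt_function=placeholder_prompt,
        hf_repo="example-org/serbian-eval",
        hf_subset=subset,
        metrics=["accuracy"],
    )
    for subset in HFSubsets
]

# Stand-in for the banner that lists the available tasks.
for task in TASKS:
    print(task.name)
```

Adding a dataset then amounts to adding one Enum member, and a typo in a subset name fails at import time instead of at dataset download.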

Key Features:

  • Modular Design: Task configurations are modular, reusable, and easily extendable to accommodate new datasets and tasks.
  • Improved Readability: The HFSubsets Enum replaces raw subset strings, making dataset subset references easier to read and maintain.
  • Enhanced Flexibility: The create_task_config function simplifies task creation, keeping the task definitions short and consistent.
  • Clear Logging: Logging includes a friendly welcome message and a list of available tasks for easier debugging and interaction.

Future Enhancements:

  • Additional prompt functions can be added for different task types.
  • Unit tests should be written to ensure the integrity of prompt generation and task configuration.

DeanChugall, Oct 03 '24 15:10