lm-evaluation-harness
add context-based requests processing
This PR adds support for a new type of task: context-based tasks.
Motivation. Some tasks and CoT strategies require knowing the model's answer to a previous question in order to form the current request. Until now, such tasks could not be implemented without changing evaluator.py (or a models/*.py file), which makes it impossible to use lm-evaluation-harness as an external library (a user cannot pass a custom evaluator.py instead of the default one when running a task). This PR changes that.
How it works. All requests are split into two meta-groups: regular tasks and context-based tasks, and each group is processed separately. Processing of regular tasks is unchanged. For context-based tasks, after the requests are prepared, each request is updated, run through the model, and then the external storage is updated. If a run contains no context-based tasks, the loop that processes them is never entered, so the workflow for all existing tasks stays the same.
Also, to support the new functionality, a new instance class is added: ContextInstance. It inherits from the regular Instance and adds two new methods; update_request takes the storage and the request and modifies the request right before it is passed to the model, so that the changes are visible with the --log_samples flag. Old tasks that use Instance are not affected. The new class is meant to avoid confusion between instances of regular and context-based tasks.
A new attribute indicates that a task is context-based. It does not need to default to False, because when running all tasks both the presence of this attribute and its value are checked. So no changes are needed to run existing tasks, and there is no way old tasks will be routed through the new loop.
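To give a sense of the shape of the change, here is a minimal sketch of how such a subclass could look. This is an illustrative assumption rather than the PR's actual code: the request_updater attribute and its wiring are placeholders for however the task-level hook ends up being attached.

```python
from dataclasses import dataclass
from typing import Callable, Optional

from lm_eval.api.instance import Instance


@dataclass
class ContextInstance(Instance):
    # hypothetical hook: a task-supplied callable that rewrites the request
    # using the storage of previously collected answers
    request_updater: Optional[Callable] = None

    def update_request(self, storage, request):
        # Called right before the request is passed to the model, so the
        # rewritten arguments are what --log_samples records.
        if self.request_updater is None:
            return request
        return self.request_updater(storage, request)
```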
All tests pass successfully. No changes are needed for the different models. The only issue that may arise is that a new progress bar appears on each call to the model; this can be solved by merging #1569.
Closes: #1432 #1537
@uanu2002
It is also important to define the behaviour of ConfigurableTask so that users can define a context-based task with a YAML file, isn't it?
I have added support for the YAML task format. A context-based task can now be defined via a task.yaml file. ContextInstance is used only for context-based tasks, which are marked by a specific flag inside the task config. Old tasks are not affected and use the original Instance class.
@haileyschoelkopf is there anything I can do to improve the PR and speed up the merging process? If there is anything I can add to make the PR cleaner, the code more efficient, or anything less confusing, I will do it. I am open to ideas and comments :)
@haileyschoelkopf I merged recent updates from main so that it is easier to review the changes I'm proposing.
Left a comment about this PR and the feature in discord!
> Some tasks and CoT strategies may require knowing the answer of the model for the previous question to form the current request.
is this primarily about multi-step / multi-round problems?
@StellaAthena hi! Yes, when first introducing these changes I had multi-step prompting in mind, as described here: https://arxiv.org/pdf/2305.14279.pdf. But I wanted to make the solution flexible enough to also handle multi-round setups (like https://arxiv.org/pdf/2311.07689.pdf, perhaps). The main idea is to let users define for themselves the number of steps/rounds and the internals of the requests, passing the result of the previous request into the current one.
@haileyschoelkopf can you give me more information so that I can develop this PR further? :)
This PR is designed to accommodate primarily different variants of multi-step and multi-round tasks. To that end, it provides flexible functions to update the internal storage (which keeps information about previous requests of the dataset) and to update the current request (which takes information from the storage). With minor changes it can also close #1816 (the number of requests would need to be managed so that a user can add more requests when some condition is met, without affecting other context-based tasks).
I see it working this way. Example of a YAML task:
```yaml
# download the task
task: dataset_y
dataset_path: dataset
dataset_name: dataset_name
# define that it is multiple-choice to compute log-probs
output_type: multiple_choice
# have only train and test splits
training_split: train
test_split: test
# new flag indicating it is a context-based task
context_based: true
# methods used to update requests and storage
request_updater: !function utils._update_request
storage_updater: !function utils._update_storage
# for this task process_docs ensures the order of the instances
process_docs: !function utils.process_docs
# doc['instruction'] has a place to put the context inside
doc_to_text: "{{doc['instruction']}}"
doc_to_target: "{{outputs}}"
doc_to_choice: ["1", "2"]
target_delimiter: " "
should_decontaminate: false
process_results: !function utils.process_results
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```
Examples of the functions:
```python
import numpy as np

# Hypothetical placeholder string inside doc["instruction"] that gets replaced
# with the accumulated context; the real value depends on the dataset template.
CONTEXT_PLACEHOLDER = "<CONTEXT>"


def _update_request(storage, request):
    if not len(storage) and request.doc["meta"]["q_id"] != 0:
        print("No previous responses logged in storage!")
        return request
    if request.doc["meta"]["q_id"] == 0:
        # no update for the first request
        update_ctx = ""
    else:
        # take the context from storage
        update_ctx = storage["string"]
    # create new args for the request to pass into the lm and be logged in the jsonl file
    new_pair = (
        request.arguments[0].replace(CONTEXT_PLACEHOLDER, update_ctx),
        request.arguments[1],
    )
    request.arguments = new_pair
    return request


def _update_storage(storage, request):
    # check whether the dataset is over, in order to clear the storage
    dataset_ends = (
        request.doc["meta"]["set_id"] == 0
        and request.doc["meta"]["q_id"] == 429
        and len(storage.get("candidates", [])) == 1
    )
    # clear the storage after the dataset ends and return
    if dataset_ends:
        return {}
    # update the storage only after running both choices of the same request
    storage.setdefault("candidates", []).append(request.resps[0][0])
    if len(storage["candidates"]) == 2:
        # decide on the answer (higher log-likelihood wins)
        res = ["1", "2"][np.argmax(storage["candidates"])]
        # get the string that accumulates the context
        storage["string"] = storage.get("string", "")
        # append the current question, its choices and the chosen answer
        # ("Ответ" means "Answer"; the example task is in Russian)
        storage["string"] += "\n{question}\n1. {choice1}\n2. {choice2}\nОтвет: {result}".format(
            question=request.doc["inputs"]["question"],
            choice1=request.doc["inputs"]["choice1"],
            choice2=request.doc["inputs"]["choice2"],
            result=res,
        )
        # discard the candidates once both choices of the request have been processed
        storage["candidates"] = []
    return storage
```
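For context, this is roughly how I imagine the harness driving these two hooks during evaluation. It is a simplified sketch under my own naming assumptions (run_llm and the task.request_updater / task.storage_updater attributes are placeholders), not the exact loop in the PR:

```python
def run_context_based_task(task, requests, run_llm):
    """Sequentially process context-based requests, threading a storage dict.

    `run_llm` stands in for whatever callable runs a single request through
    the model and returns its responses.
    """
    storage = {}
    for request in requests:
        # inject answers of previous requests into the current prompt
        request = task.request_updater(storage, request)
        # run the single updated request through the model
        request.resps = run_llm(request)
        # record the new response so later requests can use it
        storage = task.storage_updater(storage, request)
    return requests
```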
@haileyschoelkopf I would gladly incorporate my ideas into existing lm-harness features!
@artemorloff I like the general idea but I'm missing some context here. Could you describe a few types of tasks that would benefit from this type of evaluation? (I've seen a recent issue that provided an example, but I'm wondering if there are other types of tasks that would benefit from this pattern.)
@lintangsutawika thank you for the reply! A lot of works and tasks that use a multi-step/multi-round strategy are emerging. The goal is to let users start creating these tasks right away and then find ways to enhance the approach. Some examples of tasks and ideas using this strategy:
- the arithmetic task described here: https://arxiv.org/pdf/2305.14279
- CoT for benchmark tasks with two-step questions [the model first generates a rationale, then is prompted with the previously generated rationale to produce the final answer]: https://arxiv.org/pdf/2311.12022
- even multi-round setups like this (https://arxiv.org/pdf/2311.07689) can be done with a few changes in the code [the idea is to regenerate the answer for the same request without adding new requests to the list: run as many rounds as one wants and update the instance resps with, say, the last attempt or the most successful one]
- even quite hard things like this (https://arxiv.org/pdf/2402.08702v2) could, I think, be done through lm-harness [this prompt-tuning task seems more multi-round than multi-step, but the idea is that one could even train something with the harness by covering the backward pass and optimizer step in the post-processing of each round]
- the same idea of CoT that makes a model reflect on the task: https://arxiv.org/pdf/2403.14312v1 [originally several LLMs "discuss" the task; in my implementation one model can judge its own previous "thoughts"]
All these works refer to other works, so I think multi-step (multi-round) reasoning is quite important and may enable new tasks in lm-harness.
My PR introduces a quite straightforward approach to the issue. All such tasks are processed one request at a time to avoid possible problems with batches exceeding GPU memory and with repeated batch-size computations. I believe lm-harness can be a good tool for research that develops CoT reasoning through a multi-step (multi-round) strategy.
@lintangsutawika I pushed an update from main to bring the PR up to date. What do you think about it? Is there something I can provide or push to make this PR more ready for merging? :)
Thanks. @haileyschoelkopf and I will take a look after we wrap up the NeurIPS Datasets and Benchmarks deadline.