
Hardcoded incorrect (and repeated) validation example

Open DavidGOrtega opened this issue 2 years ago • 20 comments

👋 I have found this code to be surprisingly incorrect, as my dataset has nothing to do with this validation example. It's also repeated in every single finetuning script.

    # produce an example:
    instruction = "Recommend a movie for me to watch during the weekend and explain the reason."
    fabric.print(instruction)
    sample = {"instruction": instruction, "input": ""}
    prompt = generate_prompt(sample)
    encoded = tokenizer.encode(prompt, device=fabric.device)
    with fabric.init_tensor():
        # do not set `max_seq_length=max_returned_token` because memory is not a concern here
        model.set_kv_cache(batch_size=1)
    output = generate(model, encoded, max_returned_tokens=len(encoded) + eval_max_new_tokens, temperature=0.8)
    model.clear_kv_cache()
    output = tokenizer.decode(output)
    fabric.print(output)

DavidGOrtega avatar Dec 06 '23 17:12 DavidGOrtega

This piece of validation was useful to subjectively evaluate the output of a fine-tuned model on the alpaca dataset as training progressed. Reusing it is on purpose so that you can see how it evolves with more steps.

You are correct that it might be completely irrelevant for other datasets. Feel free to completely remove those lines from your script or change the instruction to be more relevant to your use case.

carmocca avatar Dec 07 '23 12:12 carmocca

why not just take it from your test split?

DavidGOrtega avatar Dec 07 '23 13:12 DavidGOrtega

You could do that, but then it might be different if you shuffle or change your dataset splits

carmocca avatar Dec 08 '23 17:12 carmocca

When I added that prompt to the scripts, I was legit actually looking for a movie to watch and that's why it is there.

awaelchli avatar Dec 09 '23 02:12 awaelchli

You could do that, but then it might be different if you shuffle or change your dataset splits

We are indeed randomly shuffling the dataset when preparing it, but the seed is always the same in every prepare script 😉

When I added that prompt to the scripts, I was legit actually looking for a movie to watch and that's why it is there.

Not questioning that sir but it took me a bit to check that I was not messing with the datasets 😁

DavidGOrtega avatar Dec 09 '23 08:12 DavidGOrtega

@DavidGOrtega What do you think we could do? The requirement is that the lit-gpt scripts stay as simple as possible. We want people to read the entire script and understand it inside and out. The scripts are more like templates that the user adapts to their needs, so we wouldn't want to add too much code here to select a sample from the dataset directly. As Carlos said, that would require instantiating the dataset separately.

Maybe we could add a comment above the code that selects the sample, something like "Fixed example prompt as a sanity check to compare across validation runs, please adapt to your data."?

I would also be fine with removing it if it leads to too much confusion. I just thought that printing some output as a sanity check could help.

awaelchli avatar Dec 10 '23 12:12 awaelchli

Why not just take one example from the eval split? I know that I'm pretty silly, but it took me a couple of minutes to figure out what was happening in the file.

The requirement is that the lit-gpt scripts stay as simple as possible.

Unfortunately, I find the scripts pretty difficult to understand due to obscurity... For example, the preparation scripts:

    full_prompt = generate_prompt(example)
    full_prompt_and_response = full_prompt + example["output"]
    encoded_full_prompt = tokenizer.encode(full_prompt, max_length=max_length)
    encoded_full_prompt_and_response = tokenizer.encode(full_prompt_and_response, eos=True, max_length=max_length)

    # The labels are the full prompt with response, but with the prompt masked out
    labels = encoded_full_prompt_and_response.clone()
    if mask_inputs:
        labels[: len(encoded_full_prompt)] = ignore_index

    return {
        **example,
        "input_ids": encoded_full_prompt_and_response,
        "input_ids_no_response": encoded_full_prompt,
        "labels": labels,
    }

input_ids_no_response is never used by any finetune script, nor by the rest of the example. Should that block of code stay there? If not, people might be confused about why it is set. I also find the bos and eos handling not perfectly explained nor implemented.

IMHO I would have a base preparation script plus dataset-specific parsing functions, and remove all the dead code. I also see too much code repeated there, which might be difficult to maintain.
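
As a rough illustration of what I mean (all of these names are hypothetical, not from the repo), a shared preparation helper plus small dataset-specific prompt builders could look something like this:

    from typing import Callable, Dict

    def prepare_sample(example: Dict, tokenizer, max_length: int, mask_inputs: bool,
                       ignore_index: int, build_prompt: Callable[[Dict], str]) -> Dict:
        """Shared tokenization and label masking; only the prompt construction differs per dataset."""
        full_prompt = build_prompt(example)
        full_prompt_and_response = full_prompt + example["output"]
        encoded_full_prompt = tokenizer.encode(full_prompt, max_length=max_length)
        encoded_full_prompt_and_response = tokenizer.encode(full_prompt_and_response, eos=True, max_length=max_length)

        # labels are the full prompt with response, with the prompt tokens masked out
        labels = encoded_full_prompt_and_response.clone()
        if mask_inputs:
            labels[: len(encoded_full_prompt)] = ignore_index

        return {**example, "input_ids": encoded_full_prompt_and_response, "labels": labels}

    def build_alpaca_prompt(example: Dict) -> str:
        # dataset-specific parsing lives in small functions like this one,
        # here just delegating to the existing generate_prompt helper
        return generate_prompt(example)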

That said, the first thing to understand is that I'm pretty silly and I might have problems where others don't! 🙏

DavidGOrtega avatar Dec 13 '23 20:12 DavidGOrtega

Any thoughts here, @awaelchli?

DavidGOrtega avatar Dec 15 '23 13:12 DavidGOrtega

My thoughts are: feel free to make a concrete suggestion in the form of a PR regarding the sample, or remove it entirely. We appreciate and encourage active contributions if they are within the project's scope.

Unfortunately, I find the scripts pretty difficult to understand by obscurity...

A human error of leaving one redundant line of code in there is nowhere near a deliberate act of obfuscation. Let's just remove this line

"input_ids_no_response": encoded_full_prompt,

and the world is in balance again. If you see dead code elsewhere, please call it out, but we need some concrete evidence first that there are real problems with the code being obscure and hard to understand (what else is there that you are hinting at?). Of course, lit-gpt is an NLP-focused template, so certain terminology is expected to be known. But by all means, if we think that terms like eos and bos are too abbreviated, let's add a comment to explain them at the appropriate places!

The scripts being versions of each other is intentional. Lit-GPT actively tries not to be a framework, and thus favors single-file templates over abstractions, even if that means certain blocks of code are repeated. Think of them as templates!

awaelchli avatar Dec 16 '23 00:12 awaelchli

If you see dead code elsewhere please call it out but we need some concrete evidence first that there are real problems with the code being obscure and hard to understand (like what else is there that you are hinting at?).

I'm pointing it out just because that small section makes you stop and think, and the majority will probably replicate what's not needed (not my case, no worries 🙂). That's against the simplicity objective that we want to achieve in the scripts, isn't it?

The obscurity is related not to obfuscation but to one of the main issues that many ML projects have: knowledge-transfer technical debt. What do I mean by this? Normally we re-implement a paper with all the issues and errors the original author had, and we never question it because we tend to assume the code was written by someone smart. But that doesn't mean the code is good enough, or that the author was careful; they have lots of things to do, and the ones in charge of distilling it properly are those who turn that knowledge into something usable. I'm not saying we have that here, because I haven't read the code that inspired this, if any 🙂

P.S. Isn't eos/bos as True or False quite limited? It doesn't necessarily have to be the tokenizer's; for example, it could be <s> </s>, right? But we can discuss that in another issue 🙂

DavidGOrtega avatar Dec 18 '23 09:12 DavidGOrtega

What does that mean concretely in lit-gpt? I'm still missing concrete examples. I'd like to talk specifics to make progress on the issue.

awaelchli avatar Dec 18 '23 14:12 awaelchli

I can do a PR if you are able to accept it. My proposal for this issue is to take one example from the eval split.

DavidGOrtega avatar Dec 18 '23 14:12 DavidGOrtega

Happy to take a look at your proposal. Since the data comes pre-tokenized, you'll have to change the code to decode it (to print it for the human).
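
Roughly, a sketch of the kind of change that would entail (hypothetical; it assumes the script keeps the validation split around as `val_data` in the pre-tokenized format shown above, including the `input_ids_no_response` field):

    # hypothetical sketch: use the first validation sample instead of the hardcoded prompt
    sample = val_data[0]
    encoded = sample["input_ids_no_response"].to(fabric.device)
    fabric.print(tokenizer.decode(encoded))  # decode the pre-tokenized prompt for the human reader
    with fabric.init_tensor():
        model.set_kv_cache(batch_size=1)
    output = generate(model, encoded, max_returned_tokens=len(encoded) + eval_max_new_tokens, temperature=0.8)
    model.clear_kv_cache()
    fabric.print(tokenizer.decode(output))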

awaelchli avatar Dec 19 '23 14:12 awaelchli

Happy to hear! Let me prepare something.

DavidGOrtega avatar Dec 19 '23 21:12 DavidGOrtega

Did anything ever come of this issue/PR? I notice that the hardcoded example is still in the code, and I also ran into it as a bug.

rhiever avatar Mar 31 '24 23:03 rhiever

This has come up several times already and people are often confused by it (see also https://github.com/Lightning-AI/litgpt/pull/1179#discussion_r1536989995). What if we simply remove this bit, @rasbt, @awaelchli? This is also a challenge for thunder because generation has prefill and next-token stages with different shapes.

carmocca avatar Apr 02 '24 17:04 carmocca

How about we add a "validate_prompt: Your text" option to the configs but leave it empty (disabled) by default? This way users can choose to use it because it can sometimes be useful, but it will be off by default so as not to confuse people.
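
A rough sketch of how that could wrap the existing block (the validate_prompt name and the helper are hypothetical; generate_prompt and generate are the ones already used in the scripts):

    def maybe_print_validation_example(validate_prompt: str, tokenizer, model, fabric, eval_max_new_tokens: int) -> None:
        """Run the subjective sanity-check generation only when a prompt is configured."""
        if not validate_prompt:  # disabled by default (empty string in the config)
            return
        sample = {"instruction": validate_prompt, "input": ""}
        prompt = generate_prompt(sample)
        encoded = tokenizer.encode(prompt, device=fabric.device)
        with fabric.init_tensor():
            model.set_kv_cache(batch_size=1)
        output = generate(model, encoded, max_returned_tokens=len(encoded) + eval_max_new_tokens, temperature=0.8)
        model.clear_kv_cache()
        fabric.print(tokenizer.decode(output))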

rasbt avatar Apr 02 '24 17:04 rasbt

That works, but the counterargument is to keep things simple. If it won't be enabled by default, then I wouldn't have it at all. But if we don't want anybody to have to write any code, then we should have it.

carmocca avatar Apr 02 '24 17:04 carmocca

But if we don't want anybody having to write any code then we should have it.

For my part, I actually like seeing the text generation for a fixed prompt during training because it's useful, so I'd like to have that option in the config files to (re)enable it for selfish reasons 😅

rasbt avatar Apr 02 '24 17:04 rasbt

IMO the proposed solution to use a fixed query from the test split makes a lot of sense.

rhiever avatar Apr 02 '24 17:04 rhiever