
Confusion about log_samples: option to preserve thinking/reasoning traces in model outputs when using --log_samples

Open hhh2210 opened this issue 8 months ago • 9 comments

Problem Description

When using DeepSeek-R1 and other reasoning models that generate thinking chains, the framework automatically strips content from the outputs. Specifically:

In the vLLM backend, the harness calls the postprocess_generated_text function, which removes thinking content based on the think_end_token parameter. The HuggingFace backend applies the same processing.
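For illustration, the stripping described above presumably looks something like the following sketch (the real function is postprocess_generated_text in the harness; this name, signature, and behavior here are simplified assumptions, not the actual code):

```python
# Simplified sketch of think-token stripping as described above.
# NOT the actual lm-eval implementation; token default is an assumption.
def strip_thinking(text: str, think_end_token: str = "</think>") -> str:
    """Drop everything up to and including the last think-end token."""
    if think_end_token in text:
        # Keep only the content after the final reasoning block.
        return text.split(think_end_token)[-1].lstrip()
    return text
```

With behavior like this, the entire chain-of-thought before the end token is lost from the logged sample, which is the problem this issue raises.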

Desired Feature

I would like to request an option (such as --preserve_thinking, or a sub-option of --log_samples) that lets users retain the complete model output, including chain-of-thought content. This is mainly for academic research: analyzing model reasoning processes, studying chain-of-thought quality, and debugging and improving prompt engineering.

Potential Solutions

- Add a new command-line parameter to control whether thinking content is preserved
- Save both raw and processed outputs when logging samples
- Provide an option to leave the think_end_token parameter unset
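The second option above (saving both raw and processed outputs) could be sketched roughly as follows; the record fields and function name are hypothetical, chosen to mirror the resp/filtered_resp naming used later in this thread:

```python
import json

# Hypothetical sketch: log each sample with BOTH the raw and the
# post-processed output, so the chain-of-thought is never lost.
def log_sample(path: str, doc_id: int, raw_resp: str, filtered_resp: str) -> None:
    record = {"doc_id": doc_id, "resp": raw_resp, "filtered_resp": filtered_resp}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Appending one JSON object per line keeps the log in the JSONL shape that --log_samples already produces.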

Definition of the think_end_token parameter: vllm_causallms.py:140
Post-processing function: utils.py:857-883

hhh2210 avatar Jul 31 '25 03:07 hhh2210

If possible, I can write the code myself, just make sure there's no other way to solve the COT trace issue

hhh2210 avatar Jul 31 '25 03:07 hhh2210

Hi! This does sound like something we should support, though the integration is a bit tricky with the current architecture. A quick workaround could be to cache the original generations here (use with --use_cache), rather than after they are truncated. Similar logic for vllm. Cache is a simple sqlitedict defined here, and you can use the tuple of the args from the sample file as keys to map them back.
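To make the keying idea above concrete: the cache is dict-like (a sqlitedict over SQLite), so mapping samples back works by using the request arguments as keys. A plain dict stands in for the sqlitedict in this sketch, and the key format and field names are assumptions for illustration:

```python
# Illustration of the cache-keying idea only; lm-eval's real cache is a
# sqlitedict, and the key construction here is an assumption.
cache = {}

def cache_generation(request_args: tuple, raw_output: str) -> None:
    # Key by the stringified request arguments, caching BEFORE any
    # truncation so the full chain-of-thought can be recovered later.
    cache[str(request_args)] = raw_output

cache_generation(("What is 2+2?", {"max_gen_toks": 256}), "<think>2+2=4</think>4")
```

Reading the same key back out of the cache then recovers the untruncated generation, which can be joined against the sample file.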

baberabb avatar Jul 31 '25 13:07 baberabb

Thanks for your reply! I can test this solution on weekends.

But I also want to ask: if I use an old version of lm_eval, or delete the strip-related functions, would this affect the evaluation results (e.g., Exact Match metrics)? @baberabb

hhh2210 avatar Aug 01 '25 09:08 hhh2210

I found the --use_cache option a bit inconvenient since it involves database handling, so I modified the code to use a different JSONL-based approach. The details are at: https://github.com/EleutherAI/lm-evaluation-harness/pull/3204. What do you think?

hhh2210 avatar Aug 03 '25 04:08 hhh2210

Hello, sorry to bother you. Any updates or advice? @baberabb I'd like to know what the developer team thinks about this issue, and I'm eager to participate in the subsequent feature development and implementation.

hhh2210 avatar Aug 08 '25 07:08 hhh2210

Hello everyone, @baberabb @hhh2210. I'd also like to express my support for this feature. Are there any plans to add it to the development schedule in the near future?

Co-Cl2 avatar Nov 02 '25 02:11 Co-Cl2

Nope. I developed an hf-transformers version that kind of works; you can check it at https://github.com/EleutherAI/lm-evaluation-harness/pull/3204

hhh2210 avatar Nov 03 '25 07:11 hhh2210

@hhh2210 Thank you! But I have temporarily implemented this logic on my own for now, because I also needed to solve another problem, revealed in #3382, at the same time.

Co-Cl2 avatar Nov 04 '25 03:11 Co-Cl2

@Co-Cl2 OK, what's your solution? I think you could make the fork public and maybe raise a PR (even if the code owners don't have time to review it).

hhh2210 avatar Nov 04 '25 03:11 hhh2210

@hhh2210 #3386 is the PR with the solution I came up with. Thank you for your suggestions.

Co-Cl2 avatar Nov 07 '25 07:11 Co-Cl2

@Co-Cl2 Got it! Could you share a few more details on how you handled the logging behavior fix? I'm curious what approach you took and whether it affects metric calculations or how caching interacts with the outputs.

hhh2210 avatar Nov 08 '25 05:11 hhh2210

@hhh2210 Here is a description of the changes I made, as detailed in my PR:

I have deprecated the original think_end_token parameter. Instead, I now pass both think_start_token and think_end_token into the task's config via metadata.

The logic for stripping the Chain-of-Thought (CoT) is now executed within the filter section, using these parameters from the config.

As a result of this implementation:

filtered_resp now holds the result after the CoT stripping and the answer-extraction filter have been applied.

resp remains in its state before the filter logic is executed. (If the original think_end_token parameter was not set, resp is just the raw model output).

This implementation is just for reference; I don't think it's a very good approach 🙂.

In addition, since this behavior is implemented directly within the filter logic, I don't believe it will affect how caching interacts with the outputs.

Regarding the metrics calculator, it is based on filtered_resp. From my observations, filtered_resp is generated in the correct format, so the calculations should be accurate.
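For reference, the filter-stage stripping described above could be sketched roughly like this. This is not the actual code from PR #3386; the class shape, method name, and token defaults are assumptions modeled on how lm-eval filters transform responses:

```python
# Hypothetical sketch of a CoT-stripping filter; NOT the PR #3386 code.
# The tokens would come from the task config's metadata in the real design.
class StripThinkingFilter:
    def __init__(self, think_start_token: str = "<think>",
                 think_end_token: str = "</think>"):
        self.start = think_start_token
        self.end = think_end_token

    def apply(self, resps: list) -> list:
        # `resps` holds the raw model outputs; the raw `resp` is left
        # untouched upstream, and only `filtered_resp` sees this result.
        filtered = []
        for r in resps:
            if self.end in r:
                r = r.split(self.end)[-1].lstrip()
            filtered.append(r)
        return filtered
```

Because the stripping happens after generation, inside the filter pipeline, the cached and logged raw outputs keep the full chain-of-thought, which matches the behavior described above.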

Co-Cl2 avatar Nov 09 '25 04:11 Co-Cl2