Aider benchmark: DeepSeek-Coder-6.7B-Instruct rarely generates SEARCH/REPLACE blocks, leading to very low pass rates
Update 2024-12-11
I have found that when running the Aider benchmark with the DeepSeek-Coder-6.7B-Instruct model, most of the model's responses did not include the SEARCH/REPLACE blocks that the benchmarking program uses to save the code into Python source files and run the unit tests. See this comment.
Original post on 2024-11-29
I got some extraordinarily low results when running the Aider benchmark with the DeepSeek-Coder-6.7B-Instruct model. When I inspected the output files, I was astonished to find that most of them did not contain valid solution code, but only the original function signature followed by a pass statement. What steps did I miss to run the evaluations and get the expected results? Thanks.
My results
Edit mode: diff
{
  "pass_rate_1": 0.9,
  "pass_rate_2": 0.9,
  "percent_cases_well_formed": 100
}
Edit mode: whole
{
  "pass_rate_1": 1.5,
  "pass_rate_2": 1.5,
  "percent_cases_well_formed": 100
}
The output
When I inspected the outputs, I noticed that the majority of the code files had not been edited to contain the correct solution, but were left with just the signature and a pass statement. For example, the isogram test case produced the following isogram.py:
def is_isogram(string):
    pass
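For comparison, after a successful edit the file would contain an actual implementation, for example (an illustrative solution, not taken from any benchmark output):
def is_isogram(string):
    # An isogram contains no repeating letters; case and non-letter characters are ignored.
    letters = [c.lower() for c in string if c.isalpha()]
    return len(letters) == len(set(letters))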
The model's config file
Meanwhile, here is the model's config.json file:
{
  "_name_or_path": "/3fs-jd/prod/deepseek/shared/zhuqihao/public_model/deepseek-coder-7b-instruct2",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 32013,
  "eos_token_id": 32014,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "vocab_size": 32256
}
The bash scripts
run.sh
# The model name matches a model directory on my test machine
# MODEL_NAME="Qwen2.5-Coder-7B-Instruct"
export MODEL_NAME="deepseek-coder-6___7b-instruct"
# export MODEL_NAME="DeepSeek-Coder-V2-Lite-Instruct"
# edit format (`whole` / `diff`)
# export EDIT_FORMAT=whole
export EDIT_FORMAT=diff
export CUDA_VISIBLE_DEVICES="2,3"
TP=2
EVAL_SCRIPT="./evaluate.sh"
MODEL_DIR="/data/models/${MODEL_NAME}/"
OUTPUT_DIR="./results/${MODEL_NAME}/${EDIT_FORMAT}"
bash "${EVAL_SCRIPT}" "${MODEL_DIR}" "${OUTPUT_DIR}" "${TP}"
evaluate.sh
MODEL_DIR=${1}
OUTPUT_DIR=${2}
TP=${3}
MODEL_DIR=${MODEL_DIR:-"./pretrained_models/"}
OUTPUT_DIR=${OUTPUT_DIR:-"./results/"}
mkdir -p ${OUTPUT_DIR}
TP=${TP:-2}
echo $TP
ROOT_DIR="."
bash test.sh "${MODEL_DIR}" ${TP} "${OUTPUT_DIR}/aider"
test.sh
export PATH=./aider/bin:$PATH
export HF_ENDPOINT=http://hf-mirror.com
export HF_HOME=""
export HF_DATASETS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1
export OPENAI_API_BASE=http://0.0.0.0:8000/v1
export OPENAI_API_KEY=token-abc123
export MODEL=$1
export TP=$2
export OUTPUT_DIR=$3
export SERVED_MODEL_NAME=$(basename ${MODEL})
export API_MODEL_NAME=openai/${SERVED_MODEL_NAME}
# Edit format is `whole` or `diff`
# normally it should be passed from `run.sh`
if [ -z "$EDIT_FORMAT" ]; then
    EDIT_FORMAT=diff
fi
mkdir -p ${OUTPUT_DIR}
echo "Starting serving ${MODEL} as ${SERVED_MODEL_NAME}..."
vllm serve ${MODEL} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --trust-remote-code \
    --max-model-len 4096 \
    --dtype auto \
    --api-key token-abc123 \
    > ${OUTPUT_DIR}/vllm-server.txt 2>&1 &
sleep 5
jobs -l > ${OUTPUT_DIR}/jobs.txt
PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
echo "PID: $PID"
echo "Waiting for the model to be served..."
while true; do
    if grep -q 'Uvicorn running on' "${OUTPUT_DIR}/vllm-server.txt"; then
        echo "Model is being served..."
        break
    else
        echo "Waiting for model to start..."
        sleep 1
    fi
done
echo "Benchmarking ${SERVED_MODEL_NAME}..."
python benchmark/benchmark.py ${SERVED_MODEL_NAME} \
    --new \
    --model ${API_MODEL_NAME} \
    --edit-format ${EDIT_FORMAT} \
    --threads 1 \
    > ${OUTPUT_DIR}/log.txt
# extract the required lines from log.txt and use awk to extract the corresponding values
pass_rate_1=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_1' | awk '{print $2}')
pass_rate_2=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_2' | awk '{print $2}')
percent_cases_well_formed=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'percent_cases_well_formed' | awk '{print $2}')
# create JSON-formatted content
json_content=$(cat <<EOF
{
"pass_rate_1": $pass_rate_1,
"pass_rate_2": $pass_rate_2,
"percent_cases_well_formed": $percent_cases_well_formed
}
EOF
)
# write the JSON content to the results.json file
echo "$json_content" > ${OUTPUT_DIR}/results.json
PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
kill ${PID}
It is the repo of qwen2.5-coder; maybe you should submit your issue to ds-coder?
@cyente What "ds-coder" are you referring to? Thanks.
@ytxmobile98 I think you need to set --max-model-len to a larger number, like 8192. BTW, you may check the log file to locate the issues.
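For reference, in test.sh that would mean raising the context length passed to vllm serve, e.g. (a sketch of the change; everything except --max-model-len is unchanged from the original script):
vllm serve ${MODEL} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --trust-remote-code \
    --max-model-len 8192 \
    --dtype auto \
    --api-key token-abc123 \
    > ${OUTPUT_DIR}/vllm-server.txt 2>&1 &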
Looks like --max-model-len does not help much. I ran Aider diff mode with the DeepSeek-Coder-6.7B-Instruct model and --max-model-len=8192, and both passes scored 1.5, only slightly higher than the 0.9 I got the first time.
Update 2024-12-11
@cyente @Hambaobao
I have done some further testing over the past two days with the Qwen2.5-7B-Instruct model and the DeepSeek-Coder-6.7B-Instruct model, and found one key cause:
The benchmarking program relies on the SEARCH/REPLACE blocks to copy code from the chat history into the *.py files. While the output of the Qwen2.5 model mostly follows the expected format, the DeepSeek model tends to respond as if it were solving a regular coding problem rather than producing an edit in the diff format; a well-formed block looks like the sketch below.
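For reference, a well-formed diff-mode response contains a SEARCH/REPLACE block roughly like the following, shown here for the isogram stub from the original post (the replacement body is an illustrative solution, not taken from the benchmark output):
isogram.py
<<<<<<< SEARCH
def is_isogram(string):
    pass
=======
def is_isogram(string):
    letters = [c.lower() for c in string if c.isalpha()]
    return len(letters) == len(set(letters))
>>>>>>> REPLACE
Responses without such blocks cannot be applied to the source files, so the stubs are never replaced and the test cases fail regardless of whether the answer itself is correct.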
Example: accumulate
- Qwen2.5-7B-Instruct output: Qwen2.5-7B-Instruct .aider.chat.history.md
- DeepSeek-Coder-6.7B output: DeepSeek-Coder-6.7B-Instruct .aider.chat.history.md