Aider benchmark: DeepSeek-Coder-6.7B-Instruct rarely generates SEARCH/REPLACE blocks, leading to very low pass rates
Update 2024-12-11
I have found that when running the Aider benchmark with the DeepSeek-Coder-6.7B-Instruct model, most of the model's responses did not include the SEARCH/REPLACE blocks that the benchmarking program uses to save the code into Python source files and run the unit tests. See this comment.
Original post on 2024-11-29
I got some extraordinarily low results when running the Aider benchmark with the DeepSeek-Coder-6.7B-Instruct model. When I inspected the output files, I was astonished to find that most of them did not contain valid solution code, but only the original function signature followed by a pass statement. What steps did I miss to run the evaluations and get the expected results? Thanks.
My results
Edit mode: diff
{
  "pass_rate_1": 0.9,
  "pass_rate_2": 0.9,
  "percent_cases_well_formed": 100
}
Edit mode: whole
{
  "pass_rate_1": 1.5,
  "pass_rate_2": 1.5,
  "percent_cases_well_formed": 100
}
The output
When I inspected the outputs, I noticed that the majority of the code files had not been edited to contain the correct solution, but were left with just the signature and a pass statement. For example, the isogram test case produced the following isogram.py:
def is_isogram(string):
    pass
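For comparison, after a successful edit the file would contain an actual implementation, for example (an illustrative solution, not taken from any benchmark output):
def is_isogram(string):
    # An isogram contains no repeating letters; case and non-letter characters are ignored.
    letters = [c.lower() for c in string if c.isalpha()]
    return len(letters) == len(set(letters))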
The model's config file
Meanwhile, here is the model's config.json file:
{
  "_name_or_path": "/3fs-jd/prod/deepseek/shared/zhuqihao/public_model/deepseek-coder-7b-instruct2",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "bos_token_id": 32013,
  "eos_token_id": 32014,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "factor": 4.0,
    "type": "linear"
  },
  "rope_theta": 100000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "vocab_size": 32256
}
The bash scripts
run.sh
# The model name matches a model directory on my test machine
# MODEL_NAME="Qwen2.5-Coder-7B-Instruct"
export MODEL_NAME="deepseek-coder-6___7b-instruct"
# export MODEL_NAME="DeepSeek-Coder-V2-Lite-Instruct"
# edit format (`whole` / `diff`)
# export EDIT_FORMAT=whole
export EDIT_FORMAT=diff
export CUDA_VISIBLE_DEVICES="2,3"
TP=2
EVAL_SCRIPT="./evaluate.sh"
MODEL_DIR="/data/models/${MODEL_NAME}/"
OUTPUT_DIR="./results/${MODEL_NAME}/${EDIT_FORMAT}"
bash "${EVAL_SCRIPT}" "${MODEL_DIR}" "${OUTPUT_DIR}" "${TP}"
evaluate.sh
MODEL_DIR=${1}
OUTPUT_DIR=${2}
TP=${3}
MODEL_DIR=${MODEL_DIR:-"./pretrained_models/"}
OUTPUT_DIR=${OUTPUT_DIR:-"./results/"}
mkdir -p ${OUTPUT_DIR}
TP=${TP:-2}
echo $TP
ROOT_DIR="."
bash test.sh "${MODEL_DIR}" ${TP} "${OUTPUT_DIR}/aider"
test.sh
export PATH=./aider/bin:$PATH
export HF_ENDPOINT=http://hf-mirror.com
export HF_HOME=""
export HF_DATASETS_OFFLINE=1
export HF_EVALUATE_OFFLINE=1
export OPENAI_API_BASE=http://0.0.0.0:8000/v1
export OPENAI_API_KEY=token-abc123
export MODEL=$1
export TP=$2
export OUTPUT_DIR=$3
export SERVED_MODEL_NAME=$(basename ${MODEL})
export API_MODEL_NAME=openai/${SERVED_MODEL_NAME}
# Edit format is `whole` or `diff`
# normally it should be passed from `run.sh`
if [ -z "$EDIT_FORMAT" ]; then
    EDIT_FORMAT=diff
fi
mkdir -p ${OUTPUT_DIR}
echo "Starting serving ${MODEL} as ${SERVED_MODEL_NAME}..."
vllm serve ${MODEL} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --trust-remote-code \
    --max-model-len 4096 \
    --dtype auto \
    --api-key token-abc123 \
    > ${OUTPUT_DIR}/vllm-server.txt 2>&1 &
sleep 5
jobs -l > ${OUTPUT_DIR}/jobs.txt
PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
echo "PID: $PID"
echo "Waiting for the model to be served..."
while true; do
    if grep -q 'Uvicorn running on' "${OUTPUT_DIR}/vllm-server.txt"; then
        echo "Model is being served..."
        break
    else
        echo "Waiting for model to start..."
        sleep 1
    fi
done
echo "Benchmarking ${SERVED_MODEL_NAME}..."
python benchmark/benchmark.py ${SERVED_MODEL_NAME} \
    --new \
    --model ${API_MODEL_NAME} \
    --edit-format ${EDIT_FORMAT} \
    --threads 1 \
    > ${OUTPUT_DIR}/log.txt
# extract the required lines from log.txt and use awk to extract the corresponding values
pass_rate_1=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_1' | awk '{print $2}')
pass_rate_2=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'pass_rate_2' | awk '{print $2}')
percent_cases_well_formed=$(tail -n 100 ${OUTPUT_DIR}/log.txt | grep 'percent_cases_well_formed' | awk '{print $2}')
# create JSON-formatted content
json_content=$(cat <<EOF
{
"pass_rate_1": $pass_rate_1,
"pass_rate_2": $pass_rate_2,
"percent_cases_well_formed": $percent_cases_well_formed
}
EOF
)
# write the JSON content to the results.json file
echo "$json_content" > ${OUTPUT_DIR}/results.json
PID=$(awk '{print $2}' ${OUTPUT_DIR}/jobs.txt)
kill ${PID}
It is the repo of qwen2.5-coder; maybe you should submit your issue to ds-coder?
@cyente What "ds-coder" are you referring to? Thanks.
@ytxmobile98 I think you need to set --max-model-len to a larger number, like 8192. BTW, you may check the log file to locate the issues.
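For reference, in test.sh that would mean raising the context length passed to vllm serve, e.g. (a sketch of the change; everything except --max-model-len is unchanged from the original script):
vllm serve ${MODEL} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --tensor-parallel-size ${TP} \
    --trust-remote-code \
    --max-model-len 8192 \
    --dtype auto \
    --api-key token-abc123 \
    > ${OUTPUT_DIR}/vllm-server.txt 2>&1 &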
Looks like --max-model-len does not help much. I ran Aider diff mode with the DeepSeek-Coder-6.7B-Instruct model and --max-model-len=8192, and both passes scored 1.5, only slightly higher than the 0.9 I got the first time.
Update 2024-12-11
@cyente @Hambaobao
I have done some further testing over the past two days with the Qwen2.5-7B-Instruct model and the DeepSeek-Coder-6.7B-Instruct model, and found one key cause:
The benchmarking program relies on the SEARCH/REPLACE blocks to copy code from the chat history into the *.py files. While the output of the Qwen2.5 model mostly follows the expected format, the DeepSeek model tends to respond as if it were solving a regular coding problem rather than producing an edit in the diff format; a well-formed block looks like the sketch below.
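For reference, a well-formed diff-mode response contains a SEARCH/REPLACE block roughly like the following, shown here for the isogram stub from the original post (the replacement body is an illustrative solution, not taken from the benchmark output):
isogram.py
<<<<<<< SEARCH
def is_isogram(string):
    pass
=======
def is_isogram(string):
    letters = [c.lower() for c in string if c.isalpha()]
    return len(letters) == len(set(letters))
>>>>>>> REPLACE
Responses without such blocks cannot be applied to the source files, so the stubs are never replaced and the test cases fail regardless of whether the answer itself is correct.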
Example: accumulate
- Qwen2.5-7B-Instruct output: Qwen2.5-7B-Instruct .aider.chat.history.md
- DeepSeek-Coder-6.7B output: DeepSeek-Coder-6.7B-Instruct .aider.chat.history.md