
Results of eval_n_turn do not match the paper

Open hansjohn opened this issue 1 year ago • 0 comments

I ran eval_n_turn.py to reproduce the single-turn SQL results with the handicap option:

python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo \
    --handicap \
    --verbose 

I used this script to compute the success rate:


import json

result_file_path = './logs/experiments/ic_sql_multiturn_gpt-3.5-turbo_1_turns.json'

# Per-hardness counters, plus an 'all' bucket for the overall rate.
result = {key: {'success': 0, 'total': 0} for key in ['easy', 'medium', 'hard', 'extra', 'all']}

with open(result_file_path, 'r') as f:
    data = json.load(f)

# An episode counts as a success only if its best reward is exactly 1.0.
for episode in data.values():
    hardness = episode['hardness']
    if episode['summary']['max_reward'] == 1.0:
        result[hardness]['success'] += 1
        result['all']['success'] += 1
    result[hardness]['total'] += 1
    result['all']['total'] += 1

for key, counts in result.items():
    success, total = counts['success'], counts['total']
    # Skip the percentage for empty buckets to avoid dividing by zero.
    rate = f"{success / total:.2%}" if total else "n/a"
    print(f"{key} Success rate: {success}/{total} ({rate})")

I got this result:

easy Success rate: 202/248 (81.45%)
medium Success rate: 281/446 (63.00%)
hard Success rate: 75/174 (43.10%)
extra Success rate: 37/166 (22.29%)
all Success rate: 595/1034 (57.54%)

This is lower than the result reported in the paper. Did I do something wrong?

I also ran eval_n_turn.py to reproduce the single-turn SQL results without the handicap option:

python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo

Result is here:

easy Success rate: 41/248 (16.53%)
medium Success rate: 28/446 (6.28%)
hard Success rate: 3/174 (1.72%)
extra Success rate: 2/166 (1.20%)
all Success rate: 74/1034 (7.16%)

Did I do something wrong?
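
One way to check whether the low numbers come from near misses rather than outright failures might be to report the mean `max_reward` per hardness level next to the strict success count. The following is only a minimal sketch. It assumes the log has the same shape as in the script above (each entry has `hardness` and `summary.max_reward`) and that rewards lie in [0, 1]. The names `summarize` and `threshold` are hypothetical and not part of intercode, and the sample records are synthetic, not real results.

```python
from collections import defaultdict


def summarize(data, threshold=1.0):
    """Return {hardness: (successes, total, mean_max_reward)}, plus an 'all' row.

    Assumption: every entry looks like
    {"hardness": str, "summary": {"max_reward": number}}.
    """
    buckets = defaultdict(list)
    for episode in data.values():
        reward = float(episode["summary"]["max_reward"])
        buckets[episode["hardness"]].append(reward)
        buckets["all"].append(reward)
    # Buckets are only created when an episode is added, so none is empty.
    return {
        key: (
            sum(r >= threshold for r in rewards),  # episodes counted as success
            len(rewards),                          # episodes in this bucket
            sum(rewards) / len(rewards),           # mean of the best rewards
        )
        for key, rewards in buckets.items()
    }


# Synthetic records for illustration only.
sample = {
    "0": {"hardness": "easy", "summary": {"max_reward": 1.0}},
    "1": {"hardness": "easy", "summary": {"max_reward": 0.5}},
    "2": {"hardness": "hard", "summary": {"max_reward": 0.0}},
}

for key, (success, total, mean_reward) in summarize(sample).items():
    print(f"{key}: {success}/{total} ({success / total:.2%}), mean max_reward {mean_reward:.2f}")
```

With `threshold=1.0` the success count matches the strict `== 1.0` check as long as rewards never exceed 1. A lower threshold would show how many episodes came close.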

hansjohn, Mar 23 '24 09:03