
How to evaluate results after prediction?

Open jiacheng-ye opened this issue 3 years ago • 8 comments

Hi Ohad, thanks for your awesome work! I have a couple of questions about using the code:

  1. How do I directly perform BM25 retrieval and few-shot inference on the validation set (26.0, as shown in Table 3)?
  2. How do I evaluate the results given the predictions?

jiacheng-ye avatar Sep 03 '22 11:09 jiacheng-ye

I've figured out solutions to the above questions. With the default parameters in the codebase, I got 26.15 with BM25.
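
For reference, the BM25 step conceptually does something like the following toy sketch (using the rank_bm25 package purely for illustration; the repo's find_bm25.py may use a different BM25 implementation and data format):

from rank_bm25 import BM25Okapi

# Toy data standing in for the training set.
train_questions = ["turn on the kitchen lights", "what is the weather today", "play some jazz music"]
bm25 = BM25Okapi([q.split() for q in train_questions])

# For each validation question, retrieve the most similar training
# examples to use as in-context demonstrations for few-shot inference.
val_question = "switch on the bedroom lamp"
print(bm25.get_top_n(val_question.split(), train_questions, n=2))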

However, EPR performs even worse (22.9) after training the BERT-based retriever. I ran EPR with python run.py dataset=break dpr_epochs=120 gpus=1 partition=NLP. I'm not sure where it went wrong :( Thanks in advance for your help.

jiacheng-ye avatar Sep 05 '22 03:09 jiacheng-ye

Hey, this might be related to the fact that you are using a single GPU; the DPR setup benefits greatly from a large batch size, since each example in the batch serves as an in-batch negative for the others. The 31.9% LF-EM result from the paper uses 4 GPUs.
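
To illustrate why batch size matters here, a toy sketch of an in-batch-negatives contrastive loss (for illustration only, not the actual DPR training code):

import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_vecs, p_vecs):
    # q_vecs, p_vecs: [batch_size, dim]; row i of p_vecs is the positive
    # passage embedding for question i.
    scores = q_vecs @ p_vecs.T               # [batch_size, batch_size] similarities
    targets = torch.arange(q_vecs.size(0))   # positives sit on the diagonal
    # Each question is classified against batch_size - 1 in-batch negatives,
    # so a larger batch yields more (and harder) negatives per update.
    return F.cross_entropy(scores, targets)

q = torch.randn(120, 768)  # e.g. the batch size of 120 used in the script below
p = torch.randn(120, 768)
print(in_batch_negatives_loss(q, p))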

OhadRubin avatar Sep 05 '22 04:09 OhadRubin

Hi,

Here is the full list of commands:

#!/bin/bash
#SBATCH --job-name=epr_mtop-null_v4
#SBATCH --output=outputs/epr_mtop-null_v4/out.txt
#SBATCH --error=outputs/epr_mtop-null_v4/out.txt
#SBATCH --partition=NLP
#SBATCH --time=12000
#SBATCH --quotatype=reserved
#SBATCH --gres=gpu:2
srun python find_bm25.py output_path=$PWD/data/bm25_mtop-null_a_train.json \
	 dataset_split=train setup_type=a task_name=mtop +ds_size=null L=50 \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun accelerate launch --num_processes 2 --main_process_port 24821 \
	 scorer.py example_file=$PWD/data/bm25_mtop-null_a_train.json \
	 setup_type=qa \
	 output_file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
	 batch_size=8 +task_name=mtop +dataset_reader.ds_size=null \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun python DPR/train_dense_encoder.py train_datasets=[epr_dataset] \
	 train=biencoder_local \
	 output_dir=$PWD/experiments/epr_mtop-null_a_train \
	 datasets.epr_dataset.file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
	 datasets.epr_dataset.setup_type=qa  datasets.epr_dataset.hard_neg=true \
	 datasets.epr_dataset.task_name=mtop datasets.epr_dataset.top_k=5 \
	 +gradient_accumulation_steps=1 train.batch_size=120 \
	 train.num_train_epochs=30 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun python DPR/generate_dense_embeddings.py \
	 model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
	 ctx_src=dpr_epr shard_id=0 num_shards=1 \
	 out_file=$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index \
	 ctx_sources.dpr_epr.setup_type=qa \
	 ctx_sources.dpr_epr.task_name=mtop +ctx_sources.dpr_epr.ds_size=null \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun python DPR/dense_retriever.py \
	 model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
	 qa_dataset=qa_epr ctx_datatsets=[dpr_epr] \
	 datasets.qa_epr.dataset_split=validation \
	 encoded_ctx_files=["$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index_*"] \
	 out_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
	 ctx_sources.dpr_epr.setup_type=qa \
	 ctx_sources.dpr_epr.task_name=mtop datasets.qa_epr.task_name=mtop \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
srun accelerate launch --num_processes 2 --main_process_port 24821 \
	 inference.py \
	 prompt_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
	 task_name=mtop \
	 output_file=$PWD/data/validation_epr_mtop-null_a_train_prede.json \
	 batch_size=10 max_length=1950 \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
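
For completeness, evaluating the resulting predictions file could look roughly like the sketch below; the field names prediction and target are assumptions about the output schema, which isn't documented in this thread, and break is normally scored with LF-EM rather than plain exact match:

import json

with open("data/validation_epr_mtop-null_a_train_prede.json") as f:
    records = json.load(f)

# Exact-match accuracy over whitespace-stripped strings. This approximates
# the mtop metric; break uses LF-EM, which is more permissive.
correct = sum(r["prediction"].strip() == r["target"].strip() for r in records)
print(f"EM: {correct / len(records):.4f}")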

On the mtop dataset, the number of training examples is 95,961. The training loss is around 0.07 after 30 epochs (average loss per batch: 0.071158).

Since I'm using A100 80GB GPUs, two are sufficient for a batch size of 120. In the end, I got 25.19 on break and 50.87 on mtop. Any advice would be helpful 😂

jiacheng-ye avatar Sep 06 '22 10:09 jiacheng-ye

I think dpr_epochs=120 is the correct hyperparameter; the contrastive learning objective improves greatly with more compute. The default of dpr_epochs=30 dates from when I needed to run a large number of experiments. To recreate our results, I believe 120 epochs are necessary.

OhadRubin avatar Sep 06 '22 11:09 OhadRubin

I got 49.17 after training for 120 epochs on mtop; it's still weird... 😂

jiacheng-ye avatar Sep 06 '22 15:09 jiacheng-ye

I will run some tests of my own and try to make sense of this thing. I'll keep you updated!

OhadRubin avatar Sep 06 '22 16:09 OhadRubin

Hi Ohad, do you have any updates? :)

jiacheng-ye avatar Sep 16 '22 02:09 jiacheng-ye

Nice work! Does anyone know where the environment requirements file for EPR is?
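
No requirements file is referenced in this thread, but the commands above imply at least the packages below; a quick sketch for checking what's installed locally (the package list is inferred from the commands, not taken from the repo, and is likely incomplete):

import importlib.metadata as md

# Packages implied by the commands in this thread (inferred, possibly incomplete).
for pkg in ["torch", "transformers", "accelerate", "hydra-core"]:
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")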

RobertMarton avatar Sep 23 '22 09:09 RobertMarton