Position-Focused-Attention-Network
Cannot reproduce the reported results
Hi HaoYang, first of all, I appreciate the article you wrote; the content is very clear. However, using the code you released and the parameters you provided, the best performance of a single model still could not reach the performance reported in your paper. Here are my train.sh and the results of this best model:

CUDA_VISIBLE_DEVICES=0 python train.py --data_path /data_SCAN --data_name coco_precomp --logger_name runs/coco_VSRN --max_violation
Results:
Computing results...
Images: 5000, Captions: 25000
Image to text: 74.5, 94.7, 98.0, 1.0, 2.0
Text to image: 61.8, 88.9, 94.4, 1.0, 5.9
rsum: 512.3  ar: 89.1  ari: 81.7
Image to text: 73.0, 93.8, 96.8, 1.0, 2.5
Text to image: 59.9, 87.2, 93.2, 1.0, 6.1
rsum: 503.9  ar: 87.9  ari: 80.1
Image to text: 75.0, 95.1, 97.6, 1.0, 2.0
Text to image: 60.7, 88.2, 94.1, 1.0, 5.1
rsum: 510.7  ar: 89.2  ari: 81.0
Image to text: 71.3, 94.3, 98.1, 1.0, 2.0
Text to image: 58.8, 87.7, 94.1, 1.0, 5.4
rsum: 504.4  ar: 87.9  ari: 80.2
Image to text: 72.1, 94.3, 97.5, 1.0, 2.1
Text to image: 61.1, 89.3, 94.7, 1.0, 5.4
rsum: 509.0  ar: 88.0  ari: 81.7
Mean metrics:
rsum: 530.4
Average i2t Recall: 80.9
Image to text: 73.2 94.4 97.6 1.0 2.1
Average t2i Recall: 508.0
Text to image: 60.4 88.3 94.1 1.0 5.6
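For anyone comparing these numbers to the paper: as I understand it, each "Image to text" / "Text to image" line is R@1, R@5, R@10, median rank, mean rank, and rsum is the sum of the six recall values, averaged here over the five 1K-image folds of the COCO test set. (The labels in the "Mean metrics" block look shuffled, which I believe is just a quirk of the SCAN-style printing: 530.4 matches the mean ar multiplied by 6 and 508.0 matches the mean of the per-fold rsums.) Below is a rough sketch of the image-to-text side, assuming the standard SCAN-style evaluation with 5 captions per image; the function and variable names are mine, not the repo's:

import numpy as np

def i2t_recall(sims):
    # sims: (n_images, 5 * n_images) similarity matrix, where captions
    # 5*i .. 5*i+4 are the ground truth for image i (SCAN-style evaluation).
    n_images = sims.shape[0]
    ranks = np.zeros(n_images)
    for i in range(n_images):
        order = np.argsort(sims[i])[::-1]                 # captions sorted best-first
        gt = np.arange(5 * i, 5 * i + 5)                  # ground-truth caption ids
        ranks[i] = np.where(np.isin(order, gt))[0].min()  # best-ranked ground truth
    r1 = 100.0 * np.mean(ranks < 1)
    r5 = 100.0 * np.mean(ranks < 5)
    r10 = 100.0 * np.mean(ranks < 10)
    return r1, r5, r10, np.median(ranks) + 1, ranks.mean() + 1

Text-to-image is the same with the roles of images and captions swapped, and rsum = i2t (R@1 + R@5 + R@10) + t2i (R@1 + R@5 + R@10).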
Can you give me some advice on how to reproduce your best model?
Similarly, I tried to reproduce the experiment on the Flickr30K dataset, but I could not reach the best performance reported in the paper.
I used the vocab.py provided in this repo to rebuild the vocabulary and ran the run_train.sh, which is also from this repo.
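In case it helps with debugging: one thing worth checking is that the rebuilt vocabulary has the same size as the one the released results were trained with, since a size mismatch silently changes the word embedding layer. A minimal sanity check, assuming the vocabulary is a pickled object with a word2idx dict as in SCAN/VSE++ (the file name below is just a guess for my setup):

import pickle

# Sanity check on the rebuilt vocabulary (path is hypothetical, adjust to your setup).
with open('./vocab/f30k_precomp_vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)
print('vocab size:', len(vocab.word2idx))  # should match the size used for the paper's model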
The performance I got is as follows:
calculate similarity time: 257.417900085
Sims shape: (1000, 5000)
rsum: 436.2
Average i2t Recall: 80.3
Image to text: 59.1 87.8 93.9 1.0 5.3
Average t2i Recall: 65.1
Text to image: 44.9 71.5 79.0 2.0 19.8
The run_train.sh is as follows:
python train_attention.py --data_path ./data/ --data_name f30k_precomp --vocab_path ./vocab/ --logger_name ./runs/f30k_precomp/ --model_name ./runs/f30k_precomp/ --max_violation --bi_gru --agg_func=Mean --cross_attn=t2i --lambda_softmax=9 --num_epochs=30 --lr_update=15 --learning_rate=.0002 --embed_size=1024 --val_step=2000000 --batch_size=128
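For context on the flags above: --cross_attn=t2i, --lambda_softmax=9 and --agg_func=Mean control the SCAN-style text-to-image attention and how the per-word similarities are pooled (PFAN additionally fuses position features into the region features, if I read the paper correctly). A rough sketch of my understanding of that step, with illustrative names rather than the repo's actual functions:

import torch
import torch.nn.functional as F

def t2i_similarity(regions, words, lambda_softmax=9.0):
    # regions: (n_regions, d) image region features for one image
    # words:   (n_words, d)   word features for one caption
    # Cosine similarity between every word and every region: (n_words, n_regions).
    attn = F.normalize(words, dim=1) @ F.normalize(regions, dim=1).t()
    # Sharpen with the softmax temperature (--lambda_softmax) and attend over regions.
    weights = F.softmax(lambda_softmax * attn, dim=1)
    attended = weights @ regions                                  # (n_words, d)
    # Per-word cosine similarity, then Mean aggregation (--agg_func=Mean).
    word_sims = F.cosine_similarity(words, attended, dim=1)
    return word_sims.mean()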
So did you try it on Flickr30K, @gedaye11? And could you help us reach the best performance? @HaoYang0123