
Post your evaluation score

Open intuinno opened this issue 8 years ago • 39 comments

Hello, everyone,

I got the following score after I ran the coco.

{'CIDEr': 0.50350648251818364, 'Bleu_4': 0.20037826460154334, 'Bleu_3': 0.2920434703847389, 'Bleu_2': 0.42775646056296673, 'Bleu_1': 0.6105274018537202, 'ROUGE_L': 0.43556281782994649, 'METEOR': 0.23890246684760072}

So METEOR is almost the same, but my BLEU scores are 7~8% lower than the paper's. I wonder whether this is acceptable or something is wrong in my process.

Would you please share your results in this post?

Thanks.

intuinno avatar Mar 09 '16 23:03 intuinno

Hi, @intuinno, which hyperparameters did you pick? I tried to run it on the coco data; the algorithm terminated after 15 epochs and achieved the following scores, which are much lower than yours.

Bleu 1: .527 Bleu 2: .333 Bleu 3: .210 Bleu 4: .138 METEOR: .163 ROUGE_L: .403 CIDEr: .371

snakeztc avatar Mar 21 '16 03:03 snakeztc

Hi, @intuinno, @snakeztc, I am running this code on flickr8k, just to test it. I have run it and got the visualization, but I don't know how to get the BLEU and METEOR scores. Could you tell me which script computes them? Please forgive me if I am bothering you.

yaxingwang avatar Mar 25 '16 16:03 yaxingwang

Hi, @intuinno, @snakeztc, @yaxingwang, my flickr8k scores with the parameters in capgen.train() are BLEU = 0.504 / 0.270 / 0.145 / 0.082. My best scores are BLEU = 0.550 / 0.296 / 0.164 / 0.095, obtained with the parameters in eval_coco plus optimizer = rmsprop. My scores are lower than the paper's: BLEU = 0.670 / 0.457 / 0.314 / 0.213.

@yaxingwang metrics.py, or I am using neuraltalk's script (see my repository).

AAmmy avatar Mar 31 '16 08:03 AAmmy

Hi @AAmmy, I also get results similar to yours, but the BLEU-1 you get is better than mine (0.30). Did you normalize the dataset? After normalizing, I got worse results. metrics.py is used to get the scores.

yaxingwang avatar Mar 31 '16 08:03 yaxingwang

@yaxingwang I did not normalize the dataset. My preprocessing (a sketch follows after the list):

  1. center crop images
  2. resize images to 224x224
  3. extract features with VGG_ILSVRC_19_layers

The token file and the train, valid, test splits are the same as the files in Flickr8k_text.zip.
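For illustration, here is a minimal sketch of steps 1-2 above using PIL; the file names are placeholders, and the VGG_ILSVRC_19_layers feature extraction in step 3 is done separately.

from PIL import Image

def center_crop_and_resize(path, size=224):
    # Center-crop to a square, then resize to size x size.
    img = Image.open(path).convert('RGB')
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.BILINEAR)

# Example: prepare one image before feeding it to the VGG19 extractor.
center_crop_and_resize('example.jpg').save('example_224.jpg')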

AAmmy avatar Mar 31 '16 08:03 AAmmy

Yes, we are doing the same. I tried that because my results were poor. For me:

  1. resize so that the shorter side is 256, keeping the aspect ratio (a small sketch follows below): if width > height, then width = (width * resize) / height and height = resize (with resize = 256); otherwise height = (height * resize) / width and width = resize
  2. center crop images to 224x224
  3. extract features from Vgg_layers5_4

I am confused about whether all parameters are the same for the three datasets; I just ran the code offered by @intuinno, and the parameters for the three datasets seem to be the same. Besides, at what epoch does the script stop for you? I got epoch = 79, and I don't know whether it is over-fitting.

yaxingwang avatar Mar 31 '16 08:03 yaxingwang

@yaxingwang I think the parameters in intuinno's evaluate_flickr8k.py are for coco and flickr30k; the parameters for flickr8k and for flickr30k/coco are not the same (Section 5.2 in the paper).

I think the parameters in the original capgen.py file are for flickr8k. (I used these; training stopped around epoch 70.)

I also trained flickr8k with the parameters for coco (the same as intuinno's evaluate_flickr8k.py, so the same as yours?); the scores were BLEU = 0.493 / 0.258 / 0.130 / 0.072, with early stopping at epoch 89 (6-12 hours).

I also changed patience and some other parameters to check for over-fitting; after epoch 89, samples from val seemed to be getting better, but the BLEU score (on test) was getting worse.

AAmmy avatar Mar 31 '16 09:03 AAmmy

@AAmmy, thank you. I am trying to do both flickr30k and coco, but I guess my computer's memory is too small to process flickr30k, so I am still working on it. Did you run into this problem with flickr30k? It raises a MemoryError.

When using epoch = 10 or 20, the results are worse than at the epoch where the script stops early. Still, I think the epoch chosen by early stopping is not the best, since it does not correlate strongly with the scores. Maybe testing different epochs is the way to go.

yaxingwang avatar Mar 31 '16 09:03 yaxingwang

@yaxingwang I have the same memory problem on coco, and the sparse-to-dense conversion is too slow, so I extracted the features into one file per image.

I changed the code and data format as shown below.

caption example:

train_cap = [['a dog running', 'OOO.jpg'], ['dogs running', 'OOO.jpg'],
             ..., ['a cat running', '+++.jpg'], ['cats running', '+++.jpg']]

('a dog running' and 'dogs running' are captions for OOO.jpg, OOO.jpg is the image file name, OOO.jpg.mat will be the feature from OOO.jpg)

In flickr.py or coco.py:

In prepare_data():

# load the target feature file each time
# (loadmat is scipy.io.loadmat; feat_path points to the per-image .mat files)
for cc in caps:
    seqs.append([worddict[w] if worddict[w] < n_words else 1 for w in cc[0].split()])
    feat_list.append(loadmat(feat_path + str(cc[1]) + '.mat')['feats']) # my code
    # feat_list.append(features[cc[1]]) # original code
# OOO.jpg.mat is already a dense matrix, so there is no need for todense()

# y = numpy.zeros((len(feat_list), feat_list[0].shape[1])).astype('float32') # original code
# for idx, ff in enumerate(feat_list): # original code
#     y[idx,:] = numpy.array(ff.todense()) # original code
# y = y.reshape([y.shape[0], 14*14, 512]) # original code
y = numpy.array(feat_list).reshape([len(feat_list), 14*14, 512]).astype('float32') # my code

In load_data():

# only caption files are loaded
train_cap = pkl.load(open(path+'flicker_30k_cap.train.pkl', 'rb'))
train_feat = []
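For completeness, a hedged sketch of how such per-image .mat files could be written; the 'feats' key matches the loadmat call above, and the file name is a placeholder.

import numpy as np
from scipy.io import savemat

# Hypothetical: save one dense 196x512 conv feature map per image so that
# prepare_data() can read it back as loadmat(name + '.mat')['feats'].
feat = np.random.rand(14 * 14, 512).astype('float32')
savemat('OOO.jpg.mat', {'feats': feat})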

Hmm... I will try on different epochs.

AAmmy avatar Mar 31 '16 10:03 AAmmy

@AAmmy , Thanks.

yaxingwang avatar Mar 31 '16 10:03 yaxingwang

I also created my own scripts to prepare the data. I completely skipped the sparse-matrix stuff since I think it's not needed at all. I have a single HDF5 file with CONV5_4 features from the VGG19 network for Flickr30k (around 12GB). This file contains all the image features for all splits in the following order: train, valid, and test. The order of the jpeg files, for matching against the order of the feature matrix, is also available.
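For illustration, a minimal sketch of reading such an HDF5 feature file with h5py; the dataset name 'feats' and the file name are assumptions, not the actual layout.

import h5py
import numpy as np

# Hypothetical layout: one dataset of shape (n_images, 14*14, 512),
# stored in split order train -> valid -> test.
with h5py.File('flickr30k_conv54_feats.h5', 'r') as f:
    feats = f['feats']           # lazy HDF5 dataset, not loaded into RAM
    first = np.array(feats[0])   # one image's 196 x 512 annotation matrix
    print(first.shape)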

I am pretty sure that I am not making any mistake (but apparently I am, since you at least have some results), yet all I get are repetitive phrases of meaningless words, with a BLEU of 0 and a validation loss which doesn't improve at all.

I create the dictionary in a frequency-ordered fashion; 0 is <eos> and 1 is UNK.

I don't know where the problem is at all.

ozancaglayan avatar Apr 01 '16 14:04 ozancaglayan

@intuinno, your results are the closest to the reported coco results; which hyper-parameters did you use?

@kelvinxu, @kyunghyuncho, the paper does not mention the hyper-parameters for the different datasets. Would you mind providing this information? (Plus maybe even the models themselves, which are not too big for a Dropbox/Google Drive file.)

volkancirik avatar Apr 11 '16 22:04 volkancirik

Hi everybody,

I'd like to share my observations and experimentations about the code on Flickr30k dataset:

Preprocessing:

  • I have a separate HDF5 file for the train/dev/test splits containing the convolutional features extracted from the Flickr30k dataset using the VGG19 network. Since the current way of creating a PKL file with captions and sparse matrices is so inefficient (it doesn't even work with Python 2.7 because of a pickle bug with huge files), I directly load those HDF5 files and only keep the tokenized captions and image idxs in the pkl file. I create a dictionary with words occurring >= 3 times, leading to a dictionary of 9584 words.

Feature dimensions:

  • This is specific to how you create your feature file. What is done in the original code, i.e. y.reshape([y.shape[0], 14*14, 512]), was not correct for my feature file, and I was obtaining complete nonsense during training. Ensure that the reshaping is done correctly (see the sketch below).
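A hedged sketch of one correct way to go from channel-first conv5_4 maps to the (N, 196, 512) annotation layout; it assumes features stored as (N, 512, 14, 14), which may not match your file.

import numpy as np

# Hypothetical conv5_4 output for N images, channel-first: (N, 512, 14, 14).
feats = np.random.rand(2, 512, 14, 14).astype('float32')

# Move channels last before flattening the spatial grid, so each of the
# 196 locations keeps its own 512-dimensional descriptor.
annotations = feats.transpose(0, 2, 3, 1).reshape(feats.shape[0], 14 * 14, 512)
print(annotations.shape)  # (2, 196, 512)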

Early stopping with BLEU:

This seems critical and it's mentioned in the paper as well, but unfortunately it is not implemented in the code. The validation loss is not correlated with BLEU or METEOR. I just save the model into a temporary file before each validation and call generate_caps.py to save the hypotheses to a file. I then use the pycocoevalcap utilities to obtain BLEU1-BLEU4 and METEOR scores. After that you can select the metric on which you would like to early stop.
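A minimal sketch of the early-stopping logic described above, assuming the hypotheses and references are already available as {image_id: [caption, ...]} dicts; the variable names are illustrative.

from pycocoevalcap.bleu.bleu import Bleu

def bleu_scores(refs, hyps):
    # refs/hyps: dicts mapping an id to a list of tokenized caption strings
    scores, _ = Bleu(4).compute_score(refs, hyps)
    return scores  # [BLEU1, BLEU2, BLEU3, BLEU4]

best_bleu4, bad_counter, patience = 0.0, 0, 10
# Inside the training loop, at every validFreq updates:
#     bleu4 = bleu_scores(val_refs, val_hyps)[3]
#     if bleu4 > best_bleu4:
#         best_bleu4, bad_counter = bleu4, 0
#         # keep the current parameters as the best model
#     else:
#         bad_counter += 1
#         if bad_counter >= patience:
#             break  # early stop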

Validation:

I normalized the validation loss w.r.t. sequence lengths as well. This seems a better estimate of the validation loss, as the default one is sensitive to the caption lengths in the validation batches.
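A hedged sketch of such a normalization, assuming per-sample negative log-likelihoods and caption lengths are available; this is illustrative, not the exact code.

import numpy as np

def length_normalized_loss(nll_per_sample, caption_lengths):
    # Divide each caption's NLL by its number of tokens before averaging.
    nll = np.asarray(nll_per_sample, dtype='float64')
    lengths = np.asarray(caption_lengths, dtype='float64')
    return float(np.mean(nll / lengths))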

Hyperparameters:

I'm still experimenting, but the best-working system so far had the following parameters (a sketch of a matching train() call follows the list):

n_words: 9584
maxlen: 100
decay_c: 1e-05
alpha_c: 0 (This is 1 in the original code)
use_dropout: False (dropout is enabled by default in the original code)
patience: 10
ctx_dim: 512
dim: 1000 (This is 1800 in the original code)
dim_word: 512
batch_size: 128
optimizer: adam (rmsprop is OK too but adadelta is completely failing)
lstm_encoder: False
n_layers_init: 2
n_layers_att: 2
n_layers_lstm: 1
n_layers_out: 1
ctx2out: True
prev2out: True
selector: True
attn_type: deterministic (didn't try the hard one)
validFreq: 500
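A hedged sketch of how these settings might be passed to capgen.train(); the keyword names are taken directly from the list above, and the dataset argument is an assumption that may differ between code versions.

import capgen

capgen.train(
    n_words=9584, maxlen=100, decay_c=1e-05, alpha_c=0.,
    use_dropout=False, patience=10, ctx_dim=512, dim=1000,
    dim_word=512, batch_size=128, optimizer='adam',
    lstm_encoder=False, n_layers_init=2, n_layers_att=2,
    n_layers_lstm=1, n_layers_out=1, ctx2out=True, prev2out=True,
    selector=True, attn_type='deterministic', validFreq=500,
    dataset='flickr30k')  # dataset kwarg: assumed, check your train() signature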

Results:

I trained a system yesterday with early stopping on BLEU (but this was using the multi-bleu.perl script, which has different dynamics than the pycocoevalcap utilities). I generated the captions with sampling instead of beam search during validation periods. At the end I obtained the following results with the best validation model:

(EDIT: Fixed the results of my system which was for the validation split instead of the test split.)

Description           BLEU1   BLEU2   BLEU3   BLEU4   METEOR
Beam (12)             57.9    39.3    26.9    18.5    17.58
Sampling              61.2    41.4    28.12   19.1    16.77
Paper results (soft)  66.7    43.4    28.8    19.1    18.49
Paper results (hard)  66.9    43.9    29.6    19.9    18.46

Problems:

The main problem are the duplicate captions in the final files:

$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.sampling.dev.txt | wc -l
853
$ sort -u adam-512emb-1000lstm-wdecay-att2-init2-flickr30k-en-bleu.beam12.1best.dev.txt | wc -l
790

So out of 1014 validation images, I can only generate 853/790 unique captions. This seems to be an important problem that I'm facing. The richness of the captions is also quite limited: for the sampling case, I have 497 unique words out of a vocabulary of ~10K words; for beam search, the number is 561.

EDIT: I actually checked the generated captions and the images. Even though there are, for example, 10 instances of "a group of people are standing outside" for 10 different images, it's actually true in terms of scene description: in all of the images there are some people standing outside :) So maybe this is related to the weak diversity of the Flickr30k dataset.

ozancaglayan avatar Apr 12 '16 08:04 ozancaglayan

~~The BLEU results of multi-bleu.perl and pycocoevalcap are very different. I got 65% on BLEU1 with multi-bleu.perl, but bleu.py in pycocoevalcap showed around 50% on the same samples and GTs.~~ Nothing

AAmmy avatar Apr 15 '16 07:04 AAmmy

Hi, @ozancaglayan, could you share your code for the loss normalization process, please?

Validation:

I normalized the validation loss w.r.t. sequence lengths as well.
This seems a better estimate of the validation loss, as the default
one is sensitive to the caption lengths in the validation batches.

AAmmy avatar Apr 22 '16 05:04 AAmmy

Hi @intuinno, would you share the model file trained on Coco? Also, what are your best validation/test costs for Flickr8k and Coco? Thanks.

frajem avatar Apr 22 '16 10:04 frajem

So does anyone get a better score on coco? I used @intuinno's code and got a similar score to his (at the top of this issue) in the end (17 epochs). However, when I calculated the score at epoch 10, it turned out to be better than the 17th epoch: BLEU: 0.6398/0.4518/0.3127/0.218, METEOR: 0.2384.

Lorne0 avatar May 09 '16 01:05 Lorne0

I got BLEU: 0.6887/0.5034/0.3588/0.2547, METEOR: 0.2234 on COCO with the features from http://cs.stanford.edu/people/karpathy/deepimagesent/. The feature size is 4096, so I used them by reshaping to 8x512. However, Flickr8k training failed. I didn't try Flickr30k.

AAmmy avatar May 12 '16 00:05 AAmmy

@AAmmy Hi, I tested your code and got 'Bleu_4': 0.276, 'Bleu_3': 0.367, 'Bleu_2': 0.497, 'Bleu_1': 0.668 with beam_size = 10. Was your result based on a beam_size of 1?

xinghedyc avatar May 21 '16 16:05 xinghedyc

@AAmmy @xinghedyc Could you please explain how you used http://cs.stanford.edu/people/karpathy/deepimagesent/? Did you use it for extracting features?

Lorne0 avatar May 22 '16 08:05 Lorne0

@Lorne0 Hi, what you can download from that website is the COCO dataset package (750MB, http://cs.stanford.edu/people/karpathy/deepimagesent/coco.zip). The vgg_feats.mat contains features extracted with the VGG net (4096 dimensions per image), and the json file contains all the captions. For more details you can read their paper.

xinghedyc avatar May 22 '16 08:05 xinghedyc

@xinghedyc Thank you, but I still don't understand. The feature is 4096-dimensional, and @AAmmy said to reshape it to 8x512, and then? Which of the 8 new feature vectors should I use?

Lorne0 avatar May 22 '16 08:05 Lorne0

@Lorne0 I think 8×512 means 8 annotation vectors, which the paper defines as a = {a_1, ..., a_L}, a_i ∈ R^D; you can refer to Section 3.1.1 in the paper. The original code uses 196 × 512 annotation vectors, so @AAmmy tested 8 annotation vectors in soft-attention mode using the dataset above, and it actually works.
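A minimal sketch of that reshape, assuming a (n_images, 4096) feature matrix like vgg_feats.mat; the variable names are illustrative.

import numpy as np

# Hypothetical fc features: one 4096-dimensional vector per image.
feats = np.random.rand(5, 4096).astype('float32')

# Treat each 4096-d vector as 8 annotation vectors of dimension 512,
# matching the (batch, L, D) layout the soft-attention code expects
# (the original code uses L = 196, D = 512).
annotations = feats.reshape(feats.shape[0], 8, 512)
print(annotations.shape)  # (5, 8, 512)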

xinghedyc avatar May 22 '16 12:05 xinghedyc

@xinghedyc Thank you~ I just ran 3 epochs, but when I use metrics.py I always get IOError: [Errno 32] Broken pipe in pycocoevalcap/meteor/meteor.py. Did you have this problem?

Lorne0 avatar May 22 '16 14:05 Lorne0

@Lorne0 Yes, I also got this problem, so I just commented those scorers out in metrics.py like this:

scorers = [
    (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
    #(Meteor(), "METEOR"),
    #(Rouge(), "ROUGE_L"),
    #(Cider(), "CIDEr")
]

This is because I care more about BLEU, but you could try to fix the problem :)

xinghedyc avatar May 22 '16 14:05 xinghedyc

@xinghedyc I think METEOR is important too. I'll try to fix it, thank you~ @AAmmy, could you help us with this problem?

Lorne0 avatar May 22 '16 14:05 Lorne0

@xinghedyc I think I found the solution: just delete pycocoevalcap/ and clone the newest one :)

Lorne0 avatar May 22 '16 15:05 Lorne0

@Lorne0 OK, I'll check it.

xinghedyc avatar May 22 '16 15:05 xinghedyc

@Lorne0 @xinghedyc My result BLEU: 0.6887/0.5034/0.3588/0.2547, METEOR: 0.2234 is based on beam_size 1. I checked only epoch 19; maybe some other epoch (from 1 to 18) shows a better score. The references were made with the code in scripts.py.

AAmmy avatar May 23 '16 01:05 AAmmy

@AAmmy thanks, I got BLEU-4 23.9 if I use a beam size of 1, but got 27.6 with a beam size of 10. I only trained 11 epochs; maybe more epochs should be trained.

xinghedyc avatar May 23 '16 05:05 xinghedyc