
Tokenize method


It looks like you use the captions tokenized by Karpathy, and I would like to know the tokenization method so I can process my own dataset consistently.

I tried the following method from neuraltalk2 to tokenize the COCO captions, but I can't get the same result as in the link you provided:

import string

def prepro_captions(imgs):
    # preprocess all the captions
    print('example processed tokens:')
    for i, img in enumerate(imgs):
        img['processed_tokens'] = []
        for j, s in enumerate(img['captions']):
            # lowercase, strip punctuation, and split on whitespace
            # (str.maketrans replaces the Python 2-only translate(None, ...))
            txt = str(s).lower().translate(
                str.maketrans('', '', string.punctuation)).strip().split()
            img['processed_tokens'].append(txt)
            if i < 10 and j == 0:
                print(txt)

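For reference, a minimal usage sketch of the function above (the input structure, a list of dicts with a 'captions' field, is inferred from the code; the example caption strings are made up):

# hypothetical input; real data would come from the COCO annotations
imgs = [{'captions': ['A man riding a horse.', 'A person on a horse!']}]
prepro_captions(imgs)
# prints:
# example processed tokens:
# ['a', 'man', 'riding', 'a', 'horse']
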
Thanks.

OneDirection9 · Oct 08 '19