self-critical.pytorch
Tokenize method
It looks like you use the captions tokenized by Karpathy, and I would like to know the tokenization method so I can make my dataset consistent with it.
I tried the following method from neuraltalk2 to tokenize the COCO captions, but I can't get the same result as in the link you provided:
import string

def prepro_captions(imgs):
    # preprocess all the captions (Python 2: translate(None, ...) strips punctuation)
    print 'example processed tokens:'
    for i, img in enumerate(imgs):
        img['processed_tokens'] = []
        for j, s in enumerate(img['captions']):
            txt = str(s).lower().translate(None, string.punctuation).strip().split()
            img['processed_tokens'].append(txt)
            if i < 10 and j == 0: print txt
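For reference, here is a minimal Python 3 sketch of the same per-caption tokenization, assuming the intended behavior is lowercase, strip ASCII punctuation, and split on whitespace (`str.maketrans`/`translate` replaces the Python 2 `translate(None, string.punctuation)` call; the function name `tokenize_caption` is mine):

```python
import string

def tokenize_caption(s):
    # Lowercase, remove ASCII punctuation, and split on whitespace,
    # mirroring the neuraltalk2 preprocessing step above.
    table = str.maketrans('', '', string.punctuation)
    return str(s).lower().translate(table).strip().split()

print(tokenize_caption("A man, riding a horse!"))
# -> ['a', 'man', 'riding', 'a', 'horse']
```

Note that this only strips ASCII punctuation; captions containing Unicode punctuation would need extra handling.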
Thanks.