
Generating inputs to Oscar model

lukerm opened this issue 4 years ago · 15 comments

Hi Oscar Team,

Thanks for the interesting paper and open-sourcing your model.

On your download page, you mention that images are fed into Oscar through the outputs of a "Faster R-CNN with ResNet-101, using object and attribute annotations from Visual Genome". Have you made this model available too? It would be great if you could give a link to this pre-trained model, as it is necessary to run Oscar on my own images (I'm interested in image captioning and VQA).

I have tried to look for it myself, and the closest thing I could find was the R101-FPN model from the Detectron2 model zoo (a PyTorch model). However, that one was trained on the COCO set of object tags, and I understand that Visual Genome has significantly more labels, so surely it would fail to produce the image features that Oscar expects?

I'd be grateful if you could let me know if my thinking is correct and if there is a link to the appropriate PyTorch model for generating inputs that Oscar can use.

Thanks in advance!

lukerm · Dec 18 '20

I used this codebase and its pre-trained models: https://github.com/peteanderson80/bottom-up-attention. It is pre-trained on Visual Genome (VG) with 1600 object tags.

xjli · Dec 18 '20

@xjli Thanks for this! I did, however, have quite a lot of trouble getting this model to run with Caffe. For example, the model will not load in the standard Caffe Docker image.

Do you know of an openly available Docker image where I can run the bottom-up attention model?

lukerm · Dec 21 '20

I personally use this one: https://github.com/airsplay/py-bottom-up-attention

vincentlux · Jan 07 '21

@vincentlux I am planning to use the repo you mentioned above, https://github.com/airsplay/py-bottom-up-attention, to generate features for another dataset I have been working on. I was wondering whether you were able to get performance similar to what is reported in the Oscar paper using the features generated by airsplay's bottom-up-attention repo?

gsrivas4 · Jan 16 '21

I tested on the retrieval task and the performance is ~2 points lower. Did you test it as well? I am not sure whether it is because the location-feature processing is different or the visual features themselves are different. It would be great if @xjli could provide more information on the feature-extraction stage.

vincentlux · Jan 24 '21

> I tested on the retrieval task and the performance is ~2 points lower. Did you test it as well? I am not sure whether it is because the location-feature processing is different or the visual features themselves are different. It would be great if @xjli could provide more information on the feature-extraction stage.

> I personally use this one: https://github.com/airsplay/py-bottom-up-attention

Can I ask how you got the label.tsv from that code?

brightbsit · Feb 09 '21

> @vincentlux I am planning to use the repo you mentioned above, https://github.com/airsplay/py-bottom-up-attention, to generate features for another dataset I have been working on. I was wondering whether you were able to get performance similar to what is reported in the Oscar paper using the features generated by airsplay's bottom-up-attention repo?

Did you train with your own dataset? I am trying to do the same. I managed to get captions for my own images using features from Bottom-Up Attention, but Oscar fails to train on the features of an entirely new dataset. So: inference works, but training does not.

The ExceptionWrapper class tries to re-raise the original error but fails, so the original error is hidden from me. The message is __init__() missing 2 required positional arguments: 'doc' and 'pos' (that signature matches json.JSONDecodeError, whose constructor takes msg, doc and pos), which suggests the underlying cause is malformed JSON.

EByrdS · Feb 17 '21

@brightbsit I also used https://github.com/airsplay/py-bottom-up-attention to generate the feature.tsv: I used detectron2_mscoco_proposal_maxnms.py with small modifications. To generate the label.tsv, I used demo_feature_extraction.ipynb from the same folder (the instances variable holds the labels we want), which also needs some modification. The values it produces are the ones we want, but the format is different, so an additional step is needed to match the format of the COCO dataset files.
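For reference, that extra formatting step can look roughly like the sketch below (untested; write_oscar_tsvs and its inputs are just illustrative names, and the field names 'num_boxes', 'features', 'class', 'conf' and 'rect' are assumed from the COCO tsv files that ship with Oscar, so double-check them against your copy):

import base64
import json

import numpy as np

def write_oscar_tsvs(detections, feature_path, label_path):
    # detections maps image_id -> dict with 'features' (num_boxes x 2048 float32),
    # 'boxes' ([x1, y1, x2, y2] per region), 'classes' and 'scores'
    with open(feature_path, 'w') as feat_f, open(label_path, 'w') as label_f:
        for image_id, det in detections.items():
            feats = np.asarray(det['features'], dtype=np.float32)
            feat_row = {
                'num_boxes': int(feats.shape[0]),
                # base64-encode the raw float32 bytes; the Oscar reader decodes
                # them with base64.b64decode + np.frombuffer
                'features': base64.b64encode(feats.tobytes()).decode('utf-8'),
            }
            labels = [
                {'class': cls, 'conf': float(score), 'rect': [float(v) for v in box]}
                for cls, score, box in zip(det['classes'], det['scores'], det['boxes'])
            ]
            # json.dumps writes double quotes, which avoids the single-quote
            # issue discussed further down in this thread
            feat_f.write('{}\t{}\n'.format(image_id, json.dumps(feat_row)))
            label_f.write('{}\t{}\n'.format(image_id, json.dumps(labels)))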

I tried to do inference using my own dataset, but no luck: I am getting __init__() missing 2 required positional arguments: 'doc' and 'pos' :"( If I manage to get it working I will post the code here.

vinson2233 · Feb 25 '21

> @vincentlux I am planning to use the repo you mentioned above, https://github.com/airsplay/py-bottom-up-attention, to generate features for another dataset I have been working on. I was wondering whether you were able to get performance similar to what is reported in the Oscar paper using the features generated by airsplay's bottom-up-attention repo?

> Did you train with your own dataset? I am trying to do the same. I managed to get captions for my own images using features from Bottom-Up Attention, but Oscar fails to train on the features of an entirely new dataset. So: inference works, but training does not.
>
> The ExceptionWrapper class tries to re-raise the original error but fails, so the original error is hidden from me. The message is __init__() missing 2 required positional arguments: 'doc' and 'pos', which suggests the underlying cause is malformed JSON.

I am trying to get captions for my own images and am now facing lots of problems installing Caffe. Could you share your Dockerfile if you have one? Or, if you could share the procedure you followed to get captions, that would be appreciated. Thank you!

DesaleF · Mar 10 '21

Yeah, I would also appreciate some example code for how to run inference / predict captions.

guillefix · Mar 12 '21

For me, __init__() missing 2 required positional arguments: 'doc' and 'pos' was caused by an issue loading the feature tsv file: airsplay/py-bottom-up-attention writes the tsv file with single quotes, but JSON requires double quotes. This hacky fix in the CaptionTensorizer class sorted things out:

def get_image_features(self, img_idx):
    # Updated to turn the single-quoted dict string into valid JSON
    dict_str = self.feat_tsv.seek(img_idx)[1]
    # Escape any pre-existing double quotes, then convert the single-quote
    # delimiters into the double quotes that JSON requires
    dict_str = dict_str.replace('"', '\\"')
    dict_str = dict_str.replace("'", '"')
    feat_info = json.loads(dict_str)
    num_boxes = feat_info['num_boxes']
    features = np.frombuffer(base64.b64decode(feat_info['features']),
                             np.float32).reshape((num_boxes, -1))
    return torch.Tensor(features)

After this, there is one more step to make inference work on a new dataset: Bottom-Up Attention generates features of size 2048, but Oscar expects 2054. The difference comes from concatenating information about the bounding boxes' positions to each region feature (see the last paragraph on p. 4 of the original paper).

Luckily, that is explained with working code here: https://github.com/microsoft/Oscar/issues/33#issuecomment-702864446.
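For anyone who does not want to open the link, that extra step boils down to something like the sketch below (based on the paper's description of a 6-dimensional position vector and on the linked comment; add_position_features and its argument names are just illustrative, so check the exact normalization against that comment):

import numpy as np

def add_position_features(features, boxes, image_width, image_height):
    # features: (num_boxes, 2048) float32 array from Bottom-Up Attention
    # boxes: (num_boxes, 4) array of [x1, y1, x2, y2] pixel coordinates
    boxes = np.asarray(boxes, dtype=np.float32)
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    # Normalized corner coordinates plus relative width and height (6 extra values)
    position = np.column_stack([
        boxes[:, 0] / image_width,
        boxes[:, 1] / image_height,
        boxes[:, 2] / image_width,
        boxes[:, 3] / image_height,
        widths / image_width,
        heights / image_height,
    ]).astype(np.float32)
    # 2048 + 6 = 2054 values per region, which is what Oscar expects
    return np.concatenate([features, position], axis=1)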

anbrjohn · Mar 12 '21

@anbrjohn Thank you very much! That helps a lot

DesaleF · Mar 12 '21

> @brightbsit I also used https://github.com/airsplay/py-bottom-up-attention to generate the feature.tsv: I used detectron2_mscoco_proposal_maxnms.py with small modifications. To generate the label.tsv, I used demo_feature_extraction.ipynb from the same folder (the instances variable holds the labels we want), which also needs some modification. The values it produces are the ones we want, but the format is different, so an additional step is needed to match the format of the COCO dataset files.
>
> I tried to do inference using my own dataset, but no luck: I am getting __init__() missing 2 required positional arguments: 'doc' and 'pos' :"( If I manage to get it working I will post the code here.

@vinson2233 Could you share what modifications you made to detectron2_mscoco_proposal_maxnms.py to obtain the features tsv file for Oscar?

rachs · Aug 25 '21

> For me, __init__() missing 2 required positional arguments: 'doc' and 'pos' was caused by an issue loading the feature tsv file: airsplay/py-bottom-up-attention writes the tsv file with single quotes, but JSON requires double quotes. This hacky fix in the CaptionTensorizer class sorted things out:
>
> def get_image_features(self, img_idx):
>     # Updated to turn the single-quoted dict string into valid JSON
>     dict_str = self.feat_tsv.seek(img_idx)[1]
>     # Escape any pre-existing double quotes, then convert the single-quote
>     # delimiters into the double quotes that JSON requires
>     dict_str = dict_str.replace('"', '\\"')
>     dict_str = dict_str.replace("'", '"')
>     feat_info = json.loads(dict_str)
>     num_boxes = feat_info['num_boxes']
>     features = np.frombuffer(base64.b64decode(feat_info['features']),
>                              np.float32).reshape((num_boxes, -1))
>     return torch.Tensor(features)
>
> After this, there is one more step to make inference work on a new dataset: Bottom-Up Attention generates features of size 2048, but Oscar expects 2054. The difference comes from concatenating information about the bounding boxes' positions to each region feature (see the last paragraph on p. 4 of the original paper).
>
> Luckily, that is explained with working code here: #33 (comment).

If you don't want to touch the run_captioning.py code, you can instead convert the single quotes into double quotes by running a script like the one below after generating the tsv files:

LABEL_FILE = "./custom.label.tsv"
FEATURE_FILE = "./custom.feature.tsv"

def quote_conversion(path):
    # Read the whole tsv, swap single quotes for double quotes, and write it back in place
    with open(path, 'r') as f:
        text = f.read()

    converted_text = text.replace("'", '"')

    with open(path, 'w') as f:
        f.write(converted_text)

quote_conversion(LABEL_FILE)
quote_conversion(FEATURE_FILE)
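One caveat with this global replace: any apostrophes inside the label strings themselves (for example in an object or attribute name like "person's") will also be turned into double quotes and break the JSON, so if your labels contain apostrophes the per-field fix quoted above is the safer option.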

zamanmub · Nov 11 '21