Chuofan Ma comments

Results: 18 comments by Chuofan Ma

Why not use the text guidance for the OV-LVIS setting in the config?

Oh, I see. Then it should be A100 80G GPUs. Sorry for the mistake.

Gradio web server deployment

Hi there, thanks for your interest in our work. We have not yet implemented a Gradio demo for Groma. The Gradio code was inherited directly from LLaVA. Therefore, you may have...

LVIS eval result file missing

Hi, you can download the LVIS result file [here](https://huggingface.co/datasets/FoundationVision/groma_data/blob/main/lvis_test.json).
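The link above points to the file's page on the Hugging Face Hub (`/blob/`). For scripted downloads, the Hub serves the raw file under `/resolve/` instead. A minimal sketch of that conversion follows; the helper names `to_direct_download_url` and `download` are hypothetical, and whether this dataset repository requires authentication is not stated in the thread.

```python
import urllib.request


def to_direct_download_url(blob_url):
    """Turn a Hugging Face file-page URL (/blob/) into its raw-file URL (/resolve/)."""
    return blob_url.replace("/blob/", "/resolve/", 1)


def download(url, dest):
    """Fetch url and save it to dest (hypothetical helper, needs network access)."""
    urllib.request.urlretrieve(url, dest)


# URL copied from the comment above.
LVIS_RESULT_PAGE = (
    "https://huggingface.co/datasets/FoundationVision/groma_data/blob/main/lvis_test.json"
)
LVIS_RESULT_URL = to_direct_download_url(LVIS_RESULT_PAGE)

# Example call, left commented out so the snippet runs offline:
# download(LVIS_RESULT_URL, "lvis_test.json")
```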

Request for Modified Dataset to Resolve Training Issue : sharegpt4v_instruct_gpt4-vision_cap100k_new.json

Hi there, I modified `sharegpt4v_instruct_gpt4-vision_cap100k_new.json` simply because several images (fewer than 10) have incorrect paths in the original JSON annotations. But for some reason, I do not have access to...

Referring multiple regions in the image

Yes, this framework theoretically supports multiple referring regions as input. For example, you can do this by prompting the model with `Please briefly describe and ` and setting the box...

Demo Issue

Yes, it looks good to me.

The working mechanism of the classifier

Hi there, thank you for your interest in our work. Yes, the classifier works in the same way as CLIP, i.e., the classifier weights are essentially composed of text embeddings.
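To illustrate the idea of a classifier whose weights are text embeddings, here is a minimal CLIP-style sketch. It is not Groma's actual code: the toy vectors stand in for real text-encoder outputs, and the function names and the `temperature` value are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_classifier(text_embeddings):
    """Use one normalized text embedding per class as that class's weight vector."""
    return {name: l2_normalize(emb) for name, emb in text_embeddings.items()}


def classify(region_feature, classifier, temperature=0.01):
    """Score a visual feature against every class weight and apply a softmax."""
    feat = l2_normalize(region_feature)
    logits = {
        name: sum(f * w for f, w in zip(feat, weight)) / temperature
        for name, weight in classifier.items()
    }
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(l - max_logit) for name, l in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Toy embeddings (assumed values) standing in for text-encoder outputs of class prompts.
toy_text_embeddings = {"cat": [1.0, 0.1, 0.0], "dog": [0.0, 1.0, 0.2]}
classifier = build_classifier(toy_text_embeddings)

probs = classify([0.9, 0.2, 0.0], classifier)
prediction = max(probs, key=probs.get)
```

Because the weights come from text rather than from training on a fixed label set, adding a new category only requires embedding a new prompt and adding it to the dictionary.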

The working mechanism of the classifier

It's 'a xxx'.