
scenevlm finetune

Open 77h2l opened this issue 1 year ago • 8 comments

Significant work! How can a user SFT the scenevlm model on their own specified scene?

77h2l · Nov 28 '24

Thanks for your recognition of our work! We'll explain how to fine-tune on custom datasets later, so feel free to keep following us.

harrytea · Nov 28 '24

Hi @harrytea, I'm strongly interested in this work. There are still some questions I hope you can answer:

  1. Is the released ckpt at https://huggingface.co/harrytea/ROOT/tree/main the fine-tuned scenevlm?
  2. Where are the scene VQA dataset and the hierarchical scene graph generation dataset?
  3. How can users construct a scene graph dataset for their own scenario according to the format in the paper and run SFT? I think a tutorial is rather essential.

Thx.

77h2l · Dec 26 '24

Thank you again for your interest in our work.

  1. Yes, it has already been fine-tuned.
  2. The SceneVQA dataset is for commercial use, so we cannot release it.
  3. I promise to write a fine-tuning tutorial this week, please be patient.

harrytea · Dec 26 '24

Updated. If you have any questions, please feel free to ask~

harrytea · Dec 26 '24

@harrytea Sorry to bother you again. I have a few follow-up questions:

  1. How did you label the scene graph data? Could you provide some examples of the labeled data format?
  2. How do you test the scene graph accuracy?

77h2l · Jan 06 '25

  1. Labeling the data: after the first two steps, you can use utils/show_point.py to display the names of the objects in the image, and then annotate the relationships manually. There are four types of relationships: support, attach, contain, and hang. You can annotate some data, train a first version of the model, and then keep iterating on it. An example of the labeled format:

```json
{
    "floor": {
        "support": [
            {
                "rug": {
                    "support": [
                        {
                            "dining table": {}
                        },
                        {
                            "white sofa": {
                                "support": [
                                    {
                                        "colorful pillow_0": {}
                                    },
                                    {
                                        "colorful pillow_3": {}
                                    },
                                    {
                                        "colorful pillow_2": {}
                                    }
                                ]
                            }
                        },
                        {
                            "modern chairs": {}
                        }
                    ]
                }
            }
        ]
    },
    "ceiling": {
        "attach": [
            {
                "chandelier": {}
            }
        ]
    },
    "wall": {
        "hang": [
            {
                "paintings": {}
            },
            {
                "wooden door": {}
            }
        ]
    }
}
```

Final data format

Q: [You can refer to the prompt in our paper.] A: You can use the JSON example above directly, or use GPT to convert it into a purely conversational form; that is, the answer should be conversational text plus the JSON.
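
For concreteness, here is a minimal sketch of how one such training sample could be assembled, assuming a LLaVA-style single-turn conversation schema. The prompt text, the `build_sft_sample` helper, and the exact field names are illustrative assumptions, not the project's actual format.

```python
import json

# Illustrative prompt text -- substitute the actual question prompt from the paper.
SCENE_GRAPH_PROMPT = (
    "Describe the hierarchical scene graph of this indoor image using the "
    "relationships support, attach, contain, and hang. Answer in JSON."
)

def build_sft_sample(image_path: str, scene_graph: dict) -> dict:
    """Wrap one annotated scene graph into a single-turn conversation sample.

    The answer is a short conversational sentence followed by the JSON graph,
    i.e. the 'conversational form + JSON' format described above.
    """
    answer = (
        "Here is the hierarchical scene graph of the image:\n"
        + json.dumps(scene_graph, indent=4)
    )
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": SCENE_GRAPH_PROMPT},
            {"from": "gpt", "value": answer},
        ],
    }

if __name__ == "__main__":
    graph = {"floor": {"support": [{"rug": {"support": [{"dining table": {}}]}}]}}
    print(json.dumps(build_sft_sample("scene_0001.jpg", graph), indent=2))
```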

  2. You can refer to our paper and use F1, recall, and precision to measure accuracy. We haven't uploaded the evaluation code yet; I will reorganize it and provide it for reference.
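
While the official evaluation code is being prepared, one straightforward way to score a predicted scene graph against a labeled one is to flatten both hierarchies into (parent, relation, child) triples and compute precision, recall, and F1 over the two triple sets. The sketch below follows that reading of the metric; it is not necessarily the paper's exact protocol.

```python
def flatten_triples(graph: dict) -> set:
    """Flatten a hierarchical scene graph (as in the JSON above) into
    (parent, relation, child) triples."""
    triples = set()
    for node, relations in graph.items():
        for relation, children in relations.items():
            for child in children:
                # Each child is a single-key dict: {"object name": {sub-relations}}.
                (child_name, child_sub), = child.items()
                triples.add((node, relation, child_name))
                triples |= flatten_triples({child_name: child_sub})
    return triples

def scene_graph_prf(pred: dict, gt: dict) -> dict:
    """Precision, recall, and F1 over exact-match relationship triples."""
    pred_t, gt_t = flatten_triples(pred), flatten_triples(gt)
    tp = len(pred_t & gt_t)
    precision = tp / len(pred_t) if pred_t else 0.0
    recall = tp / len(gt_t) if gt_t else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```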

harrytea · Jan 07 '25

@harrytea Thanks for your reply. I have actually reproduced your work and am trying to use it for monitoring scenario understanding. Real offline industrial scenes are much more complex than your self-made dataset, which makes the structured scene graph rather difficult to establish and assess. Since a vanilla VLM can already handle scene VQA that is not very fine-grained, do you have any ablation study showing that going through the intermediate scene graph enhances the final VQA ability or scene understanding (beyond distance measurement)?

77h2l · Jan 07 '25

Thank you for your interest in our research.

The scene graphs in the paper were also generated with a VLM, and both the scene graph and distance predictions are aimed at exploring the VLM's spatial understanding capabilities. Our experiments show that the performance on the datasets mentioned in the paper is acceptable.

In fact, our business scenario is within a game. Since indoor game scenes can be batch-generated through another 3D pipeline, we have a large amount of data to train our VLM, and we already have all the metadata for the entire ROOT pipeline. It performs well in simple game scenes. (By the way, we not only trained the forward pipeline but also constructed a reverse one: we modified the data format so that the input is a series of objects and the model outputs a reasonable room layout, then indexed the objects to generate indoor scenes.)

If you're using it in an industrial setting, could you use some vision foundation models, such as RAM or GroundingDINO, to extract key information about fine-grained objects from the image, and then feed this information into the VLM to boost its VQA performance?
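
As a rough illustration of that idea, the sketch below runs open-vocabulary detection with GroundingDINO through the Hugging Face transformers integration and prepends the detections to the VQA prompt. The checkpoint name, the query list, the thresholds, and the prompt layout are assumptions on my part, and the post-processing API can differ slightly between transformers versions, so please check it against your installed version.

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

# Assumed checkpoint; any GroundingDINO checkpoint on the Hub should work similarly.
DETECTOR_ID = "IDEA-Research/grounding-dino-tiny"

def detect_objects(image: Image.Image, queries: list) -> list:
    """Run open-vocabulary detection and return (label, box) pairs above threshold."""
    processor = AutoProcessor.from_pretrained(DETECTOR_ID)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(DETECTOR_ID)
    # GroundingDINO expects lowercase queries separated by periods.
    text = ". ".join(q.lower() for q in queries) + "."
    inputs = processor(images=image, text=text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        box_threshold=0.35,
        text_threshold=0.25,
        target_sizes=[image.size[::-1]],
    )[0]
    return list(zip(results["labels"], results["boxes"].tolist()))

def build_vqa_prompt(detections: list, question: str) -> str:
    """Prepend the detected objects and boxes to the question before sending it to the VLM."""
    lines = [f"- {label}: box={[round(v, 1) for v in box]}" for label, box in detections]
    return "Detected objects:\n" + "\n".join(lines) + "\n\nQuestion: " + question
```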

harrytea · Jan 09 '25