Moondream for Object Localization

I am wondering if Moondream can be used for grounding tasks such as object localization, similar to what CogAgent does with GUIs, except that I would like to train on my own dataset. If I fine-tune Moondream on a custom dataset of images with bounding boxes and text, is there a chance it would work?

May 28 '24 10:05 E5GEN2

Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.
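For illustration, here is a minimal sketch of turning such a reply into pixel coordinates. It assumes the reply arrives as a string like "[0.25, 0.5, 0.75, 1.0]" and that the values are fractions of the image width and height; the helper names parse_relative_box and to_pixel_box are hypothetical and not part of moondream.

```python
import ast


def parse_relative_box(reply: str) -> tuple[float, float, float, float]:
    """Parse a reply such as "[0.25, 0.5, 0.75, 1.0]" into (x1, y1, x2, y2).

    Assumption: the reply is a plain list literal of four numbers.
    """
    values = ast.literal_eval(reply.strip())
    if len(values) != 4:
        raise ValueError(f"expected 4 numbers, got {len(values)}")
    x1, y1, x2, y2 = (float(v) for v in values)
    return x1, y1, x2, y2


def to_pixel_box(box, width: int, height: int) -> tuple[int, int, int, int]:
    """Scale relative corners to pixel coordinates for a width x height image."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width),
        round(y1 * height),
        round(x2 * width),
        round(y2 * height),
    )


# Example with a made-up reply and a 640x480 image.
box = parse_relative_box("[0.25, 0.5, 0.75, 1.0]")
print(to_pixel_box(box, width=640, height=480))
```

Since the output format may change in the next release, the parsing step is the part most likely to need adjusting.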

May 29 '24 17:05 vikhyat

> Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.

What if I have a dataset of images + actions, e.g. {"x1": 420, "x2": 378, "y1": 1042, "y2": 245, "action": "swipe", "duration": 200}?

Would it be able to predict such actions if I train it on my dataset?
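To make the question concrete, here is a rough sketch of how one such record could be turned into a text target for fine-tuning. Everything in it is an assumption for illustration: the helper name action_to_target is hypothetical, the 1080x1920 screen size is made up, and scaling coordinates to relative values simply mirrors the relative-coordinate convention described above. Whether the model can actually learn such targets is the open question.

```python
import json


def action_to_target(action: dict, width: int, height: int) -> str:
    """Serialize an action record as a JSON string with relative coordinates.

    Assumption: x1/x2 are pixel columns and y1/y2 are pixel rows of the
    screenshot the action was recorded on.
    """
    target = dict(action)  # copy so the original record is left untouched
    for key in ("x1", "x2"):
        target[key] = round(action[key] / width, 3)
    for key in ("y1", "y2"):
        target[key] = round(action[key] / height, 3)
    return json.dumps(target, sort_keys=True)


example = {"x1": 420, "x2": 378, "y1": 1042, "y2": 245,
           "action": "swipe", "duration": 200}
# The screen size below is a made-up value for the example.
print(action_to_target(example, width=1080, height=1920))
```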

Is it possible to predict the next action from a sequence of images + actions? If not, what if I create a collage image of the previous images and actions? Would it be able to learn such a task?

May 29 '24 18:05 E5GEN2

When building a dataset for fine-tuning, in what format should the coordinates be given? Also, are you using a separate regression loss, or does the text decoder produce the coordinates entirely as a string?

Jun 04 '24 07:06 Shalom-P

> When building a dataset for fine-tuning, in what format should the coordinates be given? Also, are you using a separate regression loss, or does the text decoder produce the coordinates entirely as a string?

I have the same question.

Dec 05 '24 03:12 ander008

Bump, curious about this.

Jan 04 '25 18:01 arthurcolle

> Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners.

@vikhyat This is not working for me. It still only returns a point plotted on the image. Could you kindly share an example of getting the OCR values of a nameplate like this via a prompt, with bounding boxes? (I would like to tune the prompt later to capture only the values I am interested in.)

Jan 12 '25 09:01 parthi2929