
Moondream for Object Localization

Open E5GEN2 opened this issue 1 year ago • 6 comments

I am wondering if Moondream can be used for grounding tasks such as object localization, similar to what CogAgent does with GUIs, but trained on my own custom dataset. If I fine-tune Moondream on a custom dataset of images, bounding boxes, and text, is there a chance it would work?

E5GEN2 avatar May 28 '24 10:05 E5GEN2

Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.

vikhyat avatar May 29 '24 17:05 vikhyat
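For readers who want to use the output described above, here is a minimal sketch of turning the returned relative coordinates into pixel coordinates. It assumes the model's answer arrives as a JSON-style string of four numbers, which the thread does not confirm, and the helper names `parse_relative_box` and `to_pixel_box` are hypothetical and not part of Moondream.

```python
import json


def parse_relative_box(answer: str) -> tuple[float, float, float, float]:
    """Parse an answer such as "[0.1, 0.2, 0.5, 0.6]" into four floats.

    Assumption: the answer is a JSON-style list of four numbers in the
    order x1, y1, x2, y2 (top-left corner, then bottom-right corner).
    """
    values = json.loads(answer)
    if not isinstance(values, list) or len(values) != 4:
        raise ValueError(f"expected a list of 4 numbers, got: {answer!r}")
    x1, y1, x2, y2 = (float(v) for v in values)
    return x1, y1, x2, y2


def to_pixel_box(
    box: tuple[float, float, float, float], width: int, height: int
) -> tuple[int, int, int, int]:
    """Scale relative (0..1) coordinates to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width),
        round(y1 * height),
        round(x2 * width),
        round(y2 * height),
    )


# Example with a made-up answer string and a 640x480 image.
pixel_box = to_pixel_box(parse_relative_box("[0.25, 0.5, 0.75, 1.0]"), 640, 480)
```

Since the output format may change in a later release, as noted above, the parsing step is the part most likely to need adjusting.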

> Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners. The next release will add support for multiple objects and may also change the output format. I'll post an update here when it's out.

What if I have a dataset of images + actions, e.g. {"x1": 420, "x2": 378, "y1": 1042, "y2": 245, "action": "swipe", "duration": 200}?

Would it be able to predict such actions if I train it on my dataset?

Is it possible to predict the next action for a sequence of images + actions? If not, what if I create a collage image of the previous images + actions? Would it be able to learn such a task?

E5GEN2 avatar May 29 '24 18:05 E5GEN2

When building a dataset for fine-tuning, what format should the coordinates be in? And are you using a separate regression loss, or is the text decoder producing the coordinates entirely as a string?

Shalom-P avatar Jun 04 '24 07:06 Shalom-P
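The thread does not answer this question. Purely as an illustration, the sketch below shows one way a pixel-space bounding box could be turned into a text question/answer pair that mirrors the query format described earlier in the thread ("Bounding box: {object}" with four relative coordinates). The function name `make_sample`, the `question`/`answer` keys, and the two-decimal rounding are all assumptions, not Moondream's actual fine-tuning format, and nothing here says whether a regression loss is involved.

```python
import json


def make_sample(
    label: str,
    pixel_box: tuple[int, int, int, int],
    width: int,
    height: int,
    decimals: int = 2,
) -> dict[str, str]:
    """Build a hypothetical text-only training pair from a pixel box.

    pixel_box is (x1, y1, x2, y2): top-left corner, then bottom-right.
    Coordinates are normalized to the 0..1 range by the image size.
    """
    x1, y1, x2, y2 = pixel_box
    relative = [
        round(x1 / width, decimals),
        round(y1 / height, decimals),
        round(x2 / width, decimals),
        round(y2 / height, decimals),
    ]
    return {
        "question": f"Bounding box: {label}",
        "answer": json.dumps(relative),  # coordinates serialized as a string
    }


# Example with a made-up label and box on a 640x480 image.
sample = make_sample("button", (160, 240, 480, 480), 640, 480)
```

Under this assumption the coordinates are just tokens in the answer string, so the ordinary language-modeling loss would apply to them.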

> When building a dataset for fine-tuning, what format should the coordinates be in? And are you using a separate regression loss, or is the text decoder producing the coordinates entirely as a string?

I have the same question.

ander008 avatar Dec 05 '24 03:12 ander008

Bump, curious on this

arthurcolle avatar Jan 04 '25 18:01 arthurcolle

> Yes - the current version of moondream can detect one object per image. If you query with Bounding box: {object} it will return an array of 4 floating point numbers that indicate the relative (x1, y1) and (x2, y2) positions for the top-left and bottom-right corners.

@vikhyat This is not working for me. It still only returns a point plotted on the image instead. Could you kindly share an example of getting the OCR values of a name plate like this one via a prompt, with bounding boxes? (I would like to tune the prompt later to extract only the values I am interested in.)

image

parthi2929 avatar Jan 12 '25 09:01 parthi2929