fotema [Feat req.] CLIP embeddings

Hello! I was about to implement an app similar to yours and just stumbled upon it. It does 99% of what I wanted but lacks one piece: search! Have you considered adding support for CLIP (or friends) to fotema? It would allow two things: 1) filtering your library with text and 2) finding related/duplicate images. I have no clue about Rust (but plenty of programming experience), so I can try to "vibecode" a PR. I did a little research and it looks like HF's candle supports CLIP. What are your thoughts?

Jul 17 '25 20:07 knoopx

Hi @knoopx. I would like to add object recognition to Fotema so that a search feature could be implemented, supporting searches such as "Pictures of Steve with a Dog in London". Ideally I'd want pictures to be tagged with individual keywords like "dog" rather than sentences like "a picture of a dog". I've looked at several object recognition models for this, but they all produce labels in English, and I've not figured out how I would make that support multiple languages.

I'd be happy to see a PR but I have zero interest in "vibe coding".

Jul 18 '25 05:07 blissd

Hey I think you fundamentally misunderstand how CLIP works. The goal is not to tag images but to extract image embeddings. Then you can write a search query, compute its text embedding, and rank the library images by the cosine similarity between that text embedding and each image embedding. CLIP is not an object detection and classification model with a fixed number of classes like YOLO.
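
To make the ranking step concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are made-up stand-ins for real CLIP embeddings (which have hundreds of dimensions and would come from the model), and `rank_images` is a hypothetical helper name, not anything from fotema or candle.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return the top_k (path, score) pairs, best match first."""
    scored = [
        (path, cosine_similarity(text_embedding, emb))
        for path, emb in image_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy data: placeholder vectors, NOT real CLIP output.
library = {
    "dog_park.jpg": [0.9, 0.1, 0.0],
    "city_skyline.jpg": [0.0, 0.2, 0.9],
    "dog_in_city.jpg": [0.6, 0.1, 0.6],
}
query = [1.0, 0.0, 0.1]  # pretend this is the text embedding for "a dog"
results = rank_images(query, library, top_k=2)
```

Because the text and image embeddings live in the same vector space, no fixed label set is involved: any query string can be embedded and compared. The same function also covers the related/duplicate use case if you pass an image embedding as the query.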

Jul 18 '25 08:07 knoopx

To support a search like "Pictures of Steve with a Dog in London" all these things need to happen:

detect all faces in all images (retinaface)
compute a face embedding for every detected face (arcface)
compute a CLIP image embedding for every image (clip)
parse the search string, match people's names against the db and fetch the face embedding for each match ("Steve")
remove the people's names from the search string and compute the CLIP text embedding of what is left ("a dog in london")
keep the images that contain a face whose embedding is similar to the matched person's
rank those images by the similarity of their CLIP image embedding to the text embedding
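
A rough sketch of the query-time half of those steps, assuming the face and CLIP embeddings were already computed and stored per image. Everything here is hypothetical: the `people` and `images` layouts, the `embed_text` callable (standing in for the CLIP text encoder), the 0.8 face threshold, and the 2-dimensional toy vectors.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def split_query(query, known_names):
    """Separate known person names from the rest of the search string."""
    names, rest = [], []
    for word in query.split():
        if word.lower() in known_names:
            names.append(word.lower())
        else:
            rest.append(word)
    return names, " ".join(rest)

def search(query, people, images, embed_text, face_threshold=0.8, top_k=5):
    names, remaining = split_query(query, set(people))
    # Keep images where every named person matches at least one detected face.
    candidates = [
        img for img in images
        if all(
            any(cosine_similarity(people[n], face) >= face_threshold
                for face in img["faces"])
            for n in names
        )
    ]
    # Rank the survivors by CLIP text-to-image similarity.
    text_emb = embed_text(remaining)
    candidates.sort(
        key=lambda img: cosine_similarity(text_emb, img["clip"]), reverse=True
    )
    return [img["path"] for img in candidates[:top_k]]

# Toy data: placeholder vectors, not real ArcFace or CLIP output.
people = {"steve": [1.0, 0.0]}
images = [
    {"path": "steve_dog.jpg", "faces": [[0.99, 0.05]], "clip": [0.9, 0.1]},
    {"path": "steve_beach.jpg", "faces": [[0.98, 0.1]], "clip": [0.1, 0.9]},
    {"path": "stranger_dog.jpg", "faces": [[0.0, 1.0]], "clip": [0.95, 0.05]},
]

def fake_embed_text(text):
    return [1.0, 0.0]  # stub: a real encoder would embed `text`

results = search("Pictures of Steve with a Dog in London",
                 people, images, fake_embed_text)
```

With this toy data the stranger's photo is filtered out by the face match, and the two Steve photos come back with the dog one first. Matching names word by word is a simplification; multi-word names would need a smarter parser.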

Jul 18 '25 09:07 knoopx

Hey I think you fundamentally misunderstand how CLIP works.

Probably true.

Jul 18 '25 17:07 blissd