[Feat req.] CLIP embeddings

Open knoopx opened this issue 6 months ago • 4 comments

Hello! I was about to implement an app similar to yours and just stumbled upon it. It does 99% of what I wanted but is missing one piece: search! Have you considered adding support for CLIP (or friends) to fotema? It would allow two things: 1) filtering your library with text and 2) finding related/duplicate images. I have no clue about Rust (but plenty of programming experience); I can try to "vibecode" a PR. I did a little research and it looks like HF's candle supports CLIP. What are your thoughts?

Jul 17 '25 20:07 knoopx

Hi @knoopx. I would like to add object recognition to Fotema so that a search feature could be implemented, supporting searches such as "Pictures of Steve with a Dog in London". Ideally I'd want pictures to be tagged with individual keywords like "dog" rather than sentences like "a picture of a dog". I've looked at several object recognition models for supporting this, but they all produce labels in English, and I've not figured out how I would make that support multiple languages.

I'd be happy to see a PR but I have zero interest in "vibe coding".

Jul 18 '25 05:07 blissd

Hey I think you fundamentally misunderstand how CLIP works. The goal is not to tag images but to extract image embeddings. Then you can write a search query, compute its text embedding, and rank the library images by the cosine similarity between that text embedding and each image embedding. CLIP is not an object segmentation and classification model with a fixed number of classes like YOLO.
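To make the idea concrete, here is a minimal sketch of the ranking step only. It assumes the embeddings already exist; the 3-dimensional vectors, the file names, and the `rank_images` helper are made up for illustration (real CLIP embeddings have hundreds of dimensions and would come from a model, which is not shown here).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return the top_k (path, score) pairs, most similar first.

    Hypothetical helper: image_embeddings maps a file path to its
    precomputed image embedding.
    """
    scored = [
        (path, cosine_similarity(text_embedding, embedding))
        for path, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for precomputed image embeddings (assumption, not real model output).
library = {
    "dog_park.jpg": [0.9, 0.1, 0.0],
    "city_skyline.jpg": [0.1, 0.9, 0.2],
    "dog_in_city.jpg": [0.7, 0.6, 0.1],
}

# Toy stand-in for the text embedding of a query such as "a dog".
query_embedding = [1.0, 0.0, 0.0]

results = rank_images(query_embedding, library, top_k=2)
for path, score in results:
    print(f"{path}: {score:.3f}")
```

Because nothing is tagged, the same image embeddings can also serve the related/duplicate use case: rank the library against another image's embedding instead of a text embedding.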

Jul 18 '25 08:07 knoopx

To support a search like "Pictures of Steve with a Dog in London" all these things need to happen:

  • detect all faces in all images (retinaface)
  • compute face embeds for every face (arcface)
  • compute clip image embeds (clip)
  • parse the search string, match people names against the db and get their face embeds ("Steve")
  • remove people names from the search string and compute the clip text embed ("a dog in london")
  • match all images containing a face similar to the face embed
  • of those, match all images whose clip image embed is similar to the clip text embed
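The query-time half of the steps above could be sketched roughly like this. Everything here is a toy under stated assumptions: the face and CLIP embeddings are hand-written short vectors rather than model output, `embed_text` is a placeholder for a real CLIP text encoder, and the names `split_query`, `search`, and both thresholds are hypothetical.

```python
import math
import re


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_query(query, known_names):
    """Return (names found in the query, query text with those names removed)."""
    found = []
    remaining = query
    for name in known_names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        if pattern.search(remaining):
            found.append(name)
            remaining = pattern.sub(" ", remaining)
    return found, " ".join(remaining.split())


def search(query, people, photos, embed_text,
           face_threshold=0.8, clip_threshold=0.5):
    """Keep photos that contain every named person and match the remaining text.

    people: name -> reference face embedding (hypothetical db lookup)
    photos: path -> {"faces": [face embeddings], "clip": image embedding}
    Thresholds are arbitrary illustration values, not tuned numbers.
    """
    names, text = split_query(query, list(people))
    text_embedding = embed_text(text) if text else None
    hits = []
    for path, photo in photos.items():
        # Face embeddings are only ever compared with face embeddings.
        has_everyone = all(
            any(cosine_similarity(people[name], face) >= face_threshold
                for face in photo["faces"])
            for name in names
        )
        if not has_everyone:
            continue
        # The text embedding is only ever compared with CLIP image embeddings.
        score = 1.0
        if text_embedding is not None:
            score = cosine_similarity(text_embedding, photo["clip"])
        if score >= clip_threshold:
            hits.append((path, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy data standing in for the precomputed index (assumption).
people = {"Steve": [1.0, 0.0]}
photos = {
    "steve_dog.jpg": {"faces": [[0.98, 0.05]], "clip": [0.9, 0.1, 0.0]},
    "steve_beach.jpg": {"faces": [[0.99, 0.02]], "clip": [0.0, 0.2, 0.9]},
    "stranger_dog.jpg": {"faces": [[0.1, 0.95]], "clip": [0.95, 0.05, 0.0]},
}


def fake_embed_text(text):
    """Placeholder for a CLIP text encoder; always returns a 'dog-like' vector."""
    return [1.0, 0.0, 0.0]


hits = search("Pictures of Steve with a Dog in London", people, photos, fake_embed_text)
print(hits)
```

The face filter and the CLIP filter are intersected here, so a photo of Steve without a dog and a photo of a dog without Steve are both dropped. Whether to intersect strictly or to blend the two scores is a design choice this sketch does not settle.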

Jul 18 '25 09:07 knoopx

"Hey I think you fundamentally misunderstand how CLIP works."

Probably true.

Jul 18 '25 17:07 blissd