[Feat req.] CLIP embeddings
Hello! I was about to implement a similar app to yours and just stumbled upon it. It does 99% of what I wanted but is missing one piece: search! Have you considered adding support for CLIP (or similar models) to Fotema? It would allow two things: 1) filtering your library with text and 2) finding related/duplicate images. I have no clue about Rust (but plenty of programming experience), so I could try to "vibecode" a PR. I did a little research and it looks like HF's candle supports CLIP. What are your thoughts?
Hi @knoopx. I would like to add object recognition to Fotema so that a search feature could be implemented, supporting searches such as "Pictures of Steve with a Dog in London". Ideally I'd want pictures to be tagged with individual keywords like "dog" rather than sentences like "a picture of a dog". I've looked at several object recognition models for this, but they all produce labels in English, and I've not figured out how I would make that support multiple languages.
I'd be happy to see a PR but I have zero interest in "vibe coding".
Hey, I think you fundamentally misunderstand how CLIP works. The goal is not to tag images but to extract image embeddings. Then you can write a search query, compute a text embedding from it, and compare cosine similarity against the library's image embeddings. CLIP is not an object segmentation and classification model with a fixed number of classes like YOLO.
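The ranking side is basically just a nearest-neighbour comparison. A minimal Rust sketch, assuming the text and image embeddings already come from the same CLIP model (e.g. via candle) and leaving out how they are produced or stored; the function names here are made up for illustration:

```rust
/// Cosine similarity between two embeddings of the same dimensionality.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Rank library images against a query's CLIP text embedding.
/// Returns (picture_id, score) pairs sorted by descending similarity.
fn rank_by_text_query(
    text_embedding: &[f32],
    image_embeddings: &[(u64, Vec<f32>)], // (picture_id, CLIP image embedding)
) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = image_embeddings
        .iter()
        .map(|(id, emb)| (*id, cosine_similarity(text_embedding, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored
}
```

The same comparison between two *image* embeddings is what gives you related/duplicate detection.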
To support a search like "Pictures of Steve with a Dog in London", all of these things need to happen (rough sketch after the list):
- detect all faces in all images (RetinaFace)
- compute a facial embedding for every face (ArcFace)
- compute CLIP image embeddings for all images (CLIP)
- parse the search string, match people's names in the DB, and fetch their face embeddings ("Steve")
- remove the people's names from the search string and compute the CLIP text embedding ("a dog in london")
- match all images with a similar face embedding
- match all images with a similar CLIP image embedding
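Here is a rough, hypothetical sketch of how those steps could hang together. None of these types or helpers exist in Fotema: `clip_text_embedding` stands in for the actual text-encoder call (e.g. via candle), the in-memory `Person`/`Picture` structs stand in for whatever the database would store, and the similarity thresholds are made-up numbers.

```rust
struct Person {
    name: String,
    face_embedding: Vec<f32>, // ArcFace-style embedding for a known person
}

struct Picture {
    id: u64,
    clip_embedding: Vec<f32>,       // precomputed CLIP image embedding
    face_embeddings: Vec<Vec<f32>>, // one embedding per face detected by RetinaFace
}

// Placeholder for running the CLIP text encoder on the query remainder.
fn clip_text_embedding(_text: &str) -> Vec<f32> {
    todo!("run the CLIP text encoder")
}

// Same cosine similarity as in the earlier snippet.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b)).max(f32::EPSILON)
}

fn search(query: &str, people: &[Person], pictures: &[Picture]) -> Vec<u64> {
    // 1. Match known people names in the query ("Steve"), collect their face
    //    embeddings, and strip the names from the query text.
    let mut remainder = query.to_lowercase();
    let mut wanted_faces: Vec<&[f32]> = Vec::new();
    for person in people {
        let name = person.name.to_lowercase();
        if remainder.contains(&name) {
            wanted_faces.push(person.face_embedding.as_slice());
            remainder = remainder.replace(name.as_str(), "");
        }
    }

    // 2. CLIP text embedding for what is left ("a dog in london").
    let text_embed = clip_text_embedding(&remainder);

    // 3. Keep pictures whose CLIP image embedding is close to the text
    //    embedding and, if people were named, that also contain a face close
    //    to one of the requested face embeddings.
    pictures
        .iter()
        .filter(|p| cosine_similarity(&p.clip_embedding, &text_embed) > 0.25)
        .filter(|p| {
            wanted_faces.is_empty()
                || p.face_embeddings
                    .iter()
                    .any(|f| wanted_faces.iter().any(|w| cosine_similarity(w, f) > 0.5))
        })
        .map(|p| p.id)
        .collect()
}
```

In practice the image and face embeddings would be computed once at scan time and persisted, so a query only has to run the text encoder and the similarity comparisons.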
> Hey, I think you fundamentally misunderstand how CLIP works.
Probably true.