TagStudio [Feature Request]: AI assisted tagging using object detection / recognition

Checklist

[x] I am using an up-to-date version.
[x] I have read the documentation.
[x] I have searched existing issues.

Description

To begin with, I'm aware of TagStudio's stance on AI-related features (mentioned in the readme), and I believe this feature could be integrated entirely locally, without depending on external online services or compromising user privacy.

To better understand how this feature could work, it helps to know the basics of how YOLO models are used (I'm going to use YOLO models as an example). As a quick TL;DR: YOLO models can be trained to perform multiple tasks; in this case we are interested in image classification and object detection.

Here is how I see it: this feature would be added as a new menu, "AI assisted tagging". From there, you could choose to process untagged entries, or select specific entries that meet certain criteria, such as sharing the same sub-directory. I would additionally like a right-click option, when multiple entries are selected in the UI, to process them with the "AI assisted tagging" feature.

The AI tagging would work as follows (again, assuming YOLO models, but a lot of models work in similar ways). The feature would support two modes: image classification, where only one TAG is added as a result, and object detection, where multiple TAGs could be added as a result.

For image classification: the user would first need to map each of the model's known classes to a specific TAG in TagStudio; alternatively, an option to ignore a specific class could be provided. The user should also be able to export and import this configuration. As for the inference and how the TAG would be added, we would take the class with the highest probability from the inference results and add the associated TAG to the TagStudio entry.
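
The classification flow described above can be sketched roughly as follows. This is only an illustration: the mapping contents, the function names and the JSON export format are assumptions made for the example, not an existing TagStudio API.

```python
import json

# Hypothetical mapping from model class names to TagStudio tag names.
# A value of None means "ignore this classification".
CLASS_TO_TAG = {"cat": "Cat", "dog": "Dog", "person": None}


def pick_tag(probabilities, class_to_tag):
    """Return the tag mapped to the most probable class, or None.

    `probabilities` maps class names to scores, as a classifier might report.
    """
    if not probabilities:
        return None
    best_class = max(probabilities, key=probabilities.get)
    # Unknown or ignored classes yield no tag at all.
    return class_to_tag.get(best_class)


def export_mapping(class_to_tag):
    """Serialize the mapping so it can be shared between libraries."""
    return json.dumps(class_to_tag, indent=2, sort_keys=True)


def import_mapping(text):
    """Load a mapping previously produced by export_mapping."""
    return json.loads(text)
```

In this sketch, an ignored top class results in no tag being added rather than falling back to the second-best class; either behavior would be a reasonable design choice.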

For object detection: the user would first need to map each of the model's known objects to a specific TAG in TagStudio, along with the minimum probability (confidence) required to consider the object really present in the picture/video; alternatively, an option to ignore a specific object could be provided. As for the inference and how TAGs would be added, we would take the detected objects, check that each detection's confidence is at least the configured minimum, and add the associated TAGs to the TagStudio entry.
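
A minimal sketch of that per-object rule check, assuming detections arrive as (class name, confidence) pairs. The `ObjectRule` type, the rule values and the function name are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectRule:
    tag: Optional[str]      # TagStudio tag name, or None to ignore the object
    min_confidence: float   # minimum confidence required to apply the tag


# Hypothetical user configuration.
RULES = {
    "person": ObjectRule("Person", 0.60),
    "car": ObjectRule("Vehicle", 0.50),
    "bench": ObjectRule(None, 0.0),  # explicitly ignored
}


def tags_from_detections(detections, rules):
    """Return the sorted, de-duplicated tags for one entry.

    `detections` is an iterable of (class_name, confidence) pairs.
    """
    tags = set()
    for class_name, confidence in detections:
        rule = rules.get(class_name)
        if rule is None or rule.tag is None:
            continue  # unmapped or ignored object
        if confidence >= rule.min_confidence:
            tags.add(rule.tag)
    return sorted(tags)
```

Several detections of the same object only add the tag once, since an entry either has a tag or does not.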

This is the main idea of the feature. Of course, there could be other nice additions, such as a toggle to make the feature semi-automatic, meaning the user would be asked to confirm the TAGs for each entry before they are added.

Don't hesitate to talk about it and give your ideas, I'm interested in other points of view too.

This would also close (complete) https://github.com/TagStudioDev/TagStudio/issues/326. Again, this would be done entirely locally from the Python app itself, and libraries for it already exist.

Mar 16 '25 19:03 kitsumed

Since I used YOLO & ultralytics as examples to pitch the feature, I might as well link the existing ultralytics package for training and using YOLO models (https://github.com/ultralytics/ultralytics). The free (non-commercial) version is licensed under AGPL3 and TagStudio is GPL3, so I don't think there would be issues. (Not legal advice)

Mar 16 '25 20:03 kitsumed

I think the thoughts on this can be found here: https://github.com/TagStudioDev/TagStudio/discussions/626#discussioncomment-11462770

Mar 16 '25 21:03 Tishj

I think the thoughts on this can be found here: #626 (reply in thread)

I wasn't aware of this discussion, thanks for bringing it up. It seems to be about using an LLM to link TagStudio entries using file properties. My feature request does not use an LLM or file properties. Instead, it uses object detection and image recognition to apply already existing tags, making tagging faster than doing it manually.

I do think the idea of allowing LLMs to tag files using file properties is great, even though it is unrelated to this issue. However, two major issues come to mind. First, LLMs are hard to make behave the way you want, and it gets worse the smaller the model is. Second, LLMs are resource intensive to run: not everyone has hardware good enough to run an LLM intelligent enough to handle the task well, and running smaller models will get you worse / unreliable results.

YOLO models and the like, on the other hand, are less resource intensive. Yes, they are still slower on really old hardware, but they can run and get good results. There's also no risk of getting an output that's not "usable", as these types of models were made specifically for the task we are using them for.

LLMs are great for certain tasks, like text generation, formatting content, translation, correction, chatbots that call functions, and more, but they do not seem (to me) to have a real advantage in the current context that would justify the hassle of implementing them.

Mar 16 '25 22:03 kitsumed

I appreciate you taking the time to read through the readme and taking my thoughts on features like this into consideration. To paraphrase those thoughts, I'm not interested in shipping TagStudio with models or hardcoded integrations to services (e.g. an OpenAI account, which I understand is definitely not what you're asking for here). What I would like to do, however, is at least allow for an interface through which TagStudio can interact with models in order to achieve the goal of machine learning assisted tagging for those who wish to have that.

I've been slowly and quietly working on a configurable "macros" system over on the macros branch that I think might just be the key to easily achieving something like this, depending on what's required from these models. This macro system is currently built to allow users to create their own configuration files that are essentially TOML files with a specific TagStudio schema. These files include conditions and actions to be performed when run on selected or given file entries. One "action" I have in particular is the ability to add tags to file entries based on content from external files, for example JSON sidecar files or perhaps a singular TXT or CSV file.

The key takeaway I'm getting at is that TagStudio will have the ability to:

Perform configured actions on selected/specified files (Currently operational)
Read and import data, including tag string data, from external files (Currently operational with JSON)
Specifically map imported strings to specific TagStudio tags, or map to None (Currently operational)
Provide custom menu items for each macro configuration file (Currently operational)
And more, but probably outside the scope of what's required here
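
To make the "import tag strings from an external file and map them to tags or None" idea concrete, here is a small Python sketch. It is not the macro system's actual schema or code: the sidecar layout (a JSON object with a "tags" list), the mapping and the function name are all assumptions for illustration.

```python
import json

# Hypothetical mapping from imported strings to TagStudio tag names.
# None means the imported string is deliberately dropped.
STRING_TO_TAG = {"bus": "Vehicle", "person": "Person", "traffic light": None}


def tags_from_sidecar(sidecar_text, string_to_tag, key="tags"):
    """Read tag strings from a JSON sidecar and map them to tag names.

    Strings that are unmapped, or mapped to None, are skipped.
    The order of first appearance is preserved and duplicates are removed.
    """
    data = json.loads(sidecar_text)
    result = []
    for raw in data.get(key, []):
        tag = string_to_tag.get(raw.strip().lower())
        if tag is not None and tag not in result:
            result.append(tag)
    return result
```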

For a specific example of what I mean by the menu items, here's the "Macros" menu on that branch with the first three menu options each actually just being TOML files on disk in a new .TagStudio/macros folder. Even the names are configurable inside each file to allow for clean menu options.

What I'll ask from you now is: how exactly does this model system handle the importing and exporting of data?

Assuming TagStudio can write or pass the filenames to be processed and the model can be executed via a subprocess, if there's a way for the model to write the results to a format such as JSON, XML, or CSV then I think this feature, including both your requests for image classification and object detection specifics, should be more than achievable under this macro system.

Mar 17 '25 00:03 CyanVoxel

Thanks for that response, honestly didn't think it would be this detailed 👌

The macro feature does indeed seem like it will be really useful in the future, and I wasn't aware it had that much functionality. I'm now more interested in it than before 😊.

I'd also like to point out that the idea of this feature is not to ship TagStudio with a trained model, but to give users the ability to use whatever model they want and map its outputs to the TAGs they want. (I wasn't sure if I'd made this clear enough in my first post). I understand, however, that implementing ultralytics or an alternative can be seen as making the application dependent on a “service”, even locally, although I hadn't thought of it that way at first because TagStudio has support for “dupeGuru” files, which I also saw as some sort of local service.

What I'll ask from you now is: how exactly does this model system handle the importing and exporting of data?

Importing of data

There are two ways to use ultralytics: you can use it as a library and call inference to get results directly from code (https://docs.ultralytics.com/modes/predict/#inference-sources), or you can do it from the CLI with yolo detect predict model=path/to/best.pt source='./bus.jpg' (example taken from https://docs.ultralytics.com/usage/cli/#predict). There are more parameters available in the docs.
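
As a rough sketch of the subprocess route, the CLI call quoted above could be assembled and launched from Python like this. The helper names are hypothetical, and actually running the command assumes the ultralytics package (which provides the yolo executable) is installed.

```python
import subprocess


def build_predict_command(model_path, source, task="detect", extra_args=None):
    """Build the argument list for `yolo <task> predict model=... source=...`."""
    command = ["yolo", task, "predict", f"model={model_path}", f"source={source}"]
    # Extra CLI arguments are passed in the same key=value form, e.g. save_txt=True.
    for key, value in (extra_args or {}).items():
        command.append(f"{key}={value}")
    return command


def run_predict(model_path, source, **kwargs):
    """Run the CLI and return the completed process.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    command = build_predict_command(model_path, source, **kwargs)
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.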

The ultralytics library/CLI is able to do predictions on the following files : https://docs.ultralytics.com/modes/predict/#image-and-video-formats

Exporting of data

I did a quick search in the ultralytics docs, and the prediction output can be saved directly as a text file when using the CLI by setting save_txt to true. According to the docs, it would follow this format: [class] [x_center] [y_center] [width] [height] [confidence], where each line is a new detection. (We are only interested in the class and the confidence.) https://docs.ultralytics.com/modes/predict/#inference-arguments
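
A file in that format is simple to parse. The sketch below assumes detection output with whitespace-separated fields and a numeric class index in the first column. As far as I can tell from the same docs page, the confidence column is only written when save_conf is enabled as well, so the parser treats it as optional. The `Detection` type and function name are made up for the example.

```python
from typing import NamedTuple, Optional


class Detection(NamedTuple):
    class_id: int                 # index into the model's class list
    confidence: Optional[float]   # None when the file has no confidence column


def parse_label_text(text):
    """Parse lines of `class x_center y_center width height [confidence]`."""
    detections = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (5, 6):
            raise ValueError(f"unexpected detection line: {line!r}")
        confidence = float(fields[5]) if len(fields) == 6 else None
        # The box coordinates are not needed for tagging, so they are dropped.
        detections.append(Detection(int(fields[0]), confidence))
    return detections
```

The class index still has to be translated to a class name using the model's own class list before it can be mapped to a TAG.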

Taking this into account, there is indeed a file the macro could parse to tag entries. I'm just not sure how advanced the macro feature's parsing system is, or whether it would actually work, as I have never tried it.

What worries me about the macro system is whether it would be able to build a list of all the detections by parsing a file that uses a unique format rather than a standard one, while also checking the confidence values so that new TAGs are only applied for detections above the required minimum. I'm assuming that when you said "And more, but probably outside the scope of what's required here", some of these features do not exist yet but would be possible. Am I right?

Mar 17 '25 01:03 kitsumed

I guess a random question from me about the implementation details I see here: are these auto-tagging AI models unable to learn from your preexisting library, and only able to add what the model was originally trained on? Especially in some tagging scenarios, not being able to use the data inside the library to suggest tags seems like it could get quite annoying.

Like, I dunno, the scenario in my head is that you want to find and tag everything that is line art or something, but the model that came with whatever macro wasn't trained specifically on line art, so it won't tag it. If it could be guided to do so, that would make this feature a lot better (and, for my uses, actually usable).

Then again, I guess at that point one should just make another macro that trains a model using the data in the database and then use that to do the tagging.

(Maybe I'm just not understanding, I dunno.)

Apr 21 '25 05:04 Thecreatre

I guess a random question from me about the implementation details I see here: are these auto-tagging AI models unable to learn from your preexisting library, and only able to add what the model was originally trained on? Especially in some tagging scenarios, not being able to use the data inside the library to suggest tags seems like it could get quite annoying.

You can take an existing YOLO model and continue training it on additional data, but that would require you to either do it manually or implement a way in the TagStudio UI to continue training the model. I'm sure the ultralytics module allows this, but that would be extra work, although I see how it would be useful. That said, it would probably be limited to image classification training, as object detection requires the dataset to have masks (annotations marking where each object is, typically bounding boxes).
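
For reference, continuing training with the ultralytics Python package might look roughly like the sketch below. The function name and default values are assumptions for the example, and it presumes the package is installed and the dataset directory is already laid out the way ultralytics expects for classification.

```python
def fine_tune_classifier(weights, dataset_dir, epochs=10, image_size=224):
    """Continue training a classification model on a user-provided dataset.

    The import is done lazily so that defining this function does not
    require ultralytics to be installed.
    """
    from ultralytics import YOLO

    model = YOLO(weights)  # path to an existing classification checkpoint
    # Training settings beyond these three are left at the library defaults.
    return model.train(data=dataset_dir, epochs=epochs, imgsz=image_size)
```

Training is far heavier than inference, so in practice this would probably need to be an explicit, user-triggered action.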

Then again, I guess at that point one should just make another macro that trains a model using the data in the database and then use that to do the tagging.

I'm not 100% sure I understand what you meant, but if it's along the lines of "users would need to train their own model", then yes, that's what I would have responded.

Apr 21 '25 06:04 kitsumed

That's a lot of words that I'm not sure I understand, but I appreciate the effort.

Apr 21 '25 20:04 Thecreatre

Although I dunno how one trains a model, and I dunno how one could extract all the data from TagStudio to do so this way. My head thinks it would have to be some plugin program that hooks into TagStudio and trains a model, or further trains an existing model with more data (like, I dunno, you used the model for tag suggestions, chose which ones were right for a few, or just added more stuff for it to look at, and then pulled up the plugin and had it run again).

Apr 22 '25 02:04 Thecreatre

it would probably be limited to image classification training, as object detection requires the dataset to have masks.

I'm not entirely sure what this means. Then again, what I'm looking for is for it to classify things (i.e. tag things).

But I don't know what the difference between the two is exactly, so, annoying.

Apr 22 '25 02:04 Thecreatre

But I don't know what the difference between the two is exactly, so, annoying.

Although I dunno how one trains a model, and I dunno how one could extract all the data from TagStudio

In short, image classification returns one result: the model looks at the whole picture and says "It's THAT thing", along with its confidence. Object detection, on the other hand, looks at the picture and says "I've found the following things, at these locations", along with its confidence for every object found.

What this means is that, when you're training an image classification model, you only need a picture and a tag name for each training sample. When training an object detection model, on the other hand, you need a picture, a mask that marks where the object is in the picture (for detection, usually a bounding box) and a tag name.
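
The difference in training data can be shown with two small record types. The names are hypothetical. The label line produced at the end follows the usual YOLO detection label layout (class index, then box center, width and height as fractions of the image size).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationSample:
    image_path: str
    tag: str              # one label for the whole picture


@dataclass(frozen=True)
class DetectionSample:
    image_path: str
    tag: str              # label of one object in the picture
    x_center: float       # box center and size, normalized to 0..1
    y_center: float
    width: float
    height: float


def to_label_line(sample, class_ids):
    """Format one detection sample as `class x_center y_center width height`."""
    return (
        f"{class_ids[sample.tag]} {sample.x_center} {sample.y_center} "
        f"{sample.width} {sample.height}"
    )
```

TagStudio entries already provide everything a `ClassificationSample` needs, while the box fields of a `DetectionSample` have no equivalent in a tag library.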

As TagStudio only has tags and no masks, the only realistic outcome I can see would be adding the ability to train an image classification model from TagStudio using the user's tags. That said, an object detection model trained with masks the size of the whole picture may work, but then it wouldn't really be an "object detection" model that returns the location of said objects anymore. We also don't need to know the location for TagStudio tagging.
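
One way such a training set could be prepared from a tagged library is sketched below. It only plans where files would go, using the one-folder-per-class layout that ultralytics classification training expects as far as I know (dataset/train/<class name>/...). The function name, the input shape and the rule for skipping entries are assumptions.

```python
from pathlib import PurePosixPath


def plan_classification_dataset(entries, wanted_tags, root="dataset", split="train"):
    """Return (source, destination) pairs, one folder per tag.

    `entries` is an iterable of (file_path, tags). An entry is used only if
    exactly one of its tags is in `wanted_tags`, since a classifier expects
    a single label per picture.
    """
    plan = []
    for file_path, tags in entries:
        matching = [tag for tag in tags if tag in wanted_tags]
        if len(matching) != 1:
            continue  # no usable label, or ambiguous between several classes
        name = PurePosixPath(file_path).name
        destination = PurePosixPath(root) / split / matching[0] / name
        plan.append((file_path, str(destination)))
    return plan
```

Skipping ambiguous entries is a deliberate simplification: TagStudio entries often carry several tags, while image classification predicts exactly one class per picture.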

EDIT: You might understand better with a visual example. Some months ago I trained an object detection model that only detects speech bubbles; you can test it for free here by selecting YOLOv8m Speech Bubble (kitsumed). If you give it a comic or manga picture, you will see that the model returns the picture with all of the speech bubbles outlined in red or blue, along with their name, "speech bubbles".

Apr 22 '25 12:04 kitsumed

I see, very interesting. Your object detection point is kinda confusing to read, but I do agree that knowing where the object is probably isn't useful for tagging in TagStudio in most cases.

Apr 23 '25 06:04 Thecreatre