List object IDs (images, datasets, projects, ...) by tag ID
Description
Hello Eric,
I want to propose a function, get_object_ids_by_tag, to list the objects attached to tags.
It also includes a parameter describing how multiple tags should be combined, using one of three set operations [union, intersection, difference].
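To make the combination parameter concrete, here is a minimal sketch of how per-tag ID lists could be merged. The helper name `combine_id_sets` is hypothetical and is not the code in this PR; it also assumes that "difference" means the IDs under the first tag minus the IDs under every later tag.

```python
from functools import reduce


def combine_id_sets(id_sets, operation="union"):
    """Combine several collections of object IDs with one set operation.

    Hypothetical helper for illustration only (not part of ezomero).
    Assumption: "difference" is the first collection minus all the others.
    """
    operations = {
        "union": set.union,
        "intersection": set.intersection,
        "difference": set.difference,
    }
    if operation not in operations:
        raise ValueError(f"operation must be one of {sorted(operations)}")
    sets = [set(ids) for ids in id_sets]
    if not sets:
        return []
    # Fold the chosen operation over the sets, left to right.
    return sorted(reduce(operations[operation], sets))


# IDs of the objects linked to two tags (made-up values).
ids_tag_a = [1, 2, 3]
ids_tag_b = [2, 3, 4]
combined = combine_id_sets([ids_tag_a, ids_tag_b], "intersection")
```

With these made-up IDs, "intersection" keeps only the objects carrying both tags.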
Checklist
- [ ] Recompile the doc
- [ ] Add new unit tests
- [ ] Clean style in the spirit of PEP8
For reviewers
- Check that the PR title is short, concise, and will make sense 1 year later.
- Check that new functions are imported in the corresponding __init__.py.
I like the functionality, but I wonder if it deviates from the current style. We aren't really handling objects generically and instead use filter functions, but those are limited to filtering sets of image IDs. This is because omero-py's getObject() works pretty well on its own, ~but the documentation for searching on e.g., tag value is not great (but I do think it is possible).~
Thoughts @erickmartins ?
Edit: You can't use tag value with getObjects, which is why they have a separate function for that, which is used in the code submitted in the PR :)
Yeah, I think I generally agree that, overall, we would prefer to keep gets fairly generic (especially for IDs) and have functionality like this in filter functions. I realize that there's no generic get_object_ids (and we would like to have a generic filter-anything-by-anything function in the future), but I still think this would be better served by a filter_by_tag_id, overall.
(for the record, I think that can be more generic than filter_by_tag_value, which currently only accepts Image IDs. Having object type and list-of-IDs as an argument is a better way to do it.)
Doing it with filter_by_tag_id, to get all images with a given tag, I first have to get all the images across the whole Group. And that takes some involved looping in the case of Screens.
I use tags very much like folders, to group Images, Datasets, Projects, Wells, and not so much to say "image A is this" (I leave that to key-value pairs). For example, I would use a tag to select at random images on which to add ground truth segmentation, or for someone to show me what images I have to process, ...
Hence I need an easier way to retrieve Objects with a certain tag. (The IJ macro extension seems to use it that way too :D )
That said, I understand your point of keeping the gets generic. So how about instead adding a tag parameter to get_project, get_dataset, get_image and get_well ? Would that be ok for you?
(the "union", "intersection", "difference" things were here because I could, it's fine to do without)
Let me know!
I think this is probably a good compromise - eventually we should probably have more generic gets (that might be able to get, e.g., all images in a group) and a more generic filter function that can filter anything by anything. In the meantime, I think I'm fine with an optional argument on gets - @mellertd ?
I would favor a parameter that takes a dictionary for searches, with keys belonging to a restricted set for which we have queries built. so it would look like search_params = {'tag_name': 'somename'}
This is similar in concept to how OME does it, except I would want to thoroughly document each key that we implement. And then we could stack queries to make them more complex, like `search_params = {'tag_id': 230, 'kv_pair': ('genotype', 'wt')}`
But keep in mind, there is a point at which I would ask, "why?". Recall with ezomero we aren't trying to replace omero-py, we are supplementing. There is no inherent problem with using a mix of functions and, if performance is important, it might be worth it to just write some bit entirely using omero-py or even the python wrappers for the java classes themselves. Just a thought to keep the conversation grounded.
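As a rough illustration of the stacked `search_params` idea in this comment, the sketch below validates keys against a restricted set and intersects the results of each query. Everything here is an assumption for discussion: `search_ids` is a hypothetical name, and the per-key query functions are in-memory stubs standing in for whatever omero-py or HQL queries would actually be built.

```python
def search_ids(search_params, query_funcs):
    """Return the IDs that match every entry in search_params.

    Hypothetical sketch: query_funcs maps each supported key to a callable
    that returns the IDs matching a value. Unsupported keys are rejected.
    """
    unknown = set(search_params) - set(query_funcs)
    if unknown:
        raise KeyError(f"Unsupported search keys: {sorted(unknown)}")
    result = None
    for key, value in search_params.items():
        ids = set(query_funcs[key](value))
        # Stacking queries means keeping only IDs matched by all of them.
        result = ids if result is None else result & ids
    return sorted(result) if result is not None else []


# In-memory stand-ins for server-side queries (made-up data).
fake_tag_links = {230: [1, 2, 3]}
fake_kv_links = {("genotype", "wt"): [2, 3, 4]}
query_funcs = {
    "tag_id": lambda tag_id: fake_tag_links.get(tag_id, []),
    "kv_pair": lambda pair: fake_kv_links.get(pair, []),
}

matches = search_ids({"tag_id": 230, "kv_pair": ("genotype", "wt")}, query_funcs)
```

Adding a new searchable key would then mean registering and documenting one more entry in `query_funcs`, rather than adding another function parameter.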
That's a fair point - maybe a workflow where you are using omero-py directly (or HQL) to retrieve all images in a group, for example, and then a new filter_by_tag_id that does the other part of this is feasible?
@mellertd Isn't search_params = {'tag_name': 'somename'} overkill?
My point is not to make a search engine that would retrieve all objects with a given tag, KV-pair value, and so on.
Instead, I use tags in collaboration, to group stuff and say to each other, "look at those images/datasets, can you review/process/analyze them?" (the value of the tag doesn't really matter, it's just a name for human readability).
OMERO.web lists images, datasets, ... under tags just like if they were "attached" to those. So I would expect to find a direct way to list those objects from a tag:
My question is, do you want ezomero to replace
image_ids = [o.getId() for o in conn.getObjectsByAnnotations("Image", [tagId])]
dataset_ids = [o.getId() for o in conn.getObjectsByAnnotations("Dataset", [tagId])]
...
with
image_ids = ez.get_image_ids(conn, tag_id=tagId)
dataset_ids = ez.get_dataset_ids(conn, tag_id=tagId)
...
As for filter_by_tag_id, I guess that works too (I'd also need to have it implemented for Datasets, Wells, ...). But I want a function that gives me the IDs directly rather than filtering them, for the same reason that I don't list all the images of a group and then filter them by dataset ID.
The reason why I would prefer the dict approach is because there is, to my mind, no real reason to stop at tags as a search parameter, which means we would have to add a function parameter every time someone wants to add something new to search on. Thoughts, @erickmartins ?
That said, those list comprehensions you have there don't look bad, except for the fact that you'll possibly be loading a bunch of objects that you don't want to load. HQL is probably the fastest way to do that, performance-wise, but that is certainly clunkier to write (and read) than your list comprehension.
Either way, I am not quite clear on what your concern is with the filter approach (I don't like the filters, but probably for an entirely different reason). If you are using ezomero to get image_ids, the filter is a simple set intersection that only adds one remote function call. Stylistically, I view Project and Dataset as very different from an Annotation in terms of the object model, so treating Project and Dataset separately from TagAnnotations isn't weird to me, but maybe I am an outlier here.
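For reference, the filter approach described here could look roughly like the sketch below. It is hypothetical: `filter_by_tag_id` does not exist in ezomero at the time of this thread, and the signature (object type plus a list of IDs) simply follows the suggestion made earlier in the discussion. The only remote call is `conn.getObjectsByAnnotations`, the same one used in the list comprehensions above, so it shares their cost of loading the tagged objects.

```python
def filter_by_tag_id(conn, object_type, object_ids, tag_id):
    """Keep only the IDs in object_ids whose object is linked to tag_id.

    Hypothetical sketch, not an existing ezomero function.
    conn is expected to be a connected omero.gateway.BlitzGateway.
    """
    # One remote call: every object of this type linked to the tag.
    tagged_ids = {
        obj.getId()
        for obj in conn.getObjectsByAnnotations(object_type, [tag_id])
    }
    # Local set-membership test; the input order is preserved.
    return [obj_id for obj_id in object_ids if obj_id in tagged_ids]


# Intended usage with a live connection (not executed here):
# tagged_images = filter_by_tag_id(conn, "Image", image_ids, tag_id)
```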
I see your point of "why stop there", and I like solutions that cover possible future cases.
On the side of the OME model, key-value pairs and tags are both annotations (and are even stored in the same table if I am not mistaken). You could use them interchangeably to describe the content of an image or flag images.
But in my opinion, they are two very different things and should not be confused:
- Key-value pairs are very good at describing the image content in detail (can be quite a long list of KV attached as a single annotation)
- Tags are annotations attached to multiple objects (like images, datasets, ...), so they can be used just like folders in OMERO.web (you could also try linking a KV annotation to multiple objects, but I'm afraid to go down that road).
So if I stick to those differences and tell my users to follow those guidelines, I need the handy dandy functions of ezomero to help them script against those same concepts:
- Directly list objects attached to a given tag ("inside a folder") -> what I'm suggesting
- Read the KV pairs attached to an object to know more about it
So I'm in favor of having dedicated functions/parameters for Tags, and others for KV pairs (instead of interchangeable parameters to handle all).
And I agree on the difference of Project/Dataset and TagAnnotation: Project & Dataset & Images can be annotated (tags can't), and have a hierarchy, thus there can be an "inheritance of annotations" (tags can only be grouped in tagsets). => one can put KV pairs or tags on the dataset instead of repeating them on individual images.
There could be dedicated handling of inheritance of annotation in ezomero too:
(back to my original function), we could have:
get_object_ids_by_tag(conn, "Image", [tag_id_used_for_dataset], use_inheritance=True)
that would list all the images contained by the tagged-parent-container. (task of obtaining images)
and something like
get_object_kv(conn, "Image", image_id, use_inheritance=True)
that would get all the annotations of the image, including those on the parent dataset and project. (task of describing images)
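To show what `use_inheritance` could mean in both cases, here is a small in-memory sketch. It does not talk to OMERO: plain dictionaries stand in for the tag links, the dataset contents, and the KV annotations, and both function names are hypothetical. It also assumes that, when the same key appears at several levels, the most specific level (image over dataset over project) wins; that rule is a choice made for this example only.

```python
def image_ids_by_tag(tag_id, image_tags, dataset_tags, dataset_images,
                     use_inheritance=False):
    """List image IDs carrying tag_id, optionally via a tagged parent dataset.

    Hypothetical sketch: image_tags and dataset_tags map an object ID to the
    set of tag IDs linked to it; dataset_images maps a dataset ID to its images.
    """
    ids = {img for img, tags in image_tags.items() if tag_id in tags}
    if use_inheritance:
        for dataset_id, tags in dataset_tags.items():
            if tag_id in tags:
                # Images inherit the tag from their parent dataset.
                ids.update(dataset_images.get(dataset_id, []))
    return sorted(ids)


def inherited_kv(image_kv, dataset_kv=None, project_kv=None):
    """Merge KV pairs from project, dataset, and image into one dict.

    Assumption: more specific levels override more general ones.
    """
    merged = {}
    for level in (project_kv, dataset_kv, image_kv):
        merged.update(level or {})
    return merged


# Made-up data: tag 20 sits on dataset 100, which contains images 2 and 3.
image_tags = {1: {10}, 2: set(), 3: set()}
dataset_tags = {100: {20}}
dataset_images = {100: [2, 3]}
inherited_ids = image_ids_by_tag(20, image_tags, dataset_tags, dataset_images,
                                 use_inheritance=True)

all_kv = inherited_kv(
    image_kv={"stain": "DAPI"},
    dataset_kv={"genotype": "wt"},
    project_kv={"genotype": "ko", "lab": "X"},
)
```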
I will present in more detail those ideas of Tags/KV/Project+Dataset usage and differences at the next AIMM meeting (1st of December, https://www.bioimagingnorthamerica.org/events/aimm-user-group/) if you are interested. The meeting is about discussing automatic tools for annotation, but my point is that we need to define how to use OMERO first, then make tools/scripts to use the structure and make fancy stuff.
Maybe I'll make a branch with those inheritance functions I'm talking about, and add other things, just to see what difference it would make.
Thanks for the discussion!
Hi Tom,
thanks for the good discussion. I totally understand the drive to have things that would make your workflows easier incorporated into ezomero (that's how we got started with ezomero in the first place, after all!), but there's always an impact in terms of maintainability and technical debt for us (as main maintainers), even when the code from the PR looks clean and nice.
I think it's partially on us to maybe have "stronger"/more specific guidelines on what fits ezomero - and I think we do need a discussion on what that actually is, and what our roadmap for the future looks like. I think it's essential to us that we're not duplicating functionality from omero-py, and that we're never going to fully replace it. That's outside the scope of what we would like to do, and what we can do. In this case, for example, the list comprehensions you had as examples are totally okay, except maybe for performance; at the same time, we cannot expect to have an ezomero function for every HQL query out there.
I think it's a delicate balance to reach and something we'll need to work on - thanks again for the PR to kickstart that discussion! I'll discuss this further with @mellertd and see if we can reach a consensus on what a function like this should look like, and what a better set of guidelines/roadmap for ezomero should be.
Certainly, the implementation (list comprehension, or HQL if you prefer it for performance) is not the issue; I can always rewrite it the ezomero way.
As you pointed out, it's much more a matter of what you want ezomero to be, and of the style of the functions and parameters. +1 for not letting me spoil your very nice library ;)
Let me know if you want further input from me, but I'll leave this PR as it is for now.