TagStudio
TagStudio copied to clipboard
Add search by metadata
Context
Related to issue #272
Features / Changes
- Major refactoring over core.library.Library.search_library method
- Add method for parsing keys for file metadata from search query
- Add support for negative query
Usage example
In a new version, specifying all tags with whitespace as usual will return result from AND search type, regardless of one selected in UI:
tag1 tag2 - return all entries, tagged with 'tag1' AND 'tag2'
If you want to use OR, you need to specify OR search type in UI, and provide tags, separated by '|' symbol
tag1 | tag2 - (with OR search type) return all entries, tagged with 'tag1' OR 'tag2'
tag1 tag2 | tag3 - (with OR search type) return all entries, tagged with 'tag1' and 'tag2' OR with 'tag3'
Search by non-tag metadata is also available by 'key: value' syntax:
description: desc - return all entries, which description contains 'desc'
tag1 ; description: desc - return all entries, tagged with 'tag1' AND which description contains 'desc'
description: desc | description: smth:
- (with OR search type) return all entries, which description contains either 'desc' or 'smth'
- (with AND search type) return all entries, which description contains both 'desc' and 'smth'
Negative query can be set by specifying lookup value as 'null':
tag1; description: null - return all entries, tagged with 'tag1' and not having a description (or having a null description)
Original negative query tags like empty or no author also supported:
no author tag1; description: desc - return all entries without author, tagged with 'tag1' and description containing 'desc'
Refactoring note
For ease of maintenance, many functions of original code were extracted to new methods in library.Library class. To avoid clutter, I extracted them even further to a new class 'library.Filter'. This should help maintain the codebase in future, but due to lack of tests, I decided to extract to 'Filter' only new methods, and import any required variables and methods, to aviod breaking other functionality of 'library.Library' class:
class Filter:
def __init__(self, lib: Library) -> None:
# self.lib = lib
self.ext_list = lib.ext_list
self.entries = lib.entries
self.is_exclude_list = lib.is_exclude_list
self.get_field_attr = lib.get_field_attr
self._tag_strings_to_id_map = lib._tag_strings_to_id_map
self.get_tag_cluster = lib.get_tag_cluster
self.library_dir = lib.library_dir
self._field_id_to_name_map = lib._field_id_to_name_map
self.get_field_obj = lib.get_field_obj
self.missing_files = lib.missing_files
This class was added in last commit, so this can be reverted with ease.
Other possible side effects
Changed code works fine at a first glance, but it is not clear for me how exactly original code worked on some edge cases, especially with coalitions.
I can write tests for it, but I would like to have test cases, for deeper understanding on how should it work
cherry-piked commit with tests All github actions passed except for ruff, possibly because of files I didn't touch. On my machine, I get no errors on files I edited:
# at TagStudio/tagstudio
> ruff check --config ../pyproject.toml src/core/library.py
All checks passed!
> ruff check --config ../pyproject.toml tests/core/
All checks passed!
PR is ready for review
@lunaro-4 can you please rebase your branch? There are some changes which are already merged in main branch, eg .github/workflows/pytest.yaml
also fix the formatting in your code, so the ruff workflow pass, thanks
@yedpodtrzitko
- [x] Rebase done
- [x] All checks passed
for the negative syntax it would be nice if it work the same way as with google, ie. if I want to exclude something, I could write -author:foo instead of not author:foo. But that's just wishful thinking, the decision is not up to me.
Thank you for the feedback, I will comment again with a mention in a few days, when I apply your suggestions
Would getting the field refactor in #157 tested and pulled first help with this PR, or at this point would it be like pouring water over a grease fire? (Not to compare this PR to a fire, it's just a big feature on top of some of my worst code in the pre-existing codebase). Since the database will be migrating anyway at some point, it wouldn't matter if the existing field types were locked in as constants that would be easier to work with while we're still using JSON.
Would getting the field refactor in #157 tested and pulled first help with this PR, or at this point would it be like pouring water over a grease fire? (Not to compare this PR to a fire, it's just a big feature on top of some of my worst code in the pre-existing codebase). Since the database will be migrating anyway at some point, it wouldn't matter if the existing field types were locked in as constants that would be easier to work with while we're still using JSON.
any substantial PR will be most likely working with library.py, so there will never be better time to merge it than "now", as the number of potential conflicts in code will be always teh same, but working with the constants instead of magic numbers will make it easier to work with the existing code. I just rebased the PR to fix the conflicts, so from my point of view it's ready for merge.
@yedpodtrzitko @CyanVoxel Thank you for your patience, I had some IRL stuff last week and only now am able to return to the project.
Here is overview of last changes:
- Proper negative prompt: you can specify field contents, you want to exclude, by specifying field name with
-in front of it:-description(in part without key specified) will exclude any entry withdescriptionfield-description: foowill exclude all entries, containingfooin description.-description: foo; description: barwill select all entries, that containbarbut do not containfooin description field
- Minor code refactoring: applied feedback from code review:
- entity metafields' data is not lost, if entity has multiple fields with same ID
- some original code rewritten for better readability
- new
SpecialFlagclass
- Prepared for merge
- [x] New code covered with tests
- [x] Merged with one of recent commits
- [x] All GitHub workflows green
I can also write update for README, to clarify new search syntax, if you approve my changes
Words with capital letters in text fields cannot be searched.
If the description is Banker goes to the circus then searching description: goes to the circus returns the entry, but description: Banker does not. Only the capitalization in the original field matters, capitalization in the search still does not matter. Searching description: GOES to THE CiRcUs still returns the entry.
I assume this bug applies to the Title, Author, Artist, URL, Description, and Notes fields, though I did not check them all.
A ; no author search does not return an entry unless the entry has at least one tag, even though author: null and artist: null return the entry.
-description: foo does not return any entries no matter what I've tried with my entries. -description: foo; description: bar also does not appear to work.
this query still doesnt work as I'd expect. Am I doing something wrong?
Sorry for not adressing this before.
You actually need to choose OR search type in GUI for it to work as you expect. Otherwise, if you leave it as AND, which is default value, you should recive about the same results as if you were using only ;
@samuellieberman
Thank you for pointing things out, will fix and add additional tests
Changes overview:
- Multiple minor changes from code review by @yedpodtrzitko
- New class
KeyNameConstantsfor constant names, replaces hardcoded field names like "UNBOUND" and "EMPTY" - Fix incorrect behavior with multi-word special flags like
no author - Fix negative prompt parsing
I also had to change search test from test_lib.py file, as --nomatch-- case worked incorrectly with negative prompt implementation.
Thank you for patience, will be waiting for new review.
@eivl
The intention behind this PR is to give the user the ability to search entries by multiple custom fields. Current version of TagStudio only supports search by tags and filename, with this PR it is possible to filter entries by other fields, such as Description.
At first, there was only one search_libraty method, that did all the work. What I did was creating new method parse_metadata that parses arguments from search query, and extracting functionality from search_library into a separate class, called Filter. On initialization, this class copies all necessary information from Library. After that, Filter.filter_results could be called to get all entries, that fit the parsed search query.
Reading the query
Library.parse_metadata method splits the query into multiple parts, returning a list of dictionaries:
- list element: query parts, which are separated by
|symbol in query - dictionary element: query keys, provided inside query parts and separated by
;symbol, and it's value (if:symbol is provided)
Query keys that do not have values added to the 'unbound' dictionary key (name set in KeyNameConstants.UNBOUND_QUERY_ARGUMENTS_KEYNAME constant), and later treated as a whole query in original code.
negative prompt values are extracted from unbound values and put in dictionary key, specified in KeyNameConstants.EMPTY_FIELD_QUERY_KEYNAME. These values filter out entries, containing non-null values of specified query key.
Filtering the entries
Filter.filter_results method contains a nested loop, that iterates over the output from parse_metadata.
It iterates over list elements (for query_part in split_query), then each entry (for entry in self.entries), and then for each entry it iterates over contents of dictionary (for key, value in query_part.items())
If the entry does not break loop, it is considerate valid for query and is added to list filtered_entries.
The filtered_entries list is created for each part of the query, and then added to pre_results list.
The pre_results is a list of lists, and each element pre_results[n] is a representation of parse_metadata element, as if the whole query was just `parse_metadata()[n]'.
pre_results elements are then combined into a single set, depending on SearchMode, selected in GUI.
I see a merge conflict, and will try to resolve it this weekend
I also would like to hear some tips to make code more readable. For me, this level of nesting is more or less comfortable and I do not know what to do with if statements in a loop at 2263 (for key, value in query_part.items():)
UPDATE: after further inspection, since the project moved to SQL, I will need a significant amount of work to adapt my changes to new structure, so I will not be able to finish the PR this weekend