TagStudio icon indicating copy to clipboard operation
TagStudio copied to clipboard

Add search by metadata

Open lunaro-4 opened this issue 1 year ago • 15 comments

Context

Related to issue #272

Features / Changes

  • Major refactoring over core.library.Library.search_library method
  • Add method for parsing keys for file metadata from search query
  • Add support for negative query

Usage example

In a new version, specifying all tags with whitespace as usual will return result from AND search type, regardless of one selected in UI:

tag1 tag2 - return all entries, tagged with 'tag1' AND 'tag2'

If you want to use OR, you need to specify OR search type in UI, and provide tags, separated by '|' symbol

tag1 | tag2 - (with OR search type) return all entries, tagged with 'tag1' OR 'tag2'

tag1 tag2 | tag3 - (with OR search type) return all entries, tagged with 'tag1' and 'tag2' OR with 'tag3'

Search by non-tag metadata is also available by 'key: value' syntax:

description: desc - return all entries, which description contains 'desc'

tag1 ; description: desc - return all entries, tagged with 'tag1' AND which description contains 'desc'

description: desc | description: smth:

  • (with OR search type) return all entries, which description contains either 'desc' or 'smth'
  • (with AND search type) return all entries, which description contains both 'desc' and 'smth'

Negative query can be set by specifying lookup value as 'null':

tag1; description: null - return all entries, tagged with 'tag1' and not having a description (or having a null description)

Original negative query tags like empty or no author also supported:

no author tag1; description: desc - return all entries without author, tagged with 'tag1' and description containing 'desc'

Refactoring note

For ease of maintenance, many functions of original code were extracted to new methods in library.Library class. To avoid clutter, I extracted them even further to a new class 'library.Filter'. This should help maintain the codebase in future, but due to lack of tests, I decided to extract to 'Filter' only new methods, and import any required variables and methods, to aviod breaking other functionality of 'library.Library' class:

class Filter:
    def __init__(self, lib: Library) -> None:
        # self.lib = lib
        self.ext_list = lib.ext_list
        self.entries = lib.entries
        self.is_exclude_list = lib.is_exclude_list
        self.get_field_attr = lib.get_field_attr
        self._tag_strings_to_id_map = lib._tag_strings_to_id_map
        self.get_tag_cluster = lib.get_tag_cluster
        self.library_dir = lib.library_dir
        self._field_id_to_name_map = lib._field_id_to_name_map
        self.get_field_obj = lib.get_field_obj
        self.missing_files = lib.missing_files

This class was added in last commit, so this can be reverted with ease.

Other possible side effects

Changed code works fine at a first glance, but it is not clear for me how exactly original code worked on some edge cases, especially with coalitions.

I can write tests for it, but I would like to have test cases, for deeper understanding on how should it work

lunaro-4 avatar Jun 13 '24 09:06 lunaro-4

cherry-piked commit with tests All github actions passed except for ruff, possibly because of files I didn't touch. On my machine, I get no errors on files I edited:

# at TagStudio/tagstudio
> ruff check --config ../pyproject.toml src/core/library.py
All checks passed!
>  ruff check --config ../pyproject.toml tests/core/
All checks passed!

PR is ready for review

lunaro-4 avatar Jun 14 '24 14:06 lunaro-4

@lunaro-4 can you please rebase your branch? There are some changes which are already merged in main branch, eg .github/workflows/pytest.yaml

also fix the formatting in your code, so the ruff workflow pass, thanks

yedpodtrzitko avatar Jun 15 '24 00:06 yedpodtrzitko

@yedpodtrzitko

  • [x] Rebase done
  • [x] All checks passed

lunaro-4 avatar Jun 15 '24 07:06 lunaro-4

for the negative syntax it would be nice if it work the same way as with google, ie. if I want to exclude something, I could write -author:foo instead of not author:foo. But that's just wishful thinking, the decision is not up to me.

yedpodtrzitko avatar Jun 15 '24 09:06 yedpodtrzitko

Thank you for the feedback, I will comment again with a mention in a few days, when I apply your suggestions

lunaro-4 avatar Jun 15 '24 14:06 lunaro-4

Would getting the field refactor in #157 tested and pulled first help with this PR, or at this point would it be like pouring water over a grease fire? (Not to compare this PR to a fire, it's just a big feature on top of some of my worst code in the pre-existing codebase). Since the database will be migrating anyway at some point, it wouldn't matter if the existing field types were locked in as constants that would be easier to work with while we're still using JSON.

CyanVoxel avatar Jun 16 '24 04:06 CyanVoxel

Would getting the field refactor in #157 tested and pulled first help with this PR, or at this point would it be like pouring water over a grease fire? (Not to compare this PR to a fire, it's just a big feature on top of some of my worst code in the pre-existing codebase). Since the database will be migrating anyway at some point, it wouldn't matter if the existing field types were locked in as constants that would be easier to work with while we're still using JSON.

any substantial PR will be most likely working with library.py, so there will never be better time to merge it than "now", as the number of potential conflicts in code will be always teh same, but working with the constants instead of magic numbers will make it easier to work with the existing code. I just rebased the PR to fix the conflicts, so from my point of view it's ready for merge.

yedpodtrzitko avatar Jun 16 '24 04:06 yedpodtrzitko

@yedpodtrzitko @CyanVoxel Thank you for your patience, I had some IRL stuff last week and only now am able to return to the project.

Here is overview of last changes:

  • Proper negative prompt: you can specify field contents, you want to exclude, by specifying field name with - in front of it:
    • -description (in part without key specified) will exclude any entry with description field
    • -description: foo will exclude all entries, containing foo in description.
    • -description: foo; description: bar will select all entries, that contain bar but do not contain foo in description field
  • Minor code refactoring: applied feedback from code review:
    • entity metafields' data is not lost, if entity has multiple fields with same ID
    • some original code rewritten for better readability
    • new SpecialFlag class
  • Prepared for merge
    • [x] New code covered with tests
    • [x] Merged with one of recent commits
    • [x] All GitHub workflows green

I can also write update for README, to clarify new search syntax, if you approve my changes

lunaro-4 avatar Jun 26 '24 15:06 lunaro-4

Words with capital letters in text fields cannot be searched.

If the description is Banker goes to the circus then searching description: goes to the circus returns the entry, but description: Banker does not. Only the capitalization in the original field matters, capitalization in the search still does not matter. Searching description: GOES to THE CiRcUs still returns the entry.

I assume this bug applies to the Title, Author, Artist, URL, Description, and Notes fields, though I did not check them all.

samuellieberman avatar Jul 14 '24 18:07 samuellieberman

A ; no author search does not return an entry unless the entry has at least one tag, even though author: null and artist: null return the entry.

-description: foo does not return any entries no matter what I've tried with my entries. -description: foo; description: bar also does not appear to work.

samuellieberman avatar Jul 14 '24 19:07 samuellieberman

this query still doesnt work as I'd expect. Am I doing something wrong?

#284 (review)

Sorry for not adressing this before. You actually need to choose OR search type in GUI for it to work as you expect. Otherwise, if you leave it as AND, which is default value, you should recive about the same results as if you were using only ;

lunaro-4 avatar Jul 22 '24 17:07 lunaro-4

@samuellieberman

Thank you for pointing things out, will fix and add additional tests

lunaro-4 avatar Jul 22 '24 17:07 lunaro-4

Changes overview:

  • Multiple minor changes from code review by @yedpodtrzitko
  • New class KeyNameConstants for constant names, replaces hardcoded field names like "UNBOUND" and "EMPTY"
  • Fix incorrect behavior with multi-word special flags like no author
  • Fix negative prompt parsing

I also had to change search test from test_lib.py file, as --nomatch-- case worked incorrectly with negative prompt implementation.

Thank you for patience, will be waiting for new review.

lunaro-4 avatar Jul 27 '24 18:07 lunaro-4

@eivl The intention behind this PR is to give the user the ability to search entries by multiple custom fields. Current version of TagStudio only supports search by tags and filename, with this PR it is possible to filter entries by other fields, such as Description.

At first, there was only one search_libraty method, that did all the work. What I did was creating new method parse_metadata that parses arguments from search query, and extracting functionality from search_library into a separate class, called Filter. On initialization, this class copies all necessary information from Library. After that, Filter.filter_results could be called to get all entries, that fit the parsed search query.

Reading the query

Library.parse_metadata method splits the query into multiple parts, returning a list of dictionaries:

  • list element: query parts, which are separated by | symbol in query
  • dictionary element: query keys, provided inside query parts and separated by ; symbol, and it's value (if : symbol is provided)

Query keys that do not have values added to the 'unbound' dictionary key (name set in KeyNameConstants.UNBOUND_QUERY_ARGUMENTS_KEYNAME constant), and later treated as a whole query in original code.

negative prompt values are extracted from unbound values and put in dictionary key, specified in KeyNameConstants.EMPTY_FIELD_QUERY_KEYNAME. These values filter out entries, containing non-null values of specified query key.

Filtering the entries

Filter.filter_results method contains a nested loop, that iterates over the output from parse_metadata.

It iterates over list elements (for query_part in split_query), then each entry (for entry in self.entries), and then for each entry it iterates over contents of dictionary (for key, value in query_part.items())

If the entry does not break loop, it is considerate valid for query and is added to list filtered_entries.

The filtered_entries list is created for each part of the query, and then added to pre_results list.

The pre_results is a list of lists, and each element pre_results[n] is a representation of parse_metadata element, as if the whole query was just `parse_metadata()[n]'.

pre_results elements are then combined into a single set, depending on SearchMode, selected in GUI.

lunaro-4 avatar Sep 14 '24 12:09 lunaro-4

I see a merge conflict, and will try to resolve it this weekend

I also would like to hear some tips to make code more readable. For me, this level of nesting is more or less comfortable and I do not know what to do with if statements in a loop at 2263 (for key, value in query_part.items():)

UPDATE: after further inspection, since the project moved to SQL, I will need a significant amount of work to adapt my changes to new structure, so I will not be able to finish the PR this weekend

lunaro-4 avatar Sep 14 '24 12:09 lunaro-4