TagStudio
[Feature Request]: Improving search
Checklist
- [X] I am using an up-to-date version.
- [X] I have read the documentation.
- [X] I have searched existing issues.
Description
I know this is somewhat on the roadmap, but I thought I would share some specifics of how search should be improved. It is very important that work on search functionality starts early, so that the system is developed in a way that supports future features (i.e., we don't want to be in a situation where implementing a standard search feature would require major changes).
Solution
End goal: A fully featured search system. This could make use of Elastic and/or Opensearch. Desirable qualities:
- A Standard search syntax
- should be similar to Boorus and search engines - no one wants to learn a new syntax
- Grouping of items
- Many implementations of grouping treat each group as its own search, combining results as it goes up.
- Boolean operators (AND, OR, NOT) of items and groups
- Reserved characters and names:
- there should be reserved characters, such as quotation marks and colons, that mean something in search syntax. To search the character itself, it will need to be escaped.
- Any assignable field should be searchable. For example, with `filename:<query>` - this should be able to be used with any given type of field. Of course, by default, a term is assumed to be a tag.
- Searching text fields:
- RegEx seems like the easiest way to do this. Not everyone knows it, but it could be a mode.
- Handling entries with spaces (#112) - there are 2 ways this is usually dealt with:
  - Make all spaces actually underscores: if your tag is `a happy new year` it becomes `a_happy_new_year`, but the underscores do not get shown to the user. This means that spaces and underscores are effectively the same.
  - Grouping using quotations: `"a happy new year"` - however, this means that items with a quote in them will need to be escaped: a tag named `"wow thats cool"` would need to be searched with something like `"\"wow thats cool\""` instead.
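A minimal sketch of how the two approaches could work together, assuming hypothetical helpers `normalize_tag` and `tokenize` (neither is existing TagStudio code). It treats a backslash as escaping the next character and keeps quoted phrases together as one term:

```python
def normalize_tag(name: str) -> str:
    """Option 1: treat spaces and underscores as the same character."""
    return name.strip().replace(" ", "_")


def tokenize(query: str) -> list[str]:
    """Option 2: split on whitespace, keeping "quoted phrases" together.

    A backslash makes the next character literal, so \\" is a plain quote.
    """
    tokens: list[str] = []
    current: list[str] = []
    in_quotes = False
    has_token = False  # True once the current term has any content
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            current.append(query[i + 1])  # escaped character, taken literally
            has_token = True
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
            has_token = True
        elif ch.isspace() and not in_quotes:
            if has_token:
                tokens.append("".join(current))
            current, has_token = [], False
        else:
            current.append(ch)
            has_token = True
        i += 1
    if has_token:
        tokens.append("".join(current))
    return tokens
```

With this sketch, `tokenize('cat "a happy new year"')` yields two terms, and passing the second through `normalize_tag` gives the underscore form from the first option.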
Alternatives
No response
I agree with your suggestions on this, however there are already several open issues and PRs pertaining to these features. Boolean search is currently being tracked via #225 + #314 and is being implemented in #284/#310, with metadata search being tracked via #272 and being implemented in #284. I believe that grouping, escaping characters, and addressing tags with spaces are all coming with one or more of these PRs as well.
I don't like the accidental nature of PRs for such a fundamental feature, without settling on the design first. (And I'm not speaking about UI.) Both PRs went with very different considerations in mind.
yes
I'd probably go as far as suggesting to make an (E)BNF description of the search syntax before attempting to implement it. And there are other considerations as well, such as familiarity to a random user, syntax extensibility and compatibility with possible UI helpers.
I was actually thinking about this - in all honesty, I might make a PR of a markdown file detailing search syntax. Though, I still have not used TagStudio enough to be confident in it. If we want to do this, one thing I would like is a general consensus on reserved characters and such, though we could just update it before merging it.
Perhaps something like this:
Search syntax
This section describes the (planned) search syntax used in TagStudio
General structure
Searches are parsed from the inner-most group outward, then left to right. E.g., in a (b (c) d), c is parsed first, then b, then d, then a. Terms are space-separated, using " to escape spaces where needed. Note that the entire query can be treated as a large group.
Boolean operators
By default, search terms are in the AND mode. For example, cat fox would find entries with both cat and fox.
- NOTing terms: This is almost always done with the `-` symbol, though occasionally done with `NOT` or `!`. Eg, `cat -fox`.
- ORing terms: Common implementations:
  - `or` (english/python style, `cat OR fox`) - advantage of being readable, at the cost of visibility (it looks like normal text). This also means that you need to parse to change which category a term is in when you find `OR`, and complicates building searches linearly.
  - `|` or `||` (programming style, `cat | fox`) - advantage of being visible and easy, but you still need to do more parsing.
  - `~` (Booru style, `~cat ~fox`) - this has the advantage of making it obvious which items are included in the OR and which are not, which is good for users who may not be familiar with orders of operations (`dog ~cat ~fox` vs `dog cat | fox`). Disadvantage of not being able to start tag/text with `~` without escaping it.
- Grouping - This is almost always implemented with `()` - eg, `~(cat -fox) ~(fox -cat)` would act as an exclusive OR - it finds items that have only one of `cat` and `fox`. Without grouping, this would take multiple searches.
- Special operators:
  - `[]` is reserved for possible future functionality with nested tags.
    - An entry with `big fox` and `little cat` would match any combination of `(~big ~little) (~fox ~cat)`. This term might be used for something like `fox[big] cat[little]` to specify that a nested tag must belong to a specific parent.
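To make the interaction of default AND, `-`, `~` and `()` concrete, here is a rough sketch of a parser and evaluator. It assumes the Booru-style `~` for OR (only one of the options listed above, not a settled choice), does not handle quoting or escaping, and uses hypothetical names (`parse`, `matches`) that are not part of TagStudio:

```python
import re

# A token is an optionally prefixed "(", a ")", or a run of other characters.
TOKEN_RE = re.compile(r"[-~]?\(|\)|[^\s()]+")


def parse_group(tokens, pos=0):
    """Parse terms until ')' or the end of input; return (node, next_pos)."""
    required, excluded, any_of = [], [], []
    while pos < len(tokens) and tokens[pos] != ")":
        tok = tokens[pos]
        prefix = tok[0] if tok[0] in "-~" and len(tok) > 1 else ""
        body = tok[len(prefix):]
        if body == "(":
            node, pos = parse_group(tokens, pos + 1)
            if pos >= len(tokens):
                raise ValueError("unmatched '('")
            pos += 1  # skip the closing ')'
        else:
            node, pos = body, pos + 1
        {"": required, "-": excluded, "~": any_of}[prefix].append(node)
    return ("group", required, excluded, any_of), pos


def parse(query):
    """Treat the whole query as one large group."""
    tokens = TOKEN_RE.findall(query)
    node, pos = parse_group(tokens)
    if pos != len(tokens):
        raise ValueError("unmatched ')'")
    return node


def matches(node, tags):
    """Evaluate a parsed query against the set of tags on one entry."""
    if isinstance(node, str):
        return node in tags
    _, required, excluded, any_of = node
    return (
        all(matches(n, tags) for n in required)
        and not any(matches(n, tags) for n in excluded)
        and (not any_of or any(matches(n, tags) for n in any_of))
    )
```

In this sketch, `matches(parse("~(cat -fox) ~(fox -cat)"), tags)` is true for an entry tagged with exactly one of `cat` and `fox`, which is the exclusive OR described above.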
Searching fields
By default, search terms apply to tags. There are a few special exceptions:
- `empty` and `no fields` find entries that have not had any information attached.
- `untagged` and `no tags` find entries that may or may not have had information attached, but none of them are tags.
Common field attributes:
Fields have attributes too - searching them is commonly done in these ways. Note that <field> and [options] exist to demonstrate the syntax, and are not literals.
- Boolean:
  - `has_<field>:[True|False]` - search whether an entry has a field on it. Eg, `has_date:False` finds entries that do not have a date attached.
  - `<field>_count:[value]` - search based on how many items are in a field (eg, how many tags there are)
  - `in_<field>:[True|False]` - the field is a collation. `in_<field>:False` matches all entries that are not in ANY collation
- Matches: `<field>:[value]` - find entries where the field has a given value.
  - Default: `<field>:"[value]"` finds entries where the field CONTAINS the value for text entries. Using `*` matches all here (eg, `a_*_cat` matches anything where `*` is)
  - RegEx: Using RegEx syntax as the value (`/regex/`) searches with regex. Limited to text entries.
- Comparisons and Ranges (numerical values):
  - BETWEEN: `<field>:[min]..[max]` searches for entries where `field`'s numerical value is between `min` and `max`, inclusive.
  - Inequalities: `field:` followed by `>`, `<`, `>=`, or `<=` search for their respective inequalities. Eg, `date:<2024` finds where the date is before 2024.
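As an illustration of the range and inequality forms, the sketch below turns a single term into a field name and a predicate. `parse_numeric_term` is a hypothetical helper, and it assumes values are plain numbers (so `date:<2024` is treated as a comparison against the number 2024; real date handling would need more):

```python
import operator
import re

# Two-character operators are listed first so ">=" is not read as ">".
COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}
NUMBER = r"-?\d+(?:\.\d+)?"
RANGE_RE = re.compile(rf"({NUMBER})\.\.({NUMBER})")


def parse_numeric_term(term: str):
    """Turn `field:min..max` or `field:<op>value` into (field, predicate)."""
    field, sep, value = term.partition(":")
    if not sep or not field or not value:
        raise ValueError(f"not a field term: {term!r}")
    range_match = RANGE_RE.fullmatch(value)
    if range_match:
        low, high = (float(g) for g in range_match.groups())
        return field, lambda x: low <= x <= high  # inclusive on both ends
    for symbol, compare in COMPARATORS.items():
        if value.startswith(symbol):
            bound = float(value[len(symbol):])
            return field, lambda x, c=compare, b=bound: c(x, b)
    bound = float(value)  # raises ValueError for non-numeric values
    return field, lambda x: x == bound
```

The returned predicate can then be applied to the numeric value of that field on each entry.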
- `order:[term|<supported field>]` - This special operator specifies how the search should be sorted. In the case where it is specified multiple times, the first takes precedence (ie, subsequent items are sub-sorts, applied where the `value` returns the same after the first pass). `order` is processed last, and does not support grouping (due to negations and `OR`s being applied to groups).
  - terms might include `random`, `tagCount`, `aspect_ratio`, `duration`, etc. which can be used to order results as desired.
  - supported fields can also impact the order: `dateAdded`, `title` are examples.
  - If multiple `order` terms are present (eg, `order:tagCount order:title`), they are processed in the order of the operations. The items are sorted using the first order in the group (by `tagCount`) - then any items that have the same `tagCount` are sorted by `title`.
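The sub-sort behaviour of repeated `order` terms could be sketched like this, assuming entries are plain dicts keyed by the order term (a hypothetical representation, not TagStudio's data model) and a hypothetical `sort_entries` helper:

```python
import random


def sort_entries(entries, order_terms):
    """Sort entries by several `order:` keys; earlier keys take precedence."""
    ordered = list(entries)
    # Python's sort is stable, so applying the keys from last to first leaves
    # the first key as the primary sort and later keys as tie-breakers.
    for term in reversed(order_terms):
        if term == "random":
            random.shuffle(ordered)
        else:
            ordered.sort(key=lambda entry, t=term: entry[t])
    return ordered


entries = [
    {"title": "b", "tagCount": 2},
    {"title": "a", "tagCount": 2},
    {"title": "c", "tagCount": 1},
]
by_count_then_title = sort_entries(entries, ["tagCount", "title"])
```

Here the two entries with the same `tagCount` end up ordered by `title`, matching the `order:tagCount order:title` example.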
Tag/field rules
Generally, tags with characters that are used for search (spaces, ", :, ~, -, (), /, etc.) should be avoided if possible. Specifically, in the following situations, they will need to be escaped:
- `"` - Will always need to be escaped.
- Tags starting with boolean operators (`~`, `!` and/or `-`, etc.) will always need to be escaped
- In tags that contain `:`, the `:` must be followed by either `_` or a space, the colon must be escaped, or the tag must be escaped in quotes.
  - This excludes the first character of the tag.
- Any tag containing unmatched grouping symbols `(` should be escaped.
- Text fields starting/ending in `/` will need to be escaped.
Very nicely written @mm12. I have just one big problem with your suggestion.
Tag rules
Generally, tags with characters that are used for search (spaces, `"`, `:`, `~`, `-`, `()`, `/`, etc.) should be avoided if possible. When searching items with these characters, most will need to be escaped with `\`. Spaces can also be escaped by putting the tag name in quotes (`"tag name here"`)
A lot of the suggested forbidden characters are super useful and popular in tag names and shouldn't have to be escaped. Some notable tag examples from Danbooru: :3, \(^o^)/, fate_(series), fate/grand_order, girls'_frontline, two-tone_hair...
The only restrictions I support are the following:
- All whitespace in tags must be escaped.
- Tags that start with `~`, `-`, `!` or an unmatched `(` must be escaped.
- Tags that are exactly equal to an existing operator must be escaped.
- Tags that exactly overlap with a field search must be escaped.
  - This should only apply for currently existing fields, so `honkai:_star_rail` should not need to be escaped unless there is an entry with a `honkai` field.
And if we want to allow wildcards in tag searches:
- All asterisk `*` characters must be escaped.
Also, can you please clarify what you meant when you said this?
- Special operators:
  - `[]` is reserved for functionality with nested tags. Currently, all functionality can be achieved with grouping and basic boolean operators.
And I don't understand what you meant when you said this:
`order:[value|<supported field>]` - This special operator specifies how the search should be sorted. In the case where it is specified multiple times, the first takes precedence (ie, subsequent items are sub-sorts, applied where the `value` returns the same after the first pass).
A lot of the suggested forbidden characters are super useful and popular in tag names and shouldn't have to be escaped. Some notable tag examples from Danbooru:
:3
Good point. To clarify, I didn't mean they should always be escaped, just that they often will need to be, in some contexts. Emote tags (:3, \(^o^)/, etc.) will be fine. Needing to escape them is contextual. It matches the restrictions you propose, with the exception that in the case of things like abc:efg, it does not matter if abc is a valid field or not. Unless the colon is followed by an _ or an escaped space, it must be escaped.
o\:3 o:_3 "o: 3" "o:3" - OK
o:3 - not ok
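A small sketch of the colon rule as stated in this comment, using a hypothetical `split_field_term` helper. It assumes the term has already been split out of the query with its quotes and backslashes still present, and it does not handle an escaped backslash directly before a colon:

```python
def split_field_term(token: str):
    """Return (field, value) if the token is a field search, else None.

    A colon separates field from value only when it is not the first
    character, is not preceded by a backslash, and is not followed by
    `_` or a space. Quoted tokens are always treated as tags.
    """
    if len(token) >= 2 and token[0] == token[-1] == '"':
        return None
    for i, ch in enumerate(token):
        if ch != ":" or i == 0:
            continue
        if token[i - 1] == "\\":
            continue  # escaped colon
        if token[i + 1 : i + 2] in ("_", " "):
            continue  # colon followed by `_` or a space stays part of the tag
        return token[:i], token[i + 1 :]
    return None
```

Under this sketch, `o:3` is read as a search on a field named `o`, while `o\:3`, `o:_3`, `"o: 3"`, `"o:3"` and `:3` are all read as tags.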
Also, can you please clarify what you meant when you said this?
All the functionality that the current tag system supports can be described by the spec (from my understanding). In the case where the system is upgraded, [] are there to aid in tag parent/child relationships.
And I don't understand what you meant when you said this
The `order` metatag allows you to specify the order of results. Potential values might be terms such as `random`, `tagCount`, etc., or might be fields that support sorting, such as `dateAdded` or `title`.
I will edit the comment to clarify these.
Hey @mm12, I just got done implementing some field search syntax in my PR #310, and I was hoping to get your input on it.
Specifically, I implemented the following:
<field>:[value]
has_<field>
has_<field>:[True|False]
I also have boolean operators implemented pretty much identically to your suggestions from previous commits, except all tags and parentheses need trailing whitespace in my syntax. I haven't implemented quotation marks, wildcards, or regular expressions in my syntax.
I notice that you still say that unmatched parentheses and colons without spaces should be escaped in your current spec. In order to try to understand your reasoning, my current PR ignores these restrictions. If you clone my repository, try searching for >:) or NieR:Automata to see what I mean. I am very interested to hear if you would point out any issues with my syntax due to allowing these sorts of tags without escaping. Thanks!
Hey @mm12, I just got done implementing some field search syntax in my PR #310, and I was hoping to get your input on it.
It looks like it is what we need right now - adding functionality that is needed but lacking.
Though, what I am looking to do is make this application scalable in terms of entry count and search complexity. To address this, I have 2 suggestions that can be a starting point here:
- Storing stuff in a giant JSON file (and parsing it) costs a lot. We should really be using SQL or something. The DB file can be stored in the same place.
- This takes some effort. If we do this, I would hope we would have better import utilities than we have now.
- Writing our own search sucks. We should use something like Elastic or OpenSearch. Though, my experience with them is in Rails, not Python, so take that for what you will.
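For a sense of what the SQL suggestion might look like, here is a rough sketch using Python's built-in `sqlite3` module. The table layout and the `entries_with_all_tags` helper are assumptions made up for illustration, not a proposed or existing TagStudio schema:

```python
import sqlite3

# Hypothetical schema: entries, tags, and a join table linking them.
SCHEMA = """
CREATE TABLE entries (id INTEGER PRIMARY KEY, path TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE entry_tags (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (entry_id, tag_id)
);
"""


def entries_with_all_tags(conn, tag_names):
    """Return paths of entries that carry every tag in tag_names (AND search)."""
    names = list(tag_names)
    if not names:
        return []
    placeholders = ", ".join("?" for _ in names)
    rows = conn.execute(
        f"""
        SELECT e.path
        FROM entries AS e
        JOIN entry_tags AS et ON et.entry_id = e.id
        JOIN tags AS t ON t.id = et.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY e.id
        HAVING COUNT(DISTINCT t.id) = ?
        ORDER BY e.path
        """,
        (*names, len(set(names))),
    )
    return [path for (path,) in rows]
```

The schema can be created with `conn.executescript(SCHEMA)` on a connection from `sqlite3.connect(...)`; the default AND search then becomes a single query instead of a scan over a parsed JSON file.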
I have been thinking a lot about how tags, fields, and field contents are identified. Currently, @mm12's suggestion has quotation marks used to facilitate a more literal representation of tag and field identifiers:
General structure
[...] Terms are space-separated, using `"` to escape spaces where needed. Note that the entire query can be treated as a large group.
And @mm12's suggestion has a regular expression option for matching field contents:
Common field attributes:
- [...]
- RegEx: Using RegEx syntax as the value (`/regex/`) searches with regex. Limited to text entries.
My suggestion for string matching is this:
Use Cases
Tag identifiers, field identifiers, and field content would all use the exact same text matching system, except that field content would match possible substrings rather than needing to match the whole string like in the other two cases.
Delimiters
Surrounding an expression with a delimiter would allow users to include whitespace while giving an indicator for the syntax. Multiple delimiter options give users a way to avoid unnecessary escaping.
- Single quotes `'`
- Double quotes `"`
- Back quotes `` ` ``
- Forward slashes `/`
Different Syntaxes
- A wildcard syntax as the default without delimiters, or when using the prefix `WC` or `wildcard` with delimiters.
  - This would allow users to omit whitespace, or to replace it with underscores `_` or dashes `-`.
  - This should also be case insensitive and allow users to omit punctuation, so that `McDonald's` can be searched as `mcdonalds`. This is inspired by the current (albeit not working) system in Tag Studio: https://github.com/TagStudioDev/TagStudio/blob/ce87b11fbd642e688f0856e055fb77257eff15c1/tagstudio/src/core/utils/str.py#L6-L26
  - Additionally, this would introduce an asterisk `*` as a wildcard to match zero or more of any characters. This is inspired by the wildcard features that some boorus have, and doesn't require any escaping by default.
  - This is similar to glob patterns, but I see implementing the other glob-like wildcards, like question mark `?` as a single character wildcard or square brackets `[...]` as character classes, to be unnecessary and counterproductive.
- A literal syntax as the default when surrounded by single quotes `'`, double quotes `"`, or back quotes `` ` ``, or when using the prefix `L` or `literal` with forward slash `/` delimiters.
  - This would be case sensitive, and require every character to match, including whitespace and punctuation.
- A regular expression syntax as the default when surrounded by forward slashes `/`, or when using the prefix `RE` or `regex` with other delimiters.
  - This would essentially just pass the user's text to Python's (or SQLite's) regex engine with minimal reformatting.
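One possible reading of the wildcard syntax, sketched with hypothetical helpers (`wildcard_key`, `wildcard_to_regex`, `wildcard_match`). It assumes that "omit punctuation" means every non-alphanumeric character is ignored on both sides of the comparison, which also means a literal `*` in a tag name cannot be matched in this mode:

```python
import re


def wildcard_key(text: str) -> str:
    """Lowercase and drop whitespace, underscores, dashes and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Compile a wildcard pattern; `*` matches zero or more characters."""
    parts = (re.escape(wildcard_key(part)) for part in pattern.split("*"))
    return re.compile(".*".join(parts))


def wildcard_match(pattern: str, name: str, substring: bool = False) -> bool:
    """Match identifiers in full, or field content by substring."""
    regex = wildcard_to_regex(pattern)
    key = wildcard_key(name)
    return bool(regex.search(key) if substring else regex.fullmatch(key))
```

The `substring` flag reflects the use-case split above: tag and field identifiers must match in full, while field content may match anywhere in the text.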
Escaping
Escaping using backslash is actually kind of horrifying in this context, because there are three scenarios we would need to simultaneously accommodate with our system:
- Some number of backslashes appearing at the end of the string
- Some number of backslashes followed by the delimiter appearing in the middle of the string
- A significant number of backslashes being used to escape portions of a regular expression string
If we wanted to escape every backslash purely for the sake of simplifying the first two cases, then we are creating backslash hell for anyone who wants to use backslashes in regex syntax. And if we wanted to "pass through" backslashes to our syntax, we would have to do so selectively in order to accommodate the first two cases. Then we would suddenly have very different rules for strings of backslashes followed by delimiters compared to strings of backslashes standing on their own.
For these reasons, I prefer a "padding" approach. The only rules the user needs to know are that all pairs of delimiter characters are reduced to a single character, and that if they don't escape a delimiter character, then it may end up being interpreted as the end of the string. (Specifically if it is followed by whitespace, the end of the search query, or, when not matching field content, a colon : character.)
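A sketch of how the padding rule could be read by a tokenizer, using a hypothetical `read_delimited` helper. Backslashes are passed through untouched, doubled delimiters collapse to one, and a single delimiter only closes the string when followed by whitespace, the end of the query, or (outside field content) a colon:

```python
def read_delimited(query: str, start: int, field_content: bool = False):
    """Read a delimited string that begins at query[start].

    Returns (text, index_just_after_the_closing_delimiter).
    """
    delim = query[start]
    out = []
    i = start + 1
    while i < len(query):
        ch = query[i]
        if ch != delim:
            out.append(ch)  # includes backslashes, which are not special here
            i += 1
            continue
        nxt = query[i + 1 : i + 2]
        if nxt == delim:
            out.append(delim)  # doubled delimiter -> one literal delimiter
            i += 2
        elif nxt == "" or nxt.isspace() or (nxt == ":" and not field_content):
            return "".join(out), i + 1  # closing delimiter
        else:
            out.append(ch)  # lone delimiter mid-string is kept as-is
            i += 1
    raise ValueError("unterminated string")
```

Because backslashes are never consumed, a regex such as `/\d+\\/` reaches the regex engine exactly as typed.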
Has Field
I'm doing away with the has_<field>:[true|false] syntax and replacing it with a has:<field> syntax. has_<field>:false can be replicated with NOT has:<field>
Closing in favor of #600