[Feature] De-duplicating auto-taggable strings

Open kermieisinthehouse opened this issue 4 years ago • 0 comments

Is your feature request related to a problem? Please describe.

Autotag builds regexes from Tag names, Tag aliases, Studio names, Studio aliases, and Performer names. We currently do only the most rudimentary matching, which leads to problems. With two performers, "Jane" and "Jane Doe", anything that matches "Jane Doe" also matches "Jane", so both performers are matched. There are many complaints of 'single name performers' causing problems. Tag aliases have the same issue: nothing currently prevents several different tags from matching the same string.
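To make the overlap concrete, here is a minimal Python sketch. The pattern shape in `name_pattern` is an assumption made for illustration (words joined by optional separators, case-insensitive, bounded by non-alphanumeric characters) and is not taken from Autotag's actual regex builder; `name_pattern` and `matching_performers` are hypothetical names.

```python
import re


def name_pattern(name):
    """Build an illustrative regex for a name (assumed shape, not Autotag's)."""
    words = [re.escape(word) for word in name.split()]
    # Words may be separated by whitespace, dots, underscores or hyphens.
    body = r"[\s._-]*".join(words)
    # Require a non-alphanumeric character (or string edge) on both sides.
    return re.compile(r"(?:^|[^a-z0-9])" + body + r"(?:$|[^a-z0-9])", re.IGNORECASE)


def matching_performers(path, performers):
    """Return every performer whose pattern is found somewhere in the path."""
    return [p for p in performers if name_pattern(p).search(path)]


performers = ["Jane", "Jane Doe", "John"]
print(matching_performers("/videos/Jane.Doe.scene1.mp4", performers))
```

Under these assumptions, the path above matches both "Jane" and "Jane Doe", which is the ambiguity this issue is about.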

⚠️ This is a blocker for a conversion of Performer aliases from freeform comma separated text to auto-taggable individual entries.

Describe the solution you'd like

We create regexes with a small subset of regex features, and they are generated deterministically and predictably. When creating a tag name, tag alias, performer name, studio name, or studio alias, we should check whether the regex generated for it describes a regular language that is a subset or superset of the language described by the regex generated for any existing token. This is called the "regex inclusion problem" in the literature, although implementations are few and far between.
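As a rough illustration of what such a check could look like, the Python sketch below handles a deliberately simplified model in which each token matches any string that contains it as a plain, case-insensitive substring. That model is an assumption and ignores separators and word boundaries. It builds one DFA per token (a KMP-style automaton) and decides inclusion with the standard product construction: L(A) is a subset of L(B) exactly when no reachable pair of states has A accepting and B not accepting. The names `substring_dfa`, `is_subset`, and `relation` are hypothetical, and this is neither greenery's implementation nor the algorithm from the linked paper.

```python
from collections import deque

OTHER = "\x00"  # stand-in for "any character not used by either name" (assumed absent from names)


def substring_dfa(word, alphabet):
    """Return (transitions, accepting_state) for 'strings containing word'."""
    n = len(word)
    delta = [dict() for _ in range(n + 1)]
    for state in range(n + 1):
        for ch in alphabet:
            if state == n:
                delta[state][ch] = n  # accepting state is absorbing
                continue
            # Next state = longest prefix of word that is a suffix of the input so far.
            candidate = word[:state] + ch
            k = len(candidate)
            while k > 0 and not candidate.endswith(word[:k]):
                k -= 1
            delta[state][ch] = k
    return delta, n


def is_subset(a, b, alphabet):
    """True if every string accepted by DFA a is also accepted by DFA b."""
    (delta_a, accept_a), (delta_b, accept_b) = a, b
    seen = {(0, 0)}
    queue = deque(seen)
    while queue:
        state_a, state_b = queue.popleft()
        if state_a == accept_a and state_b != accept_b:
            return False  # some string is matched by a but not by b
        for ch in alphabet:
            nxt = (delta_a[state_a][ch], delta_b[state_b][ch])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def relation(name_a, name_b):
    """Classify name_a's match set relative to name_b's, or None if unrelated."""
    a, b = name_a.lower(), name_b.lower()
    alphabet = sorted(set(a + b) | {OTHER})
    dfa_a, dfa_b = substring_dfa(a, alphabet), substring_dfa(b, alphabet)
    a_in_b = is_subset(dfa_a, dfa_b, alphabet)
    b_in_a = is_subset(dfa_b, dfa_a, alphabet)
    if a_in_b and b_in_a:
        return "equivalent"
    if a_in_b:
        return "subset"
    if b_in_a:
        return "superset"
    return None


print(relation("Jane Doe", "Jane"))
print(relation("Jane", "John"))
```

A creation-time validator could run `relation` for the new token against each existing one and reject or warn on anything other than `None`. The real generated regexes would need a proper regex-to-automaton step in place of `substring_dfa`.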

Regex inclusion will probably require implementing this as a library. Some resources:

https://stackoverflow.com/questions/18729015/determining-whether-a-regex-is-a-subset-of-another
https://stackoverflow.com/questions/6363397/how-to-tell-if-one-regular-expression-matches-a-subset-of-another-regular-expres
https://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another

An implementation in Python: https://github.com/qntm/greenery (see isSubset() or isEquivalent())

An academic paper presenting an algorithm in PTIME: https://www.duo.uio.no/bitstream/handle/10852/9053/reinclusionJCSS.pdf

Dec 20 '21 10:12 kermieisinthehouse