Name matching mode between EXACT and CLOSEST MATCHING
Can we get a matching mode that's somewhere in between EXACT and CLOSEST MATCHING?
Rationale is as follows:
- The comparison is between file system folder names, with their usual caveats, and names from web sources.
- Sources can use any symbols.
- File systems cannot, and even for some symbols they can handle, it's usually not a good idea to use them.
Example:
The : character is invalid as part of a folder name on common file systems (Windows, for example), but is often part of a series name.
This means Tanaka: Nenrei Equal Kanojo Inaireki no Mahoutsukai will never match under EXACT because of the :.
Some people will have it on their disk as Tanaka; Nenrei Equal Kanojo Inaireki no Mahoutsukai, while others will have it as Tanaka Nenrei Equal Kanojo Inaireki no Mahoutsukai, and yet others may have it as Tanaka - Nenrei Equal Kanojo Inaireki no Mahoutsukai.
Closest match is undesirable due to the false positives, whereas exact is not "loose enough" due to the above constraints.
So what I propose is a third matching mode that sits in between.
It would work as follows:
- scan the strings on both sides and discard characters in the following ranges (i.e. discard all ASCII control and special characters except 'space'):
  0x00-0x1f, 0x21-0x2f, 0x3a-0x40, 0x5b-0x60, 0x7b-0x7e
- reduce any consecutive spaces to a single space AFTER the discard (or even discard spaces?).
- perform an exact string match on the resulting string.
- maybe the ignore set can be configurable by the user. The default behaviour would discard everything specified above, but the user could configure a smaller or larger discard set. For example, the ASCII tilde character (~) is covered by the ranges above, but there is also a tilde in the extended Unicode set that's longer and looks slightly different. The default set would not catch the Unicode long tilde, but if the user adds it to the ignore set, that case would be handled.
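
To make the proposal concrete, here is a rough Python sketch of how the steps above could work. It is only an illustration, not how the tool does or should implement it: the names `build_discard_set`, `normalize` and `loose_match` are made up for this example, trimming leading/trailing spaces is my own assumption, and I picked U+FF5E (fullwidth tilde) as a stand-in for the "long tilde" since I'm not sure which code point sources actually use.

```python
import re

# Default discard ranges from the proposal (inclusive code point ranges):
# ASCII control characters and punctuation, but not space, digits or letters.
DEFAULT_DISCARD_RANGES = [
    (0x00, 0x1F),
    (0x21, 0x2F),
    (0x3A, 0x40),
    (0x5B, 0x60),
    (0x7B, 0x7E),
]


def build_discard_set(ranges=DEFAULT_DISCARD_RANGES, extra=""):
    """Build the set of characters to ignore.

    `extra` is a hypothetical user setting holding additional characters
    (e.g. a Unicode tilde) to discard on top of the default ranges.
    """
    chars = {chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1)}
    chars.update(extra)
    return frozenset(chars)


DEFAULT_DISCARD = build_discard_set()


def normalize(name, discard=DEFAULT_DISCARD):
    """Drop discarded characters, then collapse runs of spaces."""
    kept = "".join(ch for ch in name if ch not in discard)
    # Collapse AFTER the discard, so "A - B" becomes "A B" and not "A  B".
    # Trimming the ends is an assumption that is not part of the proposal.
    return re.sub(r" {2,}", " ", kept).strip()


def loose_match(a, b, discard=DEFAULT_DISCARD):
    """Exact (case-sensitive) comparison of the normalized strings."""
    return normalize(a, discard) == normalize(b, discard)


if __name__ == "__main__":
    source = "Tanaka: Nenrei Equal Kanojo Inaireki no Mahoutsukai"
    folders = [
        "Tanaka; Nenrei Equal Kanojo Inaireki no Mahoutsukai",
        "Tanaka Nenrei Equal Kanojo Inaireki no Mahoutsukai",
        "Tanaka - Nenrei Equal Kanojo Inaireki no Mahoutsukai",
    ]
    for folder in folders:
        print(folder, "->", loose_match(source, folder))

    # The fullwidth tilde only matches once the user adds it to the set.
    wider = build_discard_set(extra="\uff5e")
    print(loose_match("Foo ~Bar~", "Foo \uff5eBar\uff5e"))
    print(loose_match("Foo ~Bar~", "Foo \uff5eBar\uff5e", wider))
```

With this, all three folder variants compare equal to the source title, while unrelated titles still fail because the final step is a plain string equality check.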