
(Menu) Thumbnail Fuzzy-matching Filenames

Open i30817 opened this issue 3 years ago • 19 comments

First and foremost consider this:

  • Only RetroArch bugs should be filed here. Not core bugs or game bugs
  • This is not a forum or a help section, this is strictly developer oriented

Description

This RFE comes from the realization that although there is often 'a' thumbnail of a game in the respective database for the playlist/game entry, it often isn't under 'the' exact right name, especially if you make your own names with the manual scanner or use an unusual set. Thus it would be good to add a fuzzy way to search for thumbnails to download. This isn't actually that error prone because, as mentioned, 95% of the time there is 'a' thumbnail for the game, just not under the exact name. And you can maximize the chances of a match by 1) lower-casing both strings, 2) removing useless 'metadata extras' like '()[],' and replacing '_' by space in both strings, 3) using one of the more sophisticated fuzzy methods like 'Token Set Ratio' and setting an arbitrarily high match ratio, like 90%, for success.
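The three steps above can be sketched roughly as follows. This is only an illustration using the Python standard library: `normalize` and `token_set_ratio` are hypothetical helper names, and the scorer is a simplified approximation of the 'Token Set Ratio' idea built on `difflib`, not the fuzzywuzzy implementation itself.

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Drop a trailing .png, turn metadata punctuation and '_' into spaces, lower-case."""
    name = re.sub(r'\.png$', '', name, flags=re.IGNORECASE)
    name = re.sub(r'[_()\[\],]', ' ', name)
    return re.sub(r' +', ' ', name).strip().lower()

def ratio(a: str, b: str) -> int:
    """Plain similarity of two strings on a 0..100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_ratio(a: str, b: str) -> int:
    """Simplified token-set comparison: shared tokens vs. shared tokens plus leftovers."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    common = ' '.join(sorted(ta & tb))
    combined_a = (common + ' ' + ' '.join(sorted(ta - tb))).strip()
    combined_b = (common + ' ' + ' '.join(sorted(tb - ta))).strip()
    return max(ratio(common, combined_a),
               ratio(common, combined_b),
               ratio(combined_a, combined_b))
```

A caller would accept a candidate only when `token_set_ratio(playlist_name, thumbnail_name)` reaches the chosen threshold, for instance 90.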

You can actually use the github api to get the filenames without altering libretro-database to add a 'has a thumbnail' field, if you either keep the commit hash of the libretro-thumbnails subproject's master at the time retroarch was built or have git available. You just need the 'master' commit hash of the subproject ( git ls-remote https://github.com/libretro-thumbnails/ScummVM refs/heads/master for ScummVM for instance)

then add it to the api like this to get a json of the 3 top dirs of the thumbnail scheme: https://api.github.com/repos/libretro-thumbnails/ScummVM/git/trees/e517a20d3b275adceaa0fc71d3f54d37a946e995

then follow one of the urls of the dirs to get the images.
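As a rough sketch of those steps, the helpers below build the tree URL from a commit hash and pick the sub-directory URLs out of a tree response. The function names are hypothetical; the field names ('tree', 'path', 'type', 'url') are those of a GitHub git trees API response. No network request is made here, the JSON text is assumed to have been fetched already.

```python
import json

API_ROOT = "https://api.github.com/repos/libretro-thumbnails"

def tree_url(system: str, sha: str) -> str:
    """URL of the git tree for one thumbnail subproject at a given commit hash."""
    return f"{API_ROOT}/{system}/git/trees/{sha}"

def subdirectory_urls(tree_json: str) -> dict:
    """Map each directory entry of a tree response to the URL of its own tree."""
    data = json.loads(tree_json)
    return {entry["path"]: entry["url"]
            for entry in data.get("tree", [])
            if entry.get("type") == "tree"}
```

Following each returned URL in turn yields the file entries (type 'blob') of that directory, whose 'path' values are the thumbnail filenames.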

Besides that problem, there is another: this feature requires a new dependency. Given RetroArch's requirement to support as many ports as possible, the options are probably either the C++ port of fuzzywuzzy or porting the function itself to C, which i'm not sure is a good idea because the C++ library would probably work with UTF-8 without further trouble.

Fact is this wouldn't be necessary if the database was 'perfect', but the database is not and will never be perfect because there is no automated method that places copies/symlinks/hardlinks of the files that are already there to all the names on the database for variations of the game name where the image is the same. Volunteers - which are few and far between - just add their games, then give up, then eventually rename the files themselves, which is just a symptom of how bad it is in the less popular platforms without ready made thumbnail databases to leech from already.

This would solve that as well as make the manual scanner much less of a second rate citizen for thumbnails at least as well as 'non standard sets' (like TOSEC).

@jdgleaver what do you think?

i30817 avatar Jan 16 '22 11:01 i30817

Another detail is that sometimes users will place manual scan results into playlists that don't have the name of the 'platform', for instance if you want to separate one of the engines from scummvm into its own playlist. This should not be a problem to the sophisticated user though, because they can always replace the per game entry 'db_name' field in the playlist by one of the original, so if that is used as a pointer for 'which thumbnail database to search for names' users can workaround that problem, if it can't be handled by the manual scanner putting in the right name from the start.

i30817 avatar Jan 16 '22 11:01 i30817

Would be nice, but I could see this slowing down some of the browsing. Would likely need to cache the thumbnail paths directly in the playlists... I'd prefer to have the thumbnails match the playlist names directly. Would love your help in updating the thumbnail file names to match the database!

add it to the api like this to get a json of the 3 top dirs of the thumbnail scheme

We won't be making queries to the GitHub API from RetroArch. The API is rate limited, we have our own server for the thumbs over at thumbnails.libretro.com .

RobLoach avatar Jan 16 '22 21:01 RobLoach

You misunderstand, the api would just be used to download the 'current names' once per run (or something like that). If it can be part of libretro-database (such as a field specifying that a game/game name has a corresponding image on the thumbnails server) and retroarch doesn't have to download anything, so much the better.

Even better is making sure that every database field has a corresponding image (if the game has that).

If you want to be 'quick and dirty' for that you can use the same idea 'just' by iterating over each db file and finding the highest match among the local thumbnail repository and putting those database filenames as thumbnails / symlinks to the highest match. Although i'm more interested in non-database names (thus the RFE here and not in the database) due to my recent experience of nearly all scummvm description names not triggering a thumbnail.

That said, it's maybe better to eyeball a log of the matches that fall below, say, 95% similarity, and to cut off at 80% similarity, to check that obvious errors don't get through.

If you want to do this, a simple python script can do it. Because fuzzywuzzy is on pypi you only need pip3 install fuzzywuzzy to get going. I'd try it but my internet is download limited and i can't afford to download the whole thumbnail database to test it on, although i suppose i can test it on the api, which wouldn't be a problem here.

i30817 avatar Jan 17 '22 03:01 i30817

How do you iterate over libretro-database instances again? I know there was supposed to be a standalone executable for doing this but i can't remember what it is, how to build it and how to use it.

(I'm doing a little python script to iterate over them and over the thumbnail files, using the github api so as not to download the subrepositories.) It creates the 'missing' relative symlinks, for names that currently don't exist, pointing to the files already in the libretro-thumbnails subproject with the highest similarity, and then it's a question of copy pasting the symlinks and committing (which i believe turns those symlinks into 'real files' in the github remote).
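A minimal sketch of the symlink step, assuming the thumbnails are in a local directory and using `difflib.get_close_matches` from the standard library in place of fuzzywuzzy. `link_missing` and its parameters are hypothetical names, not part of the script mentioned above.

```python
import os
from difflib import get_close_matches

def link_missing(thumb_dir, wanted_names, cutoff=0.9):
    """Create a relative symlink for each wanted name that has no file yet.

    Each link points at the most similar existing file in thumb_dir, but only
    when the similarity reaches `cutoff`. Returns {new_name: target_name}.
    """
    existing = sorted(os.listdir(thumb_dir))
    created = {}
    for name in wanted_names:
        if name in existing:
            continue  # an exact match is already there
        match = get_close_matches(name, existing, n=1, cutoff=cutoff)
        if match:
            # the target is just a filename, so the link is relative to thumb_dir
            os.symlink(match[0], os.path.join(thumb_dir, name))
            created[name] = match[0]
    return created
```

Returning the created mapping makes it easy to write the kind of log that can be eyeballed before committing anything.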

Doing this on the database isn't 'as' useful as doing it on retroarch itself since it doesn't take care of sets not on the database, but it could be useful to make the database more complete by itself which would make the experience better in the non-fuzzy case too, and possibly even increasing the match percentage on the fuzzy case too (more filenames variations per game to test, more chance that one will be closer to the filename of a nonstandard set).

i30817 avatar Jan 17 '22 04:01 i30817

I kind of got it (libretrodb_tool), but now i have a doubt.

I have all the thumbnail filenames on the github thumbnails subprojects, but there are more rdb files than there are github thumbnails subprojects.

For instance you have a rdb file for PSP PSX2PSP projects but you don't have a thumbnail subproject for them. What to do in this case, just ignore those or attempt to also add thumbnails for them by mapping those rdbs into 'one of' the sets ? (which might require hardcoding).

i30817 avatar Jan 17 '22 06:01 i30817

+1 for this feature.

mrmatteastwood avatar Jan 17 '22 08:01 mrmatteastwood

BTW there is a bug in libretrodb_tool. It always returns 1 (which by command-line convention is an error) even if it succeeded (because main was lazy and put in a goto error that is always run at the end).

This has consequences when checking the value (such as in the python subprocess module ). So no checking for the script i'm doing right now.
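For illustration, this is roughly what 'no checking' looks like with the subprocess module: the exit status is captured but never treated as a failure. `run_ignoring_exit_code` is a hypothetical helper, and the libretrodb_tool argument layout shown in the comment is an assumption.

```python
import subprocess

def run_ignoring_exit_code(cmd):
    """Run cmd and return (stdout, returncode) without raising on a non-zero exit."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout, result.returncode

# Assumed invocation, arguments not verified against the tool:
# output, _ = run_ignoring_exit_code(["libretrodb_tool", "some.rdb", "list"])
```

With check=True (or subprocess.check_output) the same call would raise CalledProcessError on every run, even a successful one.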

i30817 avatar Jan 17 '22 10:01 i30817

RetroArch does indeed need fuzzy thumbnail matching. It's something I've thought about since I added the playlist-based thumbnail downloading functionality. It is a highly complex feature to add.

A couple of things to note first of all:

  • The github API is irrelevant here. We get thumbnails from https://thumbnails.libretro.com/ - that is also where any file listings will come from
  • The RDB databases are irrelevant here. The whole point of this is name matching. It needs to handle all content, not just items in the database

So a thumbnail search for one playlist entry would go something like this:

  • Get thumbnail listings (box, title, screenshot) for the current system by querying https://thumbnails.libretro.com/. We can't grab this for every request, so it needs to be cached to file locally. There will need to be some criteria for how often this is regenerated - an expiry time of 24 hours or somesuch. We also need to handle the case of scanning a playlist where every item is associated to a different system (i.e. we need to balance the overheads of reading the cache from disk vs. keeping everything in memory and potentially exhausting available RAM)
  • Once we have the listings, find the nearest match for each of the box, title and screenshot. This is hard. You link to fuzzywuzzy, which looks useful (and seems small enough to convert easily to plain C) - but I think it would require a large amount of tuning (especially considering the metadata in typical game names, which is needed for determining the best region, but otherwise needs to be identified and handled/removed to avoid improperly weighting any string comparisons). The likelihood of false positives is also very high, so that needs to be dealt with - it might even require some menu-based user interaction to allow a choice in the event that a match is unclear
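The cache-expiry criterion from the first step could look something like the sketch below. It is written in Python only for brevity, the real implementation would be in C inside RetroArch, and `cache_is_fresh` and the 24 hour constant are illustrative assumptions.

```python
import os
import time

CACHE_MAX_AGE = 24 * 60 * 60  # 24 hours, in seconds

def cache_is_fresh(path, max_age=CACHE_MAX_AGE, now=None):
    """True if the cached listing exists and was written less than max_age seconds ago."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # no cache file yet: the listing has to be fetched
    now = time.time() if now is None else now
    return (now - mtime) < max_age
```

The listing for a system would be fetched from the server again only when this returns False.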

If we can get that working, then it's fine - but then there's also the very serious consideration of performance. This kind of fuzzy searching is very CPU intensive and time consuming (vs. the current implementation). It certainly wouldn't be practical for the on-demand thumbnail downloader; it could only be enabled for the batch playlist-based one. And then in that case, allowing the user to choose a best image is problematic - the current task system has no facility for that, so some low-level redesign would be required.

Anyway - I think this is a necessary feature, but it will involve a very large amount of work.

jdgleaver avatar Jan 17 '22 11:01 jdgleaver

Oh i was talking about the database because this kind of thing can be applied to the database first to get better results. The more the sets are complete in the database, the more confidence a fuzzy match would have at runtime, and part of it seems to me to make the database more complete. And it seems a good test for the algorithm.

Anyway i have a script now but i'm not getting very good results from token_set_ratio. The problem shows in the first few lines

100% Zox 2099 (Loriciels).png -> Zox 2099.png
100% Zona 0 (Topo Soft) (Spain)[cr The Spanish Hacker](Alt 1).png -> Hacker.png
100% Zona 0 (Topo Soft) (Spain)[cr The Spanish Hacker].png -> Hacker.png
100% Zombi (Ubisoft) (France)[464 Version].png -> Zombi.png
100% Zombi (Ubisoft) (France) (Disk 2 of 2)[6128 Version].png -> Zombi.png
100% Zombi (Ubisoft) (France) (Disk 2 of 2).png -> Zombi.png

Two 'obvious' false positives Zona 0 (Topo Soft) (Spain)[cr The Spanish Hacker](Alt 1).png -> Hacker.png

I'll try the other methods too, but the 'solution' may indeed be to remove with prejudice all the non-standard metadata, or to treat the 'pure name' as a single token (ie: 'Zona 0' wouldn't be divided).
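The 100% scores above follow from how a token-set comparison works: when every token of the candidate also appears in the source name, the shared tokens equal the whole candidate, so the comparison is perfect. A small standard-library sketch of that condition, with hypothetical helper names:

```python
import re

def tokens(name):
    """Lower-cased alphanumeric tokens of a name, as a set."""
    return set(re.findall(r'[a-z0-9]+', name.lower()))

def candidate_is_contained(candidate, source):
    """True when all tokens of candidate appear in source (the 100% case)."""
    return tokens(candidate) <= tokens(source)

# Both the good match and the false positive satisfy the same condition:
good = candidate_is_contained("Zombi.png", "Zombi (Ubisoft) (France).png")
bad = candidate_is_contained(
    "Hacker.png", "Zona 0 (Topo Soft) (Spain)[cr The Spanish Hacker].png")
```

Since both cases pass, token containment alone cannot separate them; something else, such as where in the name the match occurs, has to break the tie.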

i30817 avatar Jan 17 '22 11:01 i30817

Heh.

100% Turrican II - The Final Fight (Rainbow Arts) (Disk 2 of 2)[cr Genesis][t Genesis].png -> Final Fight.png
100% Turrican II - The Final Fight (Rainbow Arts) (Disk 2 of 2).png -> Final Fight.png
100% Turrican II - The Final Fight (Rainbow Arts) (Disk 1 of 2)[cr Genesis][t Genesis].png -> Final Fight.png
100% Turrican II - The Final Fight (Rainbow Arts) (Disk 1 of 2).png -> Final Fight.png

i30817 avatar Jan 17 '22 12:01 i30817

Yes, you very clearly highlight the difficulties here!

jdgleaver avatar Jan 17 '22 12:01 jdgleaver

edit: decided to open a bug report for this in libretro-thumbnails

i30817 avatar Jan 17 '22 14:01 i30817

@jdgleaver

I ended up with a scoring function which combines token_set_ratio and the longest common prefix, and that seems to give 'ok' results: on a corpus of 160 scummvm games, where the thumbnails are very incomplete and almost random in how the images that are there are named, it has about 3 false positives and 6 false negatives (or something like that). The script i use to transform scummvm.ini and download the thumbs is in diablodab's bug about the scummvm update, in the last post, but in short:

Fuzz preprocessor and scorer functions
	import os
	import re
	from fuzzywuzzy import fuzz, process
	
	def replacemany(our_str, to_be_replaced, replace_with):
		for nextchar in to_be_replaced:
			our_str = our_str.replace(nextchar, replace_with)
		return our_str
	
	#to maximize similarity:
	#replace roman numerals with numeric equivalents and 'The' with nothing
	#(note it does no harm to replace letters on both sides by different numbers)
	#remove 'Disc groups' which may cause disc numbers to be seen as part of the name
	#remove [] groups which have crack or dump metadata (redump doesnt use these but tosec does)
	#remove separators and _ to make the scorer have more tokens to identify
	#(also some remote thumbnails have space, some have _ instead of space)
	#turn into lower case and uniquify spaces
	def ffuzzthumbnail(t):
		e = re.compile(r'\([^)(]*(?:disk|disc)[^)(]*\)', re.IGNORECASE)
		t = re.sub(e, '', t)
		t = re.sub(r'\[[^]]*\]', '', t)
		#longer numerals go first, otherwise 'VIII' is consumed by the 'III' replacement
		t = t.replace('VIII', '8')
		t = t.replace('VII',  '7')
		t = t.replace('VI' ,  '6')
		t = t.replace('IV' ,  '4')
		t = t.replace('IX',   '9')
		t = t.replace('III',  '3')
		t = t.replace('II' ,  '2')
		t = t.replace('V'  ,  '5')
		t = t.replace('X',   '10')
		t = t.replace('I',     '1')
		t = t.replace(', The', '')
		t = t.replace('The ',  '')
		t = replacemany(t, '_()[]{},-', ' ')
		t = re.sub(' +', ' ', t).lower().strip()
		return t
	
	def myscorer(s1, s2, force_ascii=True, full_process=True):
		similarity = fuzz.token_set_ratio(s1,s2,force_ascii,full_process)
		#dont count spaces on the prefix heuristic
		st1 = s1.replace(' ','')
		st2 = s2.replace(' ','')
		#combine the token set ratio scorer with a common prefix heuristic
		#if however there is NO prefix (or a very short one) tank the similarity to zero.
		#This prevents false positives where there is a later part of the string that is 'very similar'
		#which token set ratio is prone to because it sets score to 100 if one string words are completely on the other
		#the sum makes 'longer' matches have more weight
		prefix = len(os.path.commonprefix([st1, st2]))
		if prefix <= 2:
			return 0
		else:
			return similarity + prefix
	
	#name and thumbs come from the rest of the script (the playlist entry name and the remote listings)
	remote_names = set()
	remote_names.update(thumbs.Named_Boxarts.keys(), thumbs.Named_Snaps.keys(), thumbs.Named_Titles.keys())
	thumbnail, i_max = process.extractOne(name, remote_names, processor=ffuzzthumbnail, scorer=myscorer)

The first function is an auxiliary, the second is the preprocessor, the third is the scorer. The transformation of roman numerals to numbers (which is still incomplete) is to make thumbnails with 'II' and '2' in the name more similar; removing the two forms of 'the' serves the same purpose. The removal of underscores and dividers is to let the tokenizer of token_set_ratio find more 'common things', like USA tags, without those being glued to other metadata which would confuse it. The scorer simply adds the length of the longest common prefix to the score (which can go over 100 like this, so it's not really a percentage anymore) as a sanity test (otherwise token_set_ratio prefers shorter candidates which are completely inside the source and assigns them 100). To do that i remove spaces, just to be sure they neither influence the score added nor end the common prefix earlier than it's supposed to.
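As a worked illustration of the prefix sanity test, here is the common-prefix step on its own, applied to already-normalized names from the earlier false positives. `common_prefix_length` is a hypothetical helper that mirrors the prefix computation inside the scorer.

```python
import os

def common_prefix_length(s1, s2):
    """Length of the shared prefix of two names, ignoring spaces."""
    st1 = s1.replace(' ', '')
    st2 = s2.replace(' ', '')
    return len(os.path.commonprefix([st1, st2]))

# 'final fight' sits entirely inside the Turrican name but shares no prefix with it,
# so a scorer that returns 0 for prefixes of 2 or fewer characters rejects it.
rejected = common_prefix_length("turrican 2 final fight rainbow arts", "final fight")
# 'zombi' is a genuine prefix of the longer name, so the match survives.
kept = common_prefix_length("zombi ubisoft france", "zombi")
```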

Then i use process.extractOne on the set of all 'possible name keys of a download' (of the 3 possible directories to do it all at once). Also 'fuzzywuzzy' is the old name of the project, it moved to 'thefuzz' and i hadn't noticed.

Of course easier said than done to do this kind of heavy string manipulation in C.

i30817 avatar Jan 20 '22 03:01 i30817

@i30817 Thank you. These are good results, and an excellent starting point for any libretro implementation.

jdgleaver avatar Jan 21 '22 10:01 jdgleaver

I decided to do a pip installable project just for this. Instead of all the ad-hoc methods i experimented with, this uses retroarch.cfg and the playlists to get names.

https://github.com/i30817/libretrofuzz

I also made it a bit more flexible and improved both the normalizer and the scorer, although i feel like the false positives that happen when the server doesn't have the game but has a sequel, prequel or very similarly named game are probably never going to go away.

i30817 avatar Feb 05 '22 05:02 i30817

This is excellent work. Thank you!

jdgleaver avatar Feb 06 '22 11:02 jdgleaver

Any progress / ETA on this!? I'm not getting any hits on my "region-less" rom filenames

davesauce14 avatar May 03 '22 19:05 davesauce14

Use my utility - which has changed to be a bit more reliable/less buggy meanwhile (with libretro-fuzz --no-meta), this isn't being worked on atm by jdgleaver afaik.

It has a very obvious weakness where if the server does not have the thumbnail, but there is a sequel or prequel or very similar game on the same server directory, it gets 'chosen' erroneously, but i think there is nothing to be done about that, except add the game thumbnails to the server.

i30817 avatar May 03 '22 23:05 i30817

btw, i updated that utility again, and when you install it you have two commands: libretro-fuzz, for single systems, and libretro-fuzzall, to download from all systems, as long as the playlist name follows the convention of being the name of a retroarch system and not a custom playlist name (with the fuzzing settings you chose obviously, which might not be the best for all platforms). I also fixed a bug which was giving false negatives when multiple values were 'very close' to a title, and added options to skip downloads or quit in the middle.

It's complex enough that i doubt a minimal version of the functionality in retroarch itself could be 'as effective' (although i'd still like it for accessibility and for the users that have no idea this even exists).

i30817 avatar Jul 19 '22 09:07 i30817