
Disc-hashing of non-PSX discs

Open adelikat opened this issue 6 years ago • 18 comments

Open a Saturn game and click the ROM status icon in the status bar: nothing happens.

This is also true of any core that takes a list of ROMs (some cores have both a single-ROM and a multi-ROM constructor, and it only happens for the multi-ROM one).

Either we need the client to take a stab at implementing RomAnnotationDetails for multi-ROMs, or each affected core needs to implement that logic.

adelikat avatar Dec 09 '17 20:12 adelikat

see #1152

Asnivor avatar Mar 22 '18 15:03 Asnivor

This now works as of a few months ago.

The last remaining issue is that disc-based 'ROMs' just show their hashes as 'N/A' (because that's what I did at the time).

Can we have a discussion on what we should be showing here in this instance?

  • cue hashes are pointless
  • bin/img/iso hashes can be expensive, I would imagine
  • IIRC PSX has some black-magic internal hash thing that @zeromus mentioned to me a while back. Should we think about using it?
  • is there such a thing as a 'known good dump' when it comes to disc images?
  • do we just not bother....

Asnivor avatar Aug 23 '18 16:08 Asnivor

cue is the same situation as bin/img/iso: it's the mounted disc which gets hashed. All of the disc images have a black-magic hashing method (although some, like PSX, are far blacker than others). I did generate a known-good list of PSX ROMs by running my blacker-magic hashing method on all of redump; it's in our gamedb.

We should be able to show a disc "hash" in the ROM status; these are useful for identification purposes. That's what the movie tools should be doing to confirm the "correct" game is probably being used. Of course, it can't be known 100% for sure unless the entire disc's contents are hashed. For now, a weaker identifying hash is all we can do.

For all discs we should be using (maybe we aren't, I don't know) a hash based on the TOC plus the first several sectors. The "first several sectors" part was chosen by vecna originally. I determined how many sectors were required to uniquely identify all the PSX games (part of the blacker magic), although it would be possible to trick it.
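
Roughly, that scheme looks like this (a sketch only, not BizHawk's actual DiscHasher code; `readToc` and `readSector2352` are hypothetical stand-ins for the disc API, and SHA1 stands in for the real hash):

```csharp
using System;
using System.Security.Cryptography;

// Sketch of the TOC-plus-leading-sectors ID hash described above.
// 26 was the sector count that uniquely identified the PSX corpus.
static string DiscIdHash(Func<byte[]> readToc,
                         Func<int, byte[]> readSector2352,
                         int leadingSectors = 26)
{
    using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);
    hasher.AppendData(readToc());                 // hash the serialized TOC entries
    for (int lba = 0; lba < leadingSectors; lba++)
        hasher.AppendData(readSector2352(lba));   // then the first N raw sectors
    return Convert.ToHexString(hasher.GetHashAndReset());
}
```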

A more robust identification would involve parsing the filesystem and hashing that, AND identifying the executable and hashing that too. The goal is to hash as much as we can, short of hashing everything, while keeping in mind that romhacks are likely to change only the EXE, and bad rips are likely to have different TOCs before anything else is wrong.
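
A sketch of what the filesystem part could look like, assuming a plain 2048-byte-sector ISO (a raw 2352-byte dump would need the per-sector headers skipped first); `FilesystemIdHash` is a hypothetical name, not anything in BizHawk:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static string FilesystemIdHash(string isoPath)
{
    const int SectorSize = 2048;
    using var fs = File.OpenRead(isoPath);
    using var hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);

    // ISO9660: the Primary Volume Descriptor lives at LBA 16
    var pvd = new byte[SectorSize];
    fs.Seek(16L * SectorSize, SeekOrigin.Begin);
    fs.ReadExactly(pvd); // .NET 7+
    hasher.AppendData(pvd);

    // The root directory record sits at offset 156 of the PVD; within it,
    // bytes 2-5 hold the extent LBA and bytes 10-13 the data length (little-endian)
    uint rootLba = BitConverter.ToUInt32(pvd, 156 + 2);
    uint rootLen = BitConverter.ToUInt32(pvd, 156 + 10);
    var rootDir = new byte[rootLen];
    fs.Seek((long)rootLba * SectorSize, SeekOrigin.Begin);
    fs.ReadExactly(rootDir);
    hasher.AppendData(rootDir);

    // A fuller version would also walk the directory records to find and hash
    // the boot executable (e.g. whatever SYSTEM.CNF points at on PSX)
    return Convert.ToHexString(hasher.GetHashAndReset());
}
```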

However, we're probably not changing any hashing methods now. I'm just telling you what my grand plan was.

At any rate, we DO have a disc-identifying hash. We should display that. How we indicate that it's a disc-identifying hash and not a complete verification hash, I don't know. I had thought we should offer to compute a complete verification hash, and people should be able to put that in movies; people who want to be sure can run the verification hash on their media to check. Without that, only the cursory identification hash would be used.

zeromus avatar Aug 23 '18 21:08 zeromus

@zeromus

So should we be using DiscHasher.Calculate_PSX_BizIDHash() for non-PSX discs as well? Or DiscHasher.OldHash()?

Asnivor avatar Sep 11 '18 08:09 Asnivor

The right way to do this is to repeat the analysis I did for the PSX: use the PSX hash but adjust the 26 as needed while running it on the whole Saturn corpus, until an appropriate value is found (or just use 26 again if that works, so that we can use the same hash for both systems). At that point the gamedb also needs to be built. I already know how to do that, so you can either assign this to me in order to make me do it, or tackle it yourself.

note: DBMan's DiscHash.cs seems to be what I used to perform this analysis. I think it would be nice if we can improve that tool to formalize this process, so that the analysis can be performed again on an updated corpus. Although keep in mind that if the hashes ever change, people's movies would contain obsolete disc ID hashes. This isn't the end of the world, though.

Also, if we DO ever change the hashing approach, we should go ahead and try the filesystem-and-EXE thing. In fact, since I just thought of the 'obsoleting hashes in movies' argument, that may be justification for doing the work now, for Saturn. I think this cdfs-and-exe thing may suit your interests well, but it's up to you.

zeromus avatar Sep 11 '18 18:09 zeromus

I think I'll have to push it to you for now. The 'whole SS corpus' thing is where I will struggle at the moment.

Asnivor avatar Sep 13 '18 10:09 Asnivor

Since https://github.com/TASEmulators/BizHawk/commit/32e8afcedca2820af301dbed0eafe29b45e532eb, it's no longer super expensive to hash an entire disc; generally it should take less than a second. We could probably just do full-disc hashes for every disc and scrap the identifying hashes.
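
What "just hash the whole disc" could look like (a sketch, not the code from that commit; `readSector2352` is a hypothetical stand-in for the disc API):

```csharp
using System;
using System.Security.Cryptography;

// Stream every raw sector through an incremental hasher
static string FullDiscHash(Func<int, byte[]> readSector2352, int sectorCount)
{
    using var sha1 = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);
    for (int lba = 0; lba < sectorCount; lba++)
        sha1.AppendData(readSector2352(lba));
    return Convert.ToHexString(sha1.GetHashAndReset());
}
```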

CasualPokePlayer avatar Aug 28 '22 06:08 CasualPokePlayer

Keep the one which matches redump.org still

YoshiRulz avatar Aug 28 '22 07:08 YoshiRulz

Just reading 700MB should take more than 300ms, so I'm skeptical of the analysis. You need to run it on "cold" discs, or do an analysis that incorporates the reading time in the first place. The scenario of waiting seconds while a game is opened, just to make 100% sure it's something our database has, is not something casual users care about; only TASers and such do. Maybe it's as simple as having an option to do complete verification of the discs at load time. Although I am still in favor of doing it in the background somehow and offering it as an advisory info-tip: "! complete disc verification failed !"

zeromus avatar Aug 28 '22 07:08 zeromus

To clarify, around 270-275 ms of that 369 ms is spent actually reading the disc, from what I measured. And the old CRC32 algo itself took over 4 seconds. What do you mean exactly by "cold" discs?

CasualPokePlayer avatar Aug 28 '22 07:08 CasualPokePlayer

As for doing it in the background, I assume that's possible (with a separate thread given a new Disc object for that disc). (EDIT: No we can't. Probably. The hash is used for obtaining the entry in the gamedb, so that's blocking.) Even so, we could still parallelize the reading and hashing, and that should offer a bit of a performance boost (although I'm not so sure how worthwhile that is).

Also, due to some sloppy code, ROMs are guaranteed to be hashed at least 3 times with SHA1 and at least once with MD5 (possibly additional times for both in some cases). This in itself is a bit annoying if you have a rather large ROM file. A worst case can be seen with Pokemon Black/White 2, which is 512 MiB, and that takes 1.3 seconds to hash each time. Granted, booting the core up takes much more time anyway, partially due to waterboxing and probably other sloppy code elsewhere. We should probably do something about that too.
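
The obvious fix for the repeated hashing, sketched (`HashRomOnce` is a hypothetical helper, not existing BizHawk code): read the file once and feed every algorithm from the same buffer.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static (string Sha1, string Md5) HashRomOnce(string romPath)
{
    using var sha1 = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);
    using var md5 = IncrementalHash.CreateHash(HashAlgorithmName.MD5);
    using var fs = File.OpenRead(romPath);

    var buffer = new byte[1 << 20]; // 1 MiB chunks
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        // one read pass feeds both hashers
        sha1.AppendData(buffer.AsSpan(0, read));
        md5.AppendData(buffer.AsSpan(0, read));
    }
    return (Convert.ToHexString(sha1.GetHashAndReset()),
            Convert.ToHexString(md5.GetHashAndReset()));
}
```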

> Keep the one which matches redump.org still

Those are the full disc hashes.

CasualPokePlayer avatar Aug 28 '22 07:08 CasualPokePlayer

"cold discs" are ones you didn't load a few seconds ago on another test. it must be completely out of the filesystem caches, even the disk caches. maybe you should read as many gigs of garbage data as you have ram to make sure it's all completely gone... actually there is probably a win32 api call to invalidate the whole disk cache? or maybe just the cache for a certain filename? yeah that would be a better way to do it

zeromus avatar Aug 28 '22 09:08 zeromus

Seems with a "cold disc" it's around 1811 ms for 700MB. Afterwards it goes back down to around the numbers I saw.

Still probably not so bad (especially as that's closer to a worst case here; cold reads of smaller games take much less time). I'll see how well my threading idea works and whether it makes a significant difference.

CasualPokePlayer avatar Aug 29 '22 00:08 CasualPokePlayer

Threading the reads is not likely to help; in fact it's likely to hurt, by making the OS and disk try their hardest to serve multiple users without realizing they're actually one user pretending to be 3 users. 3 users in parallel can't be serviced as quickly as 3 users in serial; you only take the added overhead of going parallel to increase responsiveness... of unrelated tasks. Here, where it's all the same task, it's nuts, but good luck. You might get some overall improvement by threading the reading and the hashing separately, though (read chunks in one thread and hand them off to another thread for hashing). But not a whole lot.
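
A sketch of that read-thread/hash-thread pipeline (`readSector2352` is again a hypothetical stand-in for the disc API; note the per-sector byte[] allocations, which are exactly the kind of managed-memory pressure reported below):

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Threading.Tasks;

static string PipelinedDiscHash(Func<int, byte[]> readSector2352, int sectorCount)
{
    // bounded queue keeps the reader at most 64 sectors ahead of the hasher
    using var queue = new BlockingCollection<byte[]>(boundedCapacity: 64);

    var reader = Task.Run(() =>
    {
        try
        {
            for (int lba = 0; lba < sectorCount; lba++)
                queue.Add(readSector2352(lba));
        }
        finally
        {
            queue.CompleteAdding(); // unblock the consumer even on failure
        }
    });

    using var sha1 = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);
    foreach (var sector in queue.GetConsumingEnumerable())
        sha1.AppendData(sector);

    reader.Wait(); // propagate any reader exception
    return Convert.ToHexString(sha1.GetHashAndReset());
}
```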

zeromus avatar Aug 29 '22 00:08 zeromus

Yes, I meant threading the reading and the hashing, as those are in theory parallelizable. Probably not much of an improvement, though.

CasualPokePlayer avatar Aug 29 '22 00:08 CasualPokePlayer

Apparently not even that helps; it just makes hashing take double the time (or worse, depending on how I do it). I suspect it's due to far more managed memory being allocated, which hurts performance more than the threads can help.

CasualPokePlayer avatar Aug 29 '22 01:08 CasualPokePlayer

Doesn't the disc need to be parsed before it's hashed? Or is that the other one? The point is, the managed parsing code is going to be the bottleneck.

YoshiRulz avatar Aug 29 '22 01:08 YoshiRulz

Disc "parsing" is not time-consuming: all the data is in a "header" (i.e. a cue file, or the actual header of a CCD file), and it's limited to 99 tracks. The big binary blob part/file requires no further analysis; the format is designed for random access once the header is parsed.
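
For reference, a minimal cue sheet (hypothetical filename and timestamps) shows how little there is to parse; everything else is just offsets into the binary blob:

```
FILE "game.bin" BINARY
  TRACK 01 MODE2/2352
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    INDEX 00 58:20:00
    INDEX 01 58:22:00
```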

zeromus avatar Aug 29 '22 01:08 zeromus