intermodal icon indicating copy to clipboard operation
intermodal copied to clipboard

Parallel Hashing

Open mr-bo-jangles opened this issue 3 years ago • 3 comments

https://github.com/casey/intermodal/blob/2346c30fec91633444accb83837a132379e0ea00/src/hasher.rs#L69

This may be something you've already ruled out, however I wanted to suggest it just in case

mr-bo-jangles avatar Jun 29 '21 00:06 mr-bo-jangles

It's actually slightly more complex, since the hasher will produce a different output depending on the order that files are passed to it.

In a torrent file, info.pieces contains the bytes of the SHA hashes of the contents of the files, and often multiple files will contribute to a single hash.

Consider a torrent with the following files:

a: "xyz",
b: "123",

If the piece size is 2, info.pieces will contain 3 hashes. The 3 hashes will be hash("yx"), hash("z1"), and hash("23").

In imdl's implementation Hasher::hash_file is called with the path to a, which will then add hash("xy") to the in-progress info.pieces, then the next call to Hasher::hash_file must be passed the path to b, so that it gets the first byte of b, in this case 1, so that it can push hash("z1") into info.pieces. So since these calls are order-sensitive, it can't be parallelized with rayon without some additional refactoring.

Thanks for the suggestion though! I wish it were that simple T_T

casey avatar Jun 30 '21 00:06 casey

Then could you use the chunked iterator to split and feed the hasher piece sized portions in parallel somewhere around line 101 with rayon returning the hashes with an index as it finishes?

On Wed, 30 Jun 2021, 01:24 Casey Rodarmor, @.***> wrote:

It's actually slightly more complex, since the hasher will produce a different output depending on the order that files are passed to it.

In a torrent file, info.pieces contains the bytes of the SHA hashes of the contents of the files, and often multiple files will contribute to a single hash.

Consider a torrent with the following files:

a: "xyz", b: "123",

If the piece size is 2, info.pieces will contain 3 hashes. The 3 hashes will be hash("yx"), hash("z1"), and hash("23").

In imdl's implementation Hasher::hash_file is called with the path to a, which will then add hash("xy") to the in-progress info.pieces, then the next call to Hasher::hash_file must be passed the path to b, so that it gets the first byte of b, in this case 1, so that it can push hash("z1") into info.pieces. So since these calls are order-sensitive, it can't be parallelized with rayon without some additional refactoring.

Thanks for the suggestion though! I wish it were that simple T_T

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/casey/intermodal/issues/495#issuecomment-871006730, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACHF4UK4MKMS746F23V3TLTVJP4LANCNFSM47PBKBMA .

mr-bo-jangles avatar Jun 30 '21 15:06 mr-bo-jangles

I suppose that there would have to be an iterator over file bytes first, and that iterator would need to be parallelized. One question is whether the current hashing algorithm is I/O or CPU bound, since that would suggest whether parallelizing reads or hashing should be the priority.

This is discussed a bit in #26, but I think this issue is useful for tracking parallelization of hashing.

casey avatar Jul 03 '21 02:07 casey