Revisit FileBasedSignatureProvider for a possible performance optimization
@apoorvedave1 mentioned:
"could you also hide this feature behind a flag which defaults to false? My concern is I think this would be unnecessary calculation for regular scenarios. In any case, if signature doesn't match, RuleUtils will make sure that if index can be used, it will be used right? I am just wondering why bother O(NlogN) for sorting always for a rare scenario which will still work without this change(assuming hybrid scan is enabled eventually by default)?
Alternately if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to set and then create fingerprint. Set will ensure the order of iteration is unique for a unique collection of elements. That way we can still achieve O(N)."
Originally posted by @apoorvedave1 in https://github.com/microsoft/hyperspace/issues/268#issuecomment-732451580
Related thread: https://github.com/microsoft/hyperspace/pull/268#discussion_r529071399
I think we cannot guarantee the order of a HashSet iterator. It can differ depending on the internal implementation (e.g., different hash bucket sizes, hash functions, etc.).
Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add the signatures? I'm asking because I don't get why the "false positives" mentioned in #268 are problematic. After all, you must check whether two objects are really equal, even when their hash codes are equal.
Also, is there a specific reason we use String instead of byte[] and use md5Hex instead of md5 for signatures?
> Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add the signatures?

@clee704 Since currently we don't check the objects after that, to avoid the comparison overhead during the optimize phase (but for Hybrid Scan, we do compare all FileInfos).
> Also, is there a specific reason we use String instead of byte[] and md5Hex instead of md5 for signatures?

Because the hash lib (org.apache.commons.codec.digest.DigestUtils) takes & returns String.
> @clee704 Since currently we don't check the objects after that, to avoid the comparison overhead during the optimize phase (but for Hybrid Scan, we do compare all FileInfos).
Okay. So it's like how SHA-1 hashes are used in Git to identify objects. But if we accept the 2^-128 (or however many bits we use) probability of collision, then I believe XOR or addition should also be sufficient to combine hashes of unordered elements. Assuming a random distribution and 128-bit hashes, the probability that A_1 + A_2 + ... + A_k == B_1 + B_2 + ... + B_k (mod 2^128), where the A_i and B_i are hash values, is the same as the probability that A == B (mod 2^128), which is 2^-128, regardless of k.
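To illustrate the idea, here is a minimal Scala sketch of an order-independent fingerprint, assuming a hypothetical FileInfo(name, size, modifiedTime) shape rather than Hyperspace's actual FileBasedSignatureProvider:

```scala
import org.apache.commons.codec.digest.DigestUtils

// Hypothetical stand-in for illustration, not Hyperspace's actual class.
case class FileInfo(name: String, size: Long, modifiedTime: Long)

def fingerprint(fileInfos: Seq[FileInfo]): Array[Byte] =
  fileInfos
    .map(f => DigestUtils.md5(s"${f.name}:${f.size}:${f.modifiedTime}"))
    .foldLeft(new Array[Byte](16)) { (acc, h) =>
      // XOR is commutative and associative, so the combined digest does
      // not depend on iteration order; no O(N log N) sort is needed.
      acc.zip(h).map { case (a, b) => (a ^ b).toByte }
    }
```

One caveat: with XOR, two identical entries cancel out (h XOR h = 0), so addition modulo 2^128 may be the safer combiner if the same FileInfo could ever appear twice in the collection.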
> Because the hash lib (org.apache.commons.codec.digest.DigestUtils) takes & returns String.
DigestUtils has functions returning both byte[] and String. byte[] is the better option for internal handling, for performance, and we can encode/decode the value to/from hex whenever needed (e.g., for humans or JSON).
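For illustration, keeping byte[] internally and converting only at the JSON boundary might look like the following sketch with Commons Codec (the library already mentioned above); this is an assumed usage, not Hyperspace's actual code:

```scala
import org.apache.commons.codec.binary.Hex
import org.apache.commons.codec.digest.DigestUtils

val digest: Array[Byte] = DigestUtils.md5("file contents") // 16 raw bytes

// Hex-encode only when writing JSON or showing the value to a human.
val hex: String = Hex.encodeHexString(digest)

// Decode back to raw bytes when reading the JSON log.
// (decodeHex(char[]) works across Commons Codec versions.)
val decoded: Array[Byte] = Hex.decodeHex(hex.toCharArray)
assert(java.util.Arrays.equals(digest, decoded))
```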
@clee704 Yeah, I'm good with the XOR approach & using byte[] (be aware of its format in JSON).
cc @imback82
Yeah, we should do proper hex encoding/decoding when JSON is involved.
By the way, I think we should also avoid being locked into MD5 and should make it easy to change the underlying hashing algorithm (SHA-1, SHA-256, etc.) in the future.
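One possible way to avoid that lock-in, sketched below with java.security.MessageDigest (a hypothetical illustration, not Hyperspace's actual API):

```scala
import java.security.MessageDigest

// The algorithm name (e.g. "MD5", "SHA-1", "SHA-256") could come from a
// config entry and be recorded in the index metadata, so an index can be
// verified with the same algorithm it was built with.
def signature(data: Array[Byte], algorithm: String = "MD5"): Array[Byte] =
  MessageDigest.getInstance(algorithm).digest(data)
```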
By the way, changing the signature will break backward compatibility, so it would be good to ship it along with other breaking changes.
I assume that the signatures are stored as JSON in users' data lakes. Therefore, any change to the signature computation will make existing signatures stop working (even if we don't change the hash function). Am I right? Then how could you merge PR #268? @sezruby
Do we provide some migration utility to update their index data?
@clee704 A migration utility is a great idea! Can you please open a feature request and link it here so we can keep track of it? While it may not be necessary for the current version, I think as we move towards stabilization, it'd be useful to provide one for users so we don't always have to worry about breaking changes.
@rapoth The release notes state that whenever there are breaking changes, indexes should be "reconstructed". Does that mean users should manually create their indexes again? Because refreshAction won't work: either it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g., if only the signature computation logic changed). Right? So that could be why we might need a migration utility?
> Does it mean that users should manually create indexes again?

Yes.

> Because refreshAction won't work because it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g., if only the signature computation logic changed). Right?

Right. Usually it fails to create the IndexLogEntry, but in that case we could reuse the previous index data. I think we could catch the exception from building the IndexLogEntry and trigger the utility to create a new log file.
Hmm, okay. Now I see a need for a migration utility.
> Right. Usually it fails to create the IndexLogEntry, but in that case we could reuse the previous index data. I think we could catch the exception from building the IndexLogEntry and trigger the utility to create a new log file.
Blindly trying to create a log from JSON and catching an exception doesn't seem like a good approach, because we can't be sure that the constructed log is valid. For example, if only the signature computation logic changed between Hyperspace versions, the log will be created just fine, but with invalid signatures.
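One way to make such incompatibilities explicit, as a hypothetical sketch (the version field and names are illustrative, not Hyperspace's actual schema), is to store a format version in the log entry and check it rather than relying on parse failures:

```scala
// Hypothetical sketch: an explicit format version stored with each log
// entry; an entry can parse successfully yet still carry signatures
// computed by an older, incompatible algorithm.
case class IndexLogEntry(version: Int, signature: String)

val CurrentLogVersion = 2 // bumped on any breaking signature/format change

def validate(entry: IndexLogEntry): Either[String, IndexLogEntry] =
  if (entry.version == CurrentLogVersion) Right(entry)
  else Left(s"Log version ${entry.version} requires migration to version $CurrentLogVersion.")
```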