Revisit FileBasedSignatureProvider for a possible performance optimization
@apoorvedave1 mentioned:
"could you also hide this feature behind a flag which defaults to false? My concern is I think this would be unnecessary calculation for regular scenarios. In any case, if signature doesn't match, RuleUtils will make sure that if index can be used, it will be used right? I am just wondering why bother O(NlogN) for sorting always for a rare scenario which will still work without this change(assuming hybrid scan is enabled eventually by default)?
Alternately if we still want to fix this issue, could you avoid sorting? Instead you could convert fileInfos to set and then create fingerprint. Set will ensure the order of iteration is unique for a unique collection of elements. That way we can still achieve O(N)."
Originally posted by @apoorvedave1 in https://github.com/microsoft/hyperspace/issues/268#issuecomment-732451580
Related thread: https://github.com/microsoft/hyperspace/pull/268#discussion_r529071399
I think we cannot guarantee the order of a HashSet iterator. It can differ depending on the internal implementation (e.g., different hash bucket sizes, hash functions, etc.).
Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add the signatures? I'm asking because I don't get why the "false positives" mentioned in #268 are problematic. After all, you must check whether two objects are really equal, even when their hash codes are equal.
Also, is there a specific reason we use String instead of byte[] and use md5Hex instead of md5 for signatures?
> Maybe I misunderstood the issue, but why is the order important? Why can't we simply XOR or even just add the signatures?

@clee704 Since currently we don't check the objects after that, to avoid the comparison overhead during the optimize phase (but for Hybrid Scan, we do compare all FileInfos).
> Also, is there a specific reason we use String instead of byte[] and md5Hex instead of md5 for signatures?

Because the hash lib (org.apache.commons.codec.digest.DigestUtils) takes & returns String.
> @clee704 Since currently we don't check the objects after that, to avoid the comparison overhead during the optimize phase (but for Hybrid Scan, we do compare all FileInfos).
Okay. So it's like how SHA-1 hashes are used in Git to identify objects. But if we accept the 2^-128 (or however many bits we use) probability of collision, then I believe XOR or addition should also be sufficient to combine hashes of unordered elements. Assuming a random distribution and 128-bit hashes, the probability that A_1 + A_2 + ... + A_k == B_1 + B_2 + ... + B_k (mod 2^128), where the A_i and B_i are hash values, is the same as the probability that A == B (mod 2^128), which is 2^-128, regardless of k.
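To illustrate the idea, here is a minimal Scala sketch of an order-independent fingerprint, assuming a hypothetical FileInfo(name, size, modifiedTime) shape rather than Hyperspace's actual FileBasedSignatureProvider:

```scala
import org.apache.commons.codec.digest.DigestUtils

// Hypothetical stand-in for illustration, not Hyperspace's actual class.
case class FileInfo(name: String, size: Long, modifiedTime: Long)

def fingerprint(fileInfos: Seq[FileInfo]): Array[Byte] =
  fileInfos
    .map(f => DigestUtils.md5(s"${f.name}:${f.size}:${f.modifiedTime}"))
    .foldLeft(new Array[Byte](16)) { (acc, h) =>
      // XOR is commutative and associative, so the combined digest does
      // not depend on iteration order; no O(N log N) sort is needed.
      acc.zip(h).map { case (a, b) => (a ^ b).toByte }
    }
```

One caveat: with XOR, two identical entries cancel out (h XOR h = 0), so addition modulo 2^128 may be the safer combiner if the same FileInfo could ever appear twice in the collection.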
> Because the hash lib (org.apache.commons.codec.digest.DigestUtils) takes & returns String.
DigestUtils has functions returning both byte[] and String. byte[] is the better option for internal handling, for performance, and we can encode/decode the value to/from hex whenever needed (e.g., for humans or JSON).
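For illustration, keeping byte[] internally and converting only at the JSON boundary might look like the following sketch with Commons Codec (the library already mentioned above); this is an assumed usage, not Hyperspace's actual code:

```scala
import org.apache.commons.codec.binary.Hex
import org.apache.commons.codec.digest.DigestUtils

val digest: Array[Byte] = DigestUtils.md5("file contents") // 16 raw bytes

// Hex-encode only when writing JSON or showing the value to a human.
val hex: String = Hex.encodeHexString(digest)

// Decode back to raw bytes when reading the JSON log.
// (decodeHex(char[]) works across Commons Codec versions.)
val decoded: Array[Byte] = Hex.decodeHex(hex.toCharArray)
assert(java.util.Arrays.equals(digest, decoded))
```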
@clee704 Yeah, I'm good with the XOR approach & using byte[] (be aware of its format in JSON).
cc @imback82
Yeah, we should do proper hex encoding/decoding when JSON is involved.
By the way, I think we should also avoid being locked into MD5 and should make it easy to change the underlying hashing algorithm (SHA-1, SHA-256, etc.) in the future.
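One possible way to avoid that lock-in, sketched below with java.security.MessageDigest (a hypothetical illustration, not Hyperspace's actual API):

```scala
import java.security.MessageDigest

// The algorithm name (e.g. "MD5", "SHA-1", "SHA-256") could come from a
// config entry and be recorded in the index metadata, so an index can be
// verified with the same algorithm it was built with.
def signature(data: Array[Byte], algorithm: String = "MD5"): Array[Byte] =
  MessageDigest.getInstance(algorithm).digest(data)
```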
By the way, changing the signature will break backward compatibility, so it would be good to ship it along with other breaking changes.
I assume that the signatures are stored as JSON in users' data lakes. Therefore, any change to the signature computation will make existing signatures stop working (even if we don't change the hash function). Am I right? Then how could you merge PR #268? @sezruby
Do we provide some migration utility to update their index data?
@clee704 A migration utility is a great idea! Can you please open a feature request and link it here so we can keep track of it? While it may not be necessary for the current version, I think as we move towards stabilization, it'd be useful to provide one for users so we don't always have to worry about breaking changes.
@rapoth The release notes state that whenever there are breaking changes, indexes should be "reconstructed". Does that mean users should manually create their indexes again? Because refreshAction won't work: either it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g., if only the signature computation logic changed). Right? So that could be why we might need a migration utility?
> Does it mean that users should manually create indexes again?

Yes.

> Because refreshAction won't work because it can't load the previous version's IndexLogEntry, or weird things can happen if it can (e.g., if only the signature computation logic changed). Right?

Right. Usually it fails to create the IndexLogEntry, but in that case we could reuse the previous index data. I think we could catch the exception from building the IndexLogEntry and trigger the utility to create a new log file.
Hmm, okay. Now I see a need for a migration utility.
> Right. Usually it fails to create the IndexLogEntry, but in that case we could reuse the previous index data. I think we could catch the exception from building the IndexLogEntry and trigger the utility to create a new log file.
Blindly trying to create a log from JSON and catching an exception doesn't seem like a good approach, because we can't be sure that the constructed log is valid. For example, if only the signature computation logic changed between Hyperspace versions, the log will be created just fine, but with invalid signatures.
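One way to make such incompatibilities explicit, as a hypothetical sketch (the version field and names are illustrative, not Hyperspace's actual schema), is to store a format version in the log entry and check it rather than relying on parse failures:

```scala
// Hypothetical sketch: an explicit format version stored with each log
// entry; an entry can parse successfully yet still carry signatures
// computed by an older, incompatible algorithm.
case class IndexLogEntry(version: Int, signature: String)

val CurrentLogVersion = 2 // bumped on any breaking signature/format change

def validate(entry: IndexLogEntry): Either[String, IndexLogEntry] =
  if (entry.version == CurrentLogVersion) Right(entry)
  else Left(s"Log version ${entry.version} requires migration to version $CurrentLogVersion.")
```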