[hma] Implement Bank & content-level disable
https://github.com/facebook/ThreatExchange/blob/148d8cc60267ceefd13d324a877868ae43b75d54/hasher-matcher-actioner/src/OpenMediaMatch/storage/postgres/database.py#L89
https://github.com/facebook/ThreatExchange/blob/148d8cc60267ceefd13d324a877868ae43b75d54/hasher-matcher-actioner/src/OpenMediaMatch/storage/postgres/database.py#L140
Both Bank and BankContent have database fields that allow them to be disabled. However, it's possible those fields are neither settable nor read today. Banks can be ramped up fractionally, and BankContent can be set to disabled for a time.
This is a multi-stage feature issue. Roughly, the stages are:
- [x] Confirm that an API exists (under curator role) that allows setting Bank and BankContent disable states (implement if not)
- [x] Bank enable_ratio
- [x] BankContent disable_until_ts
- [ ] Implement fractional matching for Bank
- [ ] If the bank is 0% enabled, it should not contribute its hashes to the index (skipped during indexing)
- [ ] Implement skip in index_build
- [ ] If (0 < enable_pct < 100), then during resolution to the bank, a coinflip should be made to determine whether this lookup counts as a match. This coinflip should be stable (i.e. you get the same answer each time). To make it stable, you can digest the signal string to a value between 0 and 1, and compare that to the enable pct expressed as a fraction.
- [ ] Additionally, we should add an optional content_id string field to the request, which, if provided, should be used as the coinflip seed instead of the signal string.
- [ ] Implement time disabled for BankContent
- [ ] Add a constant which represents "permanently disabled", which should not contribute its hashes to the index (skipped during indexing)
- [ ] Implement skip in index_build
- [ ] If (0 < disable_until_ts < PERMANENTLY_DISABLED), then the request time should be compared against the disable timestamp, and the content should not be considered a match if the request is before that time
- [ ] Raw lookup should not do this check (it's meant to be ~direct access to the index)
- [ ] Unittest everything
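
As a rough sketch of the two match-time checks described above (not the final implementation), something like the following could work. The function names, the choice of SHA-256, and the value of `PERMANENTLY_DISABLED` are all assumptions made for illustration, and the ratio is assumed to be a float between 0 and 1.

```python
import hashlib
import typing as t

# Hypothetical sentinel: the issue asks for a "permanently disabled"
# constant but does not define its value.
PERMANENTLY_DISABLED = 2**63 - 1


def stable_fraction(seed: str) -> float:
    """Digest a string to a repeatable value in [0, 1)."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and never rounds up to 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def bank_is_enabled_for(
    enabled_ratio: float, signal: str, content_id: t.Optional[str] = None
) -> bool:
    """Stable coinflip: the same seed and ratio always give the same answer."""
    if enabled_ratio <= 0.0:
        return False
    if enabled_ratio >= 1.0:
        return True
    # Prefer the optional content_id as the seed when the request has one.
    seed = content_id if content_id is not None else signal
    return stable_fraction(seed) < enabled_ratio


def content_is_enabled_at(disable_until_ts: int, request_ts: float) -> bool:
    """False while the request time is before the disable timestamp."""
    if disable_until_ts >= PERMANENTLY_DISABLED:
        return False
    return request_ts >= disable_until_ts
```

Because the coinflip compares a fixed per-seed value against the ratio, a seed that matches at a given ratio keeps matching as the bank is ramped up further.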
Hey @Dcallies, I wanna make sure I'm understanding the concept of Bank, BankContent, and ContentSignal... looking at the db diagram, it doesn't seem like ContentSignal was included yet. I get Bank and BankContent, so is ContentSignal the set of signal type/value pairs for a specific piece of content (url, file)?
I see BankContent (disable_until_ts) and Bank (enabled_ratio) have a way to be disabled already, so is the goal here to do the same on ContentSignal?
Hey @aryzle, good question. I note that we didn't add documentation to any of these classes to help answer it in the code itself, which is where I'd prefer the answer to live!
- Bank: Conceptually, a collection of content that has been labeled with similar labels. Matches to the contents of this bank should be classified with those labels. Basically a folder.
- BankContent: A single piece of content that has been labeled. Due to data retention limits for harmful content, and hash sharing, this may no longer point to any original content, but represent the idea of a single piece of content.
- ContentSignal: The signals for a single piece of labeled content. We could have also called this `BankContentSignal`.
Matching only takes place on signals. During the lookup operation, we find matching signals and return the ids of the corresponding BankContent, which in turn resolve to their banks, which essentially gives you the classification labels.
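
To make that resolution chain concrete, here is a toy in-memory sketch. The dataclasses and `lookup` function are hypothetical stand-ins rather than the real storage classes, exact string equality stands in for the index's similarity query, and the bank name stands in for its labels.

```python
import typing as t
from dataclasses import dataclass


@dataclass(frozen=True)
class Bank:
    name: str  # assumption: the bank name stands in for its labels


@dataclass(frozen=True)
class BankContent:
    id: int
    bank: Bank


@dataclass(frozen=True)
class ContentSignal:
    content: BankContent
    signal_type: str
    signal_val: str


def lookup(
    index: t.Iterable[ContentSignal], signal_type: str, signal_val: str
) -> t.Dict[int, str]:
    """Return {BankContent id: bank name} for every matching signal."""
    results: t.Dict[int, str] = {}
    for sig in index:
        # Exact equality is a placeholder for a real similarity query.
        if sig.signal_type == signal_type and sig.signal_val == signal_val:
            results[sig.content.id] = sig.content.bank.name
    return results


bank = Bank("example_bank")
content = BankContent(1, bank)
index = [
    ContentSignal(content, "pdq", "abc123"),
    ContentSignal(content, "video_md5", "def456"),
]
```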
> I see BankContent (disable_until_ts) and Bank (enabled_ratio) have a way to be disabled already, so is the goal here to do the same on ContentSignal?

Nope, we only need the ability to disable BankContent, but the functionality is unimplemented! We need:
- An API that allows setting disable
- The disable state to be read during matching, to ignore it during lookup
- The disable state to be read during indexing, to not add it to the index
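
For the indexing side, a minimal sketch of the skip might look like the following. `SignalRow`, `rows_for_index`, and the `PERMANENTLY_DISABLED` value are hypothetical names for illustration; the real index build reads from the database and may be structured quite differently.

```python
import typing as t

# Hypothetical sentinel for "permanently disabled"; the real value is undecided.
PERMANENTLY_DISABLED = 2**63 - 1


class SignalRow(t.NamedTuple):
    """Hypothetical flattened row: one signal plus its disable state."""

    signal_val: str
    bank_content_id: int
    bank_enabled_ratio: float
    disable_until_ts: int


def rows_for_index(rows: t.Iterable[SignalRow]) -> t.Iterator[SignalRow]:
    """Yield only the rows that should contribute hashes to the index."""
    for row in rows:
        if row.bank_enabled_ratio <= 0.0:
            continue  # bank is 0% enabled
        if row.disable_until_ts >= PERMANENTLY_DISABLED:
            continue  # content is permanently disabled
        yield row
```

Temporarily disabled content is deliberately left in the index here, since it may become enabled again before the next build. That case would be filtered at match time instead.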