[hma] Implement Bank & content-level disable

Open Dcallies opened this issue 1 year ago • 2 comments

https://github.com/facebook/ThreatExchange/blob/148d8cc60267ceefd13d324a877868ae43b75d54/hasher-matcher-actioner/src/OpenMediaMatch/storage/postgres/database.py#L89

https://github.com/facebook/ThreatExchange/blob/148d8cc60267ceefd13d324a877868ae43b75d54/hasher-matcher-actioner/src/OpenMediaMatch/storage/postgres/database.py#L140

Both Bank and BankContent have database fields that allow them to be disabled. However, those fields may not be settable or read today. The fields are meant to let a Bank be ramped up fractionally and a BankContent be disabled for a time.

This is a multi-stage feature issue. Roughly, the stages are:

  • [x] Confirm that an API exists (under curator role) that allows setting Bank and BankContent disable states (implement if not)
    • [x] Bank enable_ratio
    • [x] BankContent disable_until_ts
  • [ ] Implement fractional matching for Bank
    • [ ] If the bank is 0% enabled, it should not contribute its hashes to the index (skipped during indexing)
      • [ ] Implement skip in index_build
    • [ ] If (0 < enable_pct < 100), then during resolution to the bank, a coinflip should be made to determine whether this lookup counts as a match. This coinflip should be stable (i.e. you get the same answer each time). To make it stable, you can digest the signal string to a value between 0 and 1, and compare that to the enable pct (expressed as a fraction).
      • [ ] Additionally, we should add an optional content_id string field to the request, which if provided, should be the source of the coinflip seed instead.
  • [ ] Implement time disabled for BankContent
    • [ ] Add a constant which represents "permanently disabled", which should not contribute its hashes to the index (skipped during indexing)
      • [ ] Implement skip in index_build
    • [ ] If (0 < disable_until_ts < PERMANENTLY_DISABLED), then the request time should be compared against the disable timestamp, and the content should not be considered a match if the request time is before that timestamp
    • [ ] Raw lookup should not do this check (it's meant to be ~direct access to the index)
  • [ ] Unittest everything
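
For the stable coinflip stage above, here is a minimal sketch of one way it could work. It is not taken from the codebase: the function names `stable_fraction` and `passes_bank_coinflip` are hypothetical, SHA-256 is just one reasonable choice of digest, and the ratio is assumed to be stored as a fraction in [0, 1].

```python
import hashlib
from typing import Optional


def stable_fraction(seed: str) -> float:
    """Map a string to a deterministic value in [0, 1)."""
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the result is exactly
    # representable as a float and always strictly less than 1.0.
    top_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_bits / float(1 << 53)


def passes_bank_coinflip(
    enable_ratio: float, signal: str, content_id: Optional[str] = None
) -> bool:
    """Decide whether a lookup counts as a match for a partially enabled bank.

    Assumption: enable_ratio is a fraction in [0, 1], not a percentage.
    """
    if enable_ratio <= 0.0:
        return False  # fully disabled bank never matches
    if enable_ratio >= 1.0:
        return True  # fully enabled bank always matches
    # Prefer the caller-supplied content_id as the seed when present,
    # otherwise fall back to the signal string itself.
    seed = content_id if content_id is not None else signal
    return stable_fraction(seed) < enable_ratio
```

Because the seed is hashed rather than passed to a random number generator, the same signal (or the same content_id) gets the same answer on every lookup and on every host, with no state to store.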

Dcallies avatar Nov 06 '24 18:11 Dcallies

Hey @Dcallies I wanna make sure I'm understanding the concept of Bank, BankContent, and ContentSignal... looking at the db diagram it doesn't seem like ContentSignal was included yet. I get Bank and BankContent, so is ContentSignal different signal type/value pairs for a specific piece of content (url, file)?

I see BankContent (disable_until_ts) and Bank (enabled_ratio) have a way to be disabled already, so is the goal here to do the same on ContentSignal?

aryzle avatar Dec 29 '24 21:12 aryzle

Hey @aryzle , good question, and I note that we didn't add documentation to any of these classes to help answer it in the code itself, which is where I'd prefer the answer to live!

  • Bank: Conceptually, a collection of content that has been labeled with similar labels. Matches to the contents of this bank should be classified with those labels. Basically a folder.
  • BankContent: A single piece of content that has been labeled. Because of data retention limits for harmful content, and because of hash sharing, this may no longer point to any original content and instead represents the idea of a single piece of content.
  • ContentSignal: The signals for a single piece of labeled content. We could have also called this BankContentSignal

Matching only takes place on signals. During the lookup operation, we find matching signals and return the ids of the corresponding BankContent, which resolve further to the banks themselves, which in turn essentially return the classification labels.
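
To make that resolution chain concrete, here is a rough in-memory sketch. The dataclasses and the `lookup` function are hypothetical stand-ins, not the actual database models or index, and exact string equality stands in for whatever similarity matching the real index does.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Bank:
    name: str
    labels: Set[str] = field(default_factory=set)


@dataclass
class BankContent:
    id: int
    bank: Bank


@dataclass
class ContentSignal:
    content_id: int  # id of the BankContent this signal belongs to
    signal_type: str
    signal_val: str


def lookup(
    signal_type: str,
    signal_val: str,
    signals: List[ContentSignal],
    contents: Dict[int, BankContent],
) -> Dict[str, Set[str]]:
    """Resolve signal -> BankContent -> Bank -> labels.

    Returns a mapping of bank name to that bank's labels.
    """
    result: Dict[str, Set[str]] = {}
    for sig in signals:
        # Exact equality is a placeholder for the real index match.
        if sig.signal_type == signal_type and sig.signal_val == signal_val:
            bank = contents[sig.content_id].bank
            result.setdefault(bank.name, set()).update(bank.labels)
    return result


# Usage: one bank, one piece of content, one signal for that content.
bank = Bank(name="EXAMPLE_BANK", labels={"example_label"})
contents = {1: BankContent(id=1, bank=bank)}
signals = [ContentSignal(content_id=1, signal_type="pdq", signal_val="abc123")]
matches = lookup("pdq", "abc123", signals, contents)
```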

> I see BankContent (disable_until_ts) and Bank (enabled_ratio) have a way to be disabled already, so is the goal here to do the same on ContentSignal?

Nope, we only need the ability to disable BankContent - but the functionality is unimplemented! We need:

  1. An API that allows setting disable
  2. The disable state to be read during matching, to ignore it during lookup
  3. The disable state to be read during indexing, to not add it to the index
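
Items 2 and 3 could be sketched roughly as below. This is only an illustration under assumptions: the sentinel `PERMANENTLY_DISABLED_TS` and both function names are made up here, timestamps are assumed to be integer Unix seconds, and 0 is assumed to mean "not disabled".

```python
import time
from typing import Optional

# Hypothetical sentinel meaning "disabled forever"; the real constant,
# if one is added, may use a different name and value.
PERMANENTLY_DISABLED_TS = 2**31 - 1


def should_index(disable_until_ts: int) -> bool:
    """Indexing-time check: skip only permanently disabled content.

    Temporarily disabled content stays in the index so that it starts
    matching again once its timestamp passes, without a rebuild.
    """
    return disable_until_ts < PERMANENTLY_DISABLED_TS


def is_match_enabled(disable_until_ts: int, now_ts: Optional[int] = None) -> bool:
    """Lookup-time check: content matches only once its disable time has passed."""
    if disable_until_ts >= PERMANENTLY_DISABLED_TS:
        return False
    if now_ts is None:
        now_ts = int(time.time())
    # 0 (the assumed "enabled" value) is always <= now, so it always matches.
    return now_ts >= disable_until_ts
```

Splitting the check in two keeps the index build simple (only the permanent case changes what gets indexed), while the time comparison happens per request against the request time.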

Dcallies avatar Dec 30 '24 12:12 Dcallies