
Change TPM2 DA Lockout policy for tamper evidence use case

Open tlaurion opened this issue 5 months ago • 5 comments

Looking at openQA jobs, even reducing the recovery time to 20min (below the duration of any test job) should fix it. Even if some startup/cleanup actions triggered a DA counter increase, it would recover before the next test job. And 10 attempts per 20min still sounds like an effective anti-bruteforce measure, right?

@marmarek : If we go back to what the TPM DA lockout is meant for, namely brute-force prevention/rate limiting/tamper evidence, and to Heads' current use cases, which are:

  • To only permit 3 TPM Disk Unlock Key (DUK) passphrase attempts before proposing that the user fall back to the LUKS Disk Recovery Key (DRK) passphrase (chosen at OS install/LUKS passphrase change). Here we have to go back to what a plausible brute-force attempt looks like in this case.
    • I would say 3 attempts max in 1 minute, meaning 3 counters consumed in 3 minutes at most.
    • 3 reboots to attempt different TPM DUK passphrases (a shorter passphrase is expected than the LUKS DRK, so this is enticing for an evil maid who has a partial recording of that passphrase and is willing to try) would consume maybe another 3 counters, with the earlier attempts possibly aging out before failing again.
    • An attacker could make 3x3 + 1 (10) attempts before locking out, in about 12 minutes? Or be limited to trying at most 15 passphrases in roughly 20 minutes if the recovery time were switched from 3600s (1h) to 600s (10m); see the back-of-envelope sketch after this list. @marmarek thoughts?
  • To consume 1 TPM DA counter upon flashing a firmware upgrade/tampered firmware not matching the previously sealed TPMTOTP, which requests the TPM2 Owner password (a failed attempt consumes 1 DA counter and dies here, the user having to retry), plus the HOTP dongle and its PIN upon TPMTOTP resealing success (either the GPG Admin PIN for the Librem Key or the Secure App PIN for the Nitrokey 3), plus setting the TPM Disk Unlock Key (DUK) on default boot (requiring the LUKS DRK); otherwise the system is left in a tamper-evident state.
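
For a rough back-of-envelope of what these parameters translate to as a rate limit, here is a small sketch. It assumes the reading used in this thread (each of the max-tries counters ages out on its own after recovery-time seconds); the numbers are illustrative, not measurements:

    #!/bin/sh
    # Back-of-envelope arithmetic for the rate limit discussed above; not Heads code.
    max_tries=10          # --max-tries in the Heads policy
    recovery_time=600     # --recovery-time in seconds (proposed 600s; currently 3600s)

    attempts_per_hour=$(( max_tries * 3600 / recovery_time ))
    echo "at most ${max_tries} attempts per ${recovery_time}s window, ~${attempts_per_hour}/hour sustained"
    # recovery_time=3600 -> 10/hour; 1200 (20 min) -> 30/hour; 600 -> 60/hour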

I tried a 600s (10m) recovery time for each TPM DA counter: it worked great. Thoughts on this change? (To be opened as a separate issue to change the TPM reset policy.)

Originally posted by @tlaurion in #1986


Current policy, configured on TPM reset so that all TPM2 chips are configured the same way (otherwise the defaults depend on the manufacturer)

Master's TPM DA lockout policy is set upon TPM reset: https://github.com/linuxboot/heads/blob/d0350e02f4f88fbc16eb26893c2edc8e7cdae441/initrd/bin/tpmr#L677-L698

  • --max-tries=10: 10 bad-auth counters, each aging out after the defined recovery time. TPM DA lockout (cannot unseal, cannot use any TPM-related secret) happens after 10 counters are consumed; a counter is consumed by a forced poweroff, a forced reset (power failure detected by the TPM) or a TPM auth failure.
  • --recovery-time=3600: the aging-out time window for each TPM DA counter (of the max tries above) once triggered. Each counter currently takes 3600 seconds (1h) to age out. As long as fewer than 10 counters are pending age-out, the system is not in TPM DA lockout.
  • --lockout-recovery-time=0: 0 means no lockout recovery time. The only way to clear a TPM DA lockout is then to reset the TPM per the current policy, which is applied in keeping with the tamper-evidence contract Heads enforces. (The resulting DA state can be inspected as sketched below.)
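
For reference, the resulting DA parameters and the current counter can be inspected with plain tpm2-tools; a minimal read-only sketch (not part of the tpmr code, and property names may vary slightly between tpm2-tools versions):

    # Inspect the TPM2 dictionary-attack state; read-only, safe to run.
    tpm2 getcap properties-variable | grep -E \
        'TPM2_PT_LOCKOUT_COUNTER|TPM2_PT_MAX_AUTH_FAIL|TPM2_PT_LOCKOUT_INTERVAL|TPM2_PT_LOCKOUT_RECOVERY'
    # TPM2_PT_LOCKOUT_COUNTER  - bad-auth counters currently consumed
    # TPM2_PT_MAX_AUTH_FAIL    - the --max-tries value (10 here)
    # TPM2_PT_LOCKOUT_INTERVAL - the --recovery-time value in seconds
    # TPM2_PT_LOCKOUT_RECOVERY - the --lockout-recovery-time value (0 here)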

Master's TPM DA lockout cannot be unlocked, since the TPM DA lockout unlock passphrase is randomized and unknown: https://github.com/linuxboot/heads/blob/d0350e02f4f88fbc16eb26893c2edc8e7cdae441/initrd/bin/tpmr#L700-L707

  • There is a lockout unlock passphrase that can be defined, and that could be used as a "good auth attempt" to reset the counters, or the lockout itself, to 0; that is what this code sets.
  • As can be seen, this lockout passphrase is changed from its default empty value (which would let anyone unlock/reset the TPM DA counters without auth): here we set it to a random value. Again, this is configured in keeping with Heads' tamper-evidence policy, which, once TPM DA lockout is reached, requires a TPM reset. See the above-referenced comment from the other issue to see why resetting the TPM alone will not go unnoticed. (A minimal sketch of this step follows below.)
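
For illustration, the effect of that code can be approximated with plain tpm2-tools as below. This is a sketch only, not the actual tpmr invocation (tpmr authorizes through an encrypted session), and the random-value generation here is an assumption for the example:

    # Sketch: set the lockout hierarchy auth to a random value that is never stored,
    # so TPM2_DictionaryAttackLockReset can no longer be used to clear the counters.
    random_auth="$(head -c 16 /dev/urandom | xxd -p -c 32)"
    tpm2 changeauth -c lockout "$random_auth"
    unset random_auth   # intentionally discarded: the lockout can now only be cleared by a TPM reset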

EDIT: 600s is 10m not 5m, sorry for the error

tlaurion avatar Jul 16 '25 15:07 tlaurion

    tpm2 dictionarylockout -Q --setup-parameters \
        --max-tries=10 \
        --recovery-time=600 \
        --lockout-recovery-time=0 \
        --auth="session:$ENC_SESSION_FILE" >/dev/null 2>&1 || LOG "Unable to set dictionary lockout parameters"

An attacker could make 3x3 + 1 (10) attempts before locking out, in about 12 minutes? Or be limited to trying at most 15 passphrases in roughly 20 minutes if the recovery time were switched from 3600s (1h) to 600s (5m). @marmarek thoughts?

So if we changed 3600s (1h) to 600s (5m):

  • 9 attempts (3 reboots of 3 attempts each) won't trigger TPM DA lockout within 5 minutes.
  • 10 attempts (4 reboots of 3 attempts each) might trigger TPM DA lockout within 5 minutes for an attacker, but not really for a normal user (5 minutes is quite short and it would need to be done in a sprint).
  • After 5 minutes, the first 3 bad attempts will begin to age out. We can consider that around the 6th to 7th minute, the first 3 counters will have aged out, freeing up 3 of the consumed counters.
  • Etc.

QA CI/CD testing with forced power-offs and 600s: that is a lot of counters to consume, equivalent to a lot of bad TPM DUK attempts (unless DUK unlocking itself is tested, which consumes those counters faster than reboots?). I think 600s (5m) would be good.

--

@marmarek note that for this to be used in OpenQA (deployed on the platform), it will need to land in a downstream release even if merged as a PR, and comply with QubesOS certification restrictions on the number of releases permitted in a year (it was a 1-year freeze for the PrivacyBeast back then; I hope this has changed).

Otherwise, as I said, the best mitigation for the currently deployed TPM DA lockout policy is to extend the use of ctrl-alt-delete (clean reboots rather than forced power-offs) as much as possible, and to infer the cause of a lockout in CI/CD tests from the last successful test run, meaning more QA runs with fewer changes to test between each CI/CD run.

tlaurion avatar Jul 16 '25 15:07 tlaurion

As said elsewhere, completely avoiding hard poweroffs is not feasible; that's the nature of tests - stuff crashes (if everything always worked, we wouldn't need to run those tests in the first place). It can be reduced a bit, but I'm not sure that will be enough in a pessimistic case with the 1h recovery time.

A 5min recovery time should be short enough not to trigger lockout in CI. But also, it feels like it doesn't limit bruteforcing much then - Heads already limits to 3 attempts per startup, and then you need to reboot to try again; 4 reboots already take a significant part of those 5 minutes. Still, enforcing a limit of 10 attempts per 5 minutes (so, 120 attempts per 1h) sounds like it still prevents real brute-force attempts. If somebody has enough info about the passphrase to narrow it down to a few tens/hundreds of attempts, a stricter rate limit wouldn't IMHO help in practice either.

For the CI case specifically, I think 20min would also work (30 attempts per 1h), as it's shorter than any individual CI job. But if you are comfortable with changing it to 5min, I won't complain.

marmarek avatar Jul 16 '25 23:07 marmarek

Wait, 600s is 10min, not 5min. So, it's 60 attempts per 1h, not 120. 600s would work too.

marmarek avatar Jul 16 '25 23:07 marmarek

FWIW I loaded the 600s version now. It passed this round of CI tests, but there weren't many failures this time.

marmarek avatar Jul 17 '25 12:07 marmarek

The comment at https://github.com/linuxboot/heads/issues/1988#issuecomment-3079250158 wrongly converted 600s to 5m. 600s is 10 minutes.

Let's revisit the now-hidden comment:

    tpm2 dictionarylockout -Q --setup-parameters \
        --max-tries=10 \
        --recovery-time=600 \
        --lockout-recovery-time=0 \
        --auth="session:$ENC_SESSION_FILE" >/dev/null 2>&1 || LOG "Unable to set dictionary lockout parameters"

An attacker could make 3x3 + 1 (10) attempts before locking out, in about 12 minutes? Or be limited to trying at most 15 passphrases in roughly 20 minutes if the recovery time were switched from 3600s (1h) to 600s (5m). @marmarek thoughts?

So if we changed 3600s (1h) to 600s (5m):

* 9 attempts (3 reboots of 3 attempts each) won't trigger TPM DA lockout within 5 minutes.

* 10 attempts (4 reboots of 3 attempts each) might trigger TPM DA lockout within 5 minutes for an attacker, but not really for a normal user (5 minutes is quite short and it would need to be done in a sprint).

* After 5 minutes, the first 3 bad attempts will begin to age out. We can consider that around the 6th to 7th minute, the first 3 counters will have aged out, freeing up 3 of the consumed counters.

* Etc.

QA CI/CD testing with forced power-offs and 600s: that is a lot of counters to consume, equivalent to a lot of bad TPM DUK attempts (unless DUK unlocking itself is tested, which consumes those counters faster than reboots?). I think 600s (5m) would be good.

--

@marmarek note that for this to be used in OpenQA (deployed on the platform), it will need to land in a downstream release even if merged as a PR, and comply with QubesOS certification restrictions on the number of releases permitted in a year (it was a 1-year freeze for the PrivacyBeast back then; I hope this has changed).

Otherwise, as I said, the best mitigation for the currently deployed TPM DA lockout policy is to extend the use of ctrl-alt-delete (clean reboots rather than forced power-offs) as much as possible, and to infer the cause of a lockout in CI/CD tests from the last successful test run, meaning more QA runs with fewer changes to test between each CI/CD run.

@marmarek a lot can happen within 10 minutes. We can take for granted 3 TPM DUK attempts within 1 minute, and a reboot and retry within the next minute. Basically, yes: 10 attempts could be consumed within 5 minutes; 10 attempts can definitely be consumed within 10 minutes, since TPM DA counters won't age out until 10 minutes have passed since the first bad attempt. I wonder what would happen in practice with a 10-minute counter, just as a lot can currently happen with the current 3600s (1 hour) counter for each attempt, while no issue has been raised on the matter up to now.

I feel we still have a misunderstanding of how the counter works though, @marmarek.

If TPM DA counters could have aged out within the window assumed when 600s was thought to be 5 minutes (300s), they won't age out in that same window now that it is understood to be 10 minutes (600s). This means the TPM DUK scenario above (which I hid) is invalid.

TLDR: If your sole concern is whether the TPM DA lockout counters decrease while a QA run (20 minutes) happens, the only thing I can ask you here to finish the thought process correctly is how many forced reboots QA can issue within those 20 minutes, and what a good aging-out value would be (a 20-minute value would never age out within a QA run, and if another run starts right after, that will definitely lead to a TPM DA lockout again in the future...). For TPM DUK brute-force rate limiting, anything of 5 minutes or more would do; what has to be fixed here is the number of counters (currently 10). I think there was a misunderstanding about the fact that, in practice, each TPM DA counter ages out individually; to see this in action, one has to lower the value from 3600s to something short enough to be observable while waiting, again in DEBUG mode for now in the referred branch I used for testing (see the sketch just below); and the policy is applied on the TPM reset call.
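
To make that aging-out behavior observable without waiting an hour, something along these lines can be used; a sketch only, assuming a test TPM whose lockout auth is still empty (on a Heads-provisioned TPM the randomized lockout auth would first require a TPM reset), with a deliberately short recovery time chosen for observation rather than as a proposed policy:

    # 1. Temporarily shrink the recovery time so aging out is visible within minutes.
    tpm2 dictionarylockout --setup-parameters \
        --max-tries=10 \
        --recovery-time=120 \
        --lockout-recovery-time=0

    # 2. Consume a few counters (e.g. wrong TPM DUK passphrases, or a forced poweroff),
    #    then watch the lockout counter decrease as attempts age out.
    watch -n 30 "tpm2 getcap properties-variable | grep TPM2_PT_LOCKOUT_COUNTER"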

tlaurion avatar Jul 17 '25 17:07 tlaurion