
How should clients handle interrupted updates?

erickt opened this issue 4 years ago • 2 comments

I am currently doing some exploration into how clients should handle interrupted, partially successful updates. For example, say we have a client that has a local cached copy of valid and unexpired metadata. We start an update process which includes a new timestamp, snapshot, and targets metadata. Unfortunately, we download the new timestamp and snapshot and persist them to disk, but the device loses power. Then when power is restored the network is down. We'd still like to make queries against the TUF targets file, but according to the workflow, we should get an error. We can only recover from this by restoring the network.

This is particularly relevant to Fuchsia, because of how we have created our packaging system. We want to treat the TUF targets as the list of executable packages, since it allows us to maintain a cryptographic chain of trust all the way down to the bootloader for what things can be executed. All our packages are stored in a content-addressed filesystem, and we use the custom field in a TUF target to provide the mapping from a human readable name to a merkle addressed package blob. When we try to open a package, we first look in TUF to find the merkle, then we check if we've already downloaded that blob. If so, we open up that package and serve it to the caller. See this slightly stale doc for more details. Due to this interrupted update problem, there's a chance a Fuchsia device could be made unusable until we are able to finish updating our metadata.

Assuming there isn't existing guidance for this, we have had a few ideas on how to approach it:

  • If an update fails, we could still query the local latest targets metadata, assuming it was signed with a key that's still trusted by the root metadata.
  • During the update, we delay writing all the metadata to disk until all the files have been downloaded and verified. Then the files are written in one atomic transaction.
  • For consistent snapshot metadata (the only mode we plan to support), fetch the timestamp metadata but don't persist it to disk yet. Fetch and write the version-prefixed snapshot, targets, and any delegated metadata to disk. Then atomically write the timestamp metadata to disk, and finally clean up any old snapshot/targets/etc. metadata files.
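The ordering in the third idea can be sketched as follows. This is only a sketch under assumptions: `fetch` and `verify` are hypothetical stand-ins for the client's download and verification logic, the metadata field path follows the TUF wireline format, and `os.replace` supplies the atomic rename (atomic on POSIX when source and destination are on the same filesystem).

```python
import os
import tempfile

def atomic_write(path, data):
    """Write data to a temporary file in the destination directory,
    then atomically rename it over the destination."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

def update_metadata(metadata_dir, fetch, verify):
    """Sketch of the proposed write ordering for consistent snapshots.
    fetch(name) -> raw bytes; verify(role, raw) -> parsed metadata
    (both hypothetical)."""
    # 1. Fetch and verify the timestamp, but keep it in memory only.
    raw_timestamp = fetch("timestamp.json")
    timestamp = verify("timestamp", raw_timestamp)
    version = timestamp["signed"]["meta"]["snapshot.json"]["version"]

    # 2. Fetch, verify, and persist the version-prefixed snapshot
    #    (and likewise targets and delegated roles, omitted here).
    #    A crash at this point leaves stray files behind, but the
    #    trusted timestamp on disk is unchanged.
    raw_snapshot = fetch(f"{version}.snapshot.json")
    verify("snapshot", raw_snapshot)
    atomic_write(os.path.join(metadata_dir, f"{version}.snapshot.json"),
                 raw_snapshot)

    # 3. Atomically commit the timestamp last; only now does the
    #    client's trusted view advance.
    atomic_write(os.path.join(metadata_dir, "timestamp.json"),
                 raw_timestamp)
    # 4. Clean up superseded version-prefixed files (omitted).
```

A crash before step 3 leaves the old timestamp in place, so the client still has a complete, self-consistent metadata set; a crash after step 3 leaves only stale files to garbage-collect.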

I'm not sure whether these ideas would weaken the TUF security model, though. Is there a better way of dealing with this? And could we incorporate a solution into the spec (or a POUF?), since I imagine other folks will need one as well.

erickt avatar Dec 03 '19 05:12 erickt

I think option 2 above makes more sense:

During the update, we delay writing all the metadata to disk until all the files have been downloaded and verified. Then the files are written in one atomic transaction.

For implementation simplicity, this could be as simple as having current and next directories (where next is probably a temporary directory). The updater workflow would proceed in the next directory until all files have been downloaded and verified, and only then would the contents of next be moved to current. This is simpler than the current/previous model because we don't have to worry about loading partial metadata, only about preventing it from being persisted.
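That swap could look something like the sketch below. The directory names and the startup recovery step are assumptions on my part; since there is no portable single-syscall atomic directory swap, an interrupted commit is detected and repaired the next time the client starts.

```python
import os
import shutil

def commit_update(repo_root):
    """Promote a fully-verified 'next' directory to 'current'.
    Not one atomic step: a crash between the renames is repaired
    by recover_update() at startup."""
    current = os.path.join(repo_root, "current")
    nxt = os.path.join(repo_root, "next")
    old = os.path.join(repo_root, "current.old")

    os.rename(current, old)   # move the trusted set aside
    os.rename(nxt, current)   # promote the verified set
    shutil.rmtree(old)        # discard the superseded set

def recover_update(repo_root):
    """Run at startup: finish or roll back an interrupted commit."""
    current = os.path.join(repo_root, "current")
    old = os.path.join(repo_root, "current.old")
    if os.path.exists(old) and not os.path.exists(current):
        os.rename(old, current)   # crashed mid-commit: restore old set
    elif os.path.exists(old):
        shutil.rmtree(old)        # crashed after promote: just clean up
```

In every crash window the client can reconstruct a complete metadata set: either the old one (from current.old) or the new one (already promoted to current).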

I don't believe this would weaken the TUF security model, but perhaps others will speak up.

Aside, I'd love to see a POUF for Fuchsia's TUF implementation.

joshuagl avatar Aug 28 '20 13:08 joshuagl

The detailed client workflow states:

Note: If a step in the following workflow does not succeed (e.g., the update is aborted because a new metadata file was not signed), the client should still be able to update again in the future. Errors raised during the update process should not leave clients in an unrecoverable state.

The reference implementation handles this by storing current and previous versions of the metadata.
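In that model, loading falls back to the previous generation when the current one is missing or unreadable. A minimal sketch, assuming a simple current/previous directory layout (the reference implementation's actual paths and parsing differ):

```python
import json
import os

def load_metadata(repo_root, name):
    """Return the first parsable copy of the named metadata file,
    preferring 'current' over 'previous'. Raises if neither is usable."""
    for generation in ("current", "previous"):
        path = os.path.join(repo_root, generation, name)
        try:
            with open(path, "rb") as f:
                return json.load(f)
        except (OSError, ValueError):
            continue  # missing or corrupt; try the older generation
    raise FileNotFoundError(f"no usable copy of {name} in {repo_root}")
```

Note that any metadata loaded this way would still need to pass the normal signature and expiry checks before being trusted.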

joshuagl avatar Aug 28 '20 13:08 joshuagl