python-tuf

RFE: expose delegated metadata to client application

Open jku opened this issue 3 years ago • 16 comments

EDIT: The overall issue is described in detail in https://docs.google.com/document/d/1rWHAM2qCUtnjWD4lOrGWE2EIDLoA7eSy4-jB66Wgh0o . The suggestion here is roughly the "Metadata role (file) as search index" solution in the document.

Assume a setup like this (this is what we expect a community artifact repository like PyPI to look like if it uses developer signatures with TUF):

  • a specific project/product team controls a delegated metadata file
  • TUF clients want to know details of all of the artifacts in this metadata (e.g. to figure out which versions of an artifact are available)

Currently there is no way for the client application to get the whole metadata content from ngclient. We could provide a call much like get_targetinfo() that, instead of a TargetFile, returns the Targets object where the target search ended:

def get_targets_metadata(target_path: str) -> Targets:
    """Return the Targets object of the metadata where the search for target_path terminated."""

This is not applicable to every TUF repo:

  • it requires a "contract" between repository and client: the client has to know of a target_path that is delegated to the correct metadata -- in the PyPI example it could be e.g. the PyPI project name
  • this is only useful if all "related" target files are listed in the same metadata

But with those assumptions the client can now easily get not just the list of target files it's interested in but also any custom metadata embedded in the targets metadata.

I've not thought through all the cases (what happens if there is no targetpath match? what if there is no terminating delegation?), but I think this is something we could consider implementing.
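A minimal sketch of what such a call might do, using simplified stand-in classes rather than the real python-tuf Metadata API. `TargetsRole`, `Delegation` and the `repo` dictionary are hypothetical, and path matching uses `fnmatch`, which is looser than TUF's path pattern rules. Returning the last visited role when nothing matches is only one possible answer to the open questions above.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Dict, List, Optional


@dataclass
class Delegation:
    """Hypothetical stand-in for one delegated role entry."""
    name: str
    paths: List[str]
    terminating: bool = False


@dataclass
class TargetsRole:
    """Hypothetical stand-in for one targets metadata file."""
    name: str
    targets: Dict[str, dict] = field(default_factory=dict)
    delegations: List[Delegation] = field(default_factory=list)


def get_targets_metadata(
    repo: Dict[str, TargetsRole], target_path: str
) -> Optional[TargetsRole]:
    """Return the role where the search for target_path terminated."""
    to_visit = ["targets"]  # stack for a pre-order depth-first walk
    visited = set()
    last = None
    while to_visit:
        role = repo[to_visit.pop()]
        if role.name in visited:
            continue
        visited.add(role.name)
        last = role
        if target_path in role.targets:
            return role
        children = []
        for delegation in role.delegations:
            if any(fnmatch(target_path, p) for p in delegation.paths):
                children.append(delegation.name)
                if delegation.terminating:
                    to_visit = []  # a terminating match ends all backtracking
                    break
        to_visit.extend(reversed(children))
    # Assumption: with no exact match, fall back to the last role visited.
    return last


# Hypothetical repository: one team role per project prefix.
repo = {
    "targets": TargetsRole(
        "targets",
        delegations=[
            Delegation("foo-team", ["foo/*"], terminating=True),
            Delegation("other", ["*"]),
        ],
    ),
    "foo-team": TargetsRole(
        "foo-team",
        targets={"foo/foo-1.0.tar.gz": {}, "foo/foo-1.1.tar.gz": {}},
    ),
    "other": TargetsRole("other", targets={"bar/bar-1.0.tar.gz": {}}),
}

role = get_targets_metadata(repo, "foo/foo-1.0.tar.gz")
print(role.name, sorted(role.targets))
```

Under the "contract" described above, the client could pass any path under the project's prefix and read every version from the returned role's `targets`.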

jku avatar May 04 '22 08:05 jku

I would expect that in the case that this target file delegates to other target files, it would include all of the items listed there. (This would continue transitively.) It would also need to handle special delegation cases, thresholds, etc. Is this your thinking as well?

JustinCappos avatar May 05 '22 07:05 JustinCappos

I was not thinking that, no -- but that could work as well...

My original idea was to literally return the equivalent of the "signed" JSON object of the "final" targets metadata (the one that terminates the delegation search either by containing the targetpath or by terminating=True). This would require no extra processing on the client's part but, as I mentioned, is only useful if all "related" target files are listed in the same metadata. Your idea of building a list of target files while doing the depth-first search through the delegation tree (and appending all target files iff the targets metadata happens to be part of the delegated portion of the tree) is certainly more complex, but it is interesting because it removes the limitation of "one metadata for related target files". I would have to prototype it to see if there are any unintended results.

This is what I proposed:

def get_targets_metadata(target_path: str) -> Targets:
    """Return the Targets object of the metadata where the search for target_path terminated."""

This is what I think Justin is describing:

def get_all_targetinfos(target_path: str) -> List[TargetFile]:
    """Return a list of all target files in all targets metadata that form the delegation chain for 'target_path'."""

I don't think there's anything special to handle wrt thresholds etc.: if the delegations work for a normal targetpath search, they should work for this.
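The second variant could be sketched roughly as follows. This is a simplified model, not python-tuf code: `REPO` is a hypothetical dictionary layout, paths are matched with `fnmatch`, the function returns plain path strings instead of TargetFile objects, and skipping the top-level role is just one reading of "the delegated portion of the tree".

```python
from fnmatch import fnmatch
from typing import Dict, List

# Hypothetical repository: role name -> listed targets and ordered delegations.
REPO: Dict[str, dict] = {
    "targets": {
        "targets": {},
        "delegations": [
            {"name": "foo-team", "paths": ["foo/*"], "terminating": True},
        ],
    },
    "foo-team": {
        "targets": {"foo/foo-1.0.tar.gz": {"length": 10}},
        "delegations": [
            {"name": "foo-release", "paths": ["foo/foo-2.*"], "terminating": False},
        ],
    },
    "foo-release": {
        "targets": {"foo/foo-2.0.tar.gz": {"length": 20}},
        "delegations": [],
    },
}


def get_all_targetinfos(repo: Dict[str, dict], target_path: str) -> List[str]:
    """Collect target paths from every delegated role visited for target_path."""
    found: List[str] = []
    to_visit = ["targets"]  # stack for a pre-order depth-first walk
    visited = set()
    while to_visit:
        name = to_visit.pop()
        if name in visited:
            continue
        visited.add(name)
        role = repo[name]
        if name != "targets":  # assumption: only delegated roles contribute
            found.extend(sorted(role["targets"]))
        if target_path in role["targets"]:
            break  # the search for target_path ends here
        children = []
        for delegation in role["delegations"]:
            if any(fnmatch(target_path, p) for p in delegation["paths"]):
                children.append(delegation["name"])
                if delegation["terminating"]:
                    to_visit = []
                    break
        to_visit.extend(reversed(children))
    return found


print(get_all_targetinfos(REPO, "foo/foo-2.0.tar.gz"))
print(get_all_targetinfos(REPO, "foo/foo-1.0.tar.gz"))
```

In this model the result depends on which targetpath is used as the search key: the second call stops at foo-team and never sees foo-release.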

jku avatar May 05 '22 09:05 jku

@jku If you are looking for another use case: it looks like our notsotuf client would also benefit from such a feature.

dennisvang avatar Jun 02 '22 07:06 dennisvang

@dennisvang it may have been your comments some time ago that got me thinking about it :)

Btw, if you have any feedback or suggestions on python-tuf 1.x from a downstream perspective, that would be very welcome -- creating an issue is fine, or Slack works too.

jku avatar Jun 02 '22 07:06 jku

This is what I proposed:

def get_targets_metadata(target_path: str) -> Targets:
    """Return the Targets object of the metadata where the search for target_path terminated."""

This is what I think Justin is describing:

def get_all_targetinfos(target_path: str) -> List[TargetFile]:
    """Return a list of all target files in all targets metadata that form the delegation chain for 'target_path'."""

I think this is a useful feature! Here are some unordered thoughts/questions on the two existing proposals:

  • Both solutions require a contract between repo and client that the function yields all/only "related" target file infos.
  • Solution 2 seems more general/flexible, as it covers cases where related target file infos are spread across multiple targets metadata files AND cases where they are all in the last targets metadata file. This seems like an advantage.
  • Is solution 2 more prone to also serve unrelated target file infos? Probably not / depends on the "contract".
  • Why does solution 1 return a full Targets object and solution 2 a list of only TargetFiles?
  • Do we need the full Targets metadata in either case?
  • Is target_path here the same as in get_targetinfo or can it also be a path prefix or path pattern?
  • If target_path can be only a part of the path or a pattern, then the delegation tree might not resolve in the same way as the individual "related" files would.

lukpueh avatar Jun 07 '22 13:06 lukpueh

@jku issue #822 looks related.

dennisvang avatar Jun 08 '22 11:06 dennisvang

Yes, it definitely is related. Searching is still a very complex beast and we shouldn't think this will solve that problem completely. I don't think this library can even solve searching in general: it really is an application problem. But we could provide this functionality so that repositories can design their content to support specific types of searches.

jku avatar Jun 08 '22 13:06 jku

Forgot to respond to Lukas here:

Is solution 2 more prone to also serve unrelated target file infos? Probably not / depends on the "contract".

Yeah, there's certainly a chance that an earlier targets metadata contains "unrelated" files that get listed (in the same sense that it allows multiple metadata to contain the "related" files). This is what makes the two approaches different...

Why does solution 1 return a full Targets object and solution 2 a list of only TargetFiles? Do we need the full Targets metadata in either case?

Solution 1 returns Targets just because it can -- I figured this would allow e.g. custom fields in the Targets to be available to the client. In the second option it's not as simple.

I don't know of a specific need for Targets.

Is target_path here the same as in get_targetinfo or can it also be a path prefix or path pattern?

I think it has to be the same thing: an explicit targetpath that in this case is just used to find "all targetfiles in the chain of delegations for this targetpath" (or "...in the last delegation for this targetpath" for solution 1). It's a bit unintuitive but could be useful...

jku avatar Jun 15 '22 08:06 jku

Do we need the full Targets metadata in either case?

Oh, and the opinion I forgot: I lean slightly towards the List[TargetFile] return value regardless of solution. The ngclient public API already includes TargetFile but does not currently expose the Signed derivatives or other Metadata API details, and I like that split.

jku avatar Jun 15 '22 08:06 jku

The List[TargetFile] option would be sufficient for our specific use-case.

Currently, we base our search on the target_path values obtained from tuf.ngclient.updater.Updater._trusted_set.targets.signed.targets.keys(), although I'm not sure that would always work.

dennisvang avatar Jun 15 '22 08:06 dennisvang

Assume a setup like this (this is what we expect a community artifact repository like PyPI to look like if it uses developer signatures with TUF):

  • a specific project/product team controls a delegated metadata file
  • TUF clients want to know details of all of the artifacts in this metadata (e.g. to figure out which versions of an artifact are available)

Currently there is no way for the client application to get the whole metadata content from ngclient.

After discussing with @kairoaraujo we realised that using hashbin delegation anywhere in the delegation chain breaks this idea. Because the hashing happens over the complete artifact targetpath (and not some policy object like "project name") we can't possibly list all targets related to a project or find out the current version of a product.
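A small sketch of why hashed bins scatter a project's artifacts. The `bin_for` helper and the "bin-<prefix>" naming are hypothetical. The only part taken from the discussion is that the digest is computed over the complete targetpath.

```python
import hashlib
from collections import defaultdict


def bin_for(target_path: str, prefix_len: int = 1) -> str:
    """Return a hypothetical bin name chosen from the SHA-256 of the full path."""
    digest = hashlib.sha256(target_path.encode("utf-8")).hexdigest()
    return f"bin-{digest[:prefix_len]}"


# Versions of the same (made-up) project, plus an unrelated artifact.
paths = [
    "foo/foo-1.0.tar.gz",
    "foo/foo-1.1.tar.gz",
    "foo/foo-2.0.tar.gz",
    "bar/bar-1.0.tar.gz",
]

bins = defaultdict(list)
for path in paths:
    bins[bin_for(path)].append(path)

# Each path lands in whichever bin its own digest selects, so nothing ties
# the bin to the project: the "foo" releases are not grouped together.
for name in sorted(bins):
    print(name, bins[name])
```

Because the bin depends on every character of the path, a client that knows only the project name "foo" has no way to compute which bins hold its releases.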

This is just a side effect of TUF not really understanding concepts like project, product or version: everything is an independent artifact in TUF. There are multiple questions this architecture (when using hashed bins) can't solve without additional data:

  • what is the newest version of product X?
  • which versions exist for product X?
  • what products are owned (signed) by project Y?

At least the first one is a question all package repository clients want to answer. Maybe larger repositories just are going to need an additional layer to handle that (and to store the project/product/version mapping in TUF target files to secure that info, just like PEP-458 currently does)...

This leads to another question: if you have to include more structured data about your artifacts in TUF already, why not include the TARGETINFO data there as well -- I mean the download URL and hashes? Why would you list those artifacts separately in TUF metadata and force your clients to do two round trips?

jku avatar Sep 22 '22 09:09 jku

This leads to another question: if you have to include more structured data about your artifacts in TUF already, why not include the TARGETINFO data there as well -- I mean the download URL and hashes? Why would you list those artifacts separately in TUF metadata and force your clients to do two round trips?

A simple answer: To allow standardized target file verification without the need for concepts like project, product or version.

lukpueh avatar Oct 03 '22 10:10 lukpueh

But I am talking about the case where project, product and version are needed by the client code to even find the final target it wants to download. The reality is that this approach of listing targets separately (in the case where the client needs the extra structured data anyway, like pip does) leads to more complex client code, larger metadata files and an additional server round trip for every download, as seen in the pip prototypes...

Even with app-specific structured data, the client could still use Updater.download_target() to verify the final targets: the only thing it needs to do is extract the correct TARGETINFO data from the application-specific structured data.
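As an illustration of that extraction step, here is a sketch with a made-up index format. The `index` layout, the example.com URL and the helper names are assumptions. The verification uses plain `hashlib` rather than python-tuf, only to show that the index entry carries everything needed to check a download.

```python
import hashlib

artifact = b"example artifact bytes"

# Hypothetical application-specific index, itself distributed as a TUF target.
index = {
    "project": "foo",
    "versions": {
        "1.0": {
            "url": "https://example.com/foo/foo-1.0.tar.gz",
            "length": len(artifact),
            "hashes": {"sha256": hashlib.sha256(artifact).hexdigest()},
        },
    },
}


def targetinfo_for(index: dict, version: str) -> dict:
    """Pull the TARGETINFO-like entry (url, length, hashes) for one version."""
    return index["versions"][version]


def verify(data: bytes, info: dict) -> bool:
    """Check downloaded bytes against the length and hashes from the index."""
    if len(data) != info["length"]:
        return False
    return all(
        hashlib.new(algorithm, data).hexdigest() == expected
        for algorithm, expected in info["hashes"].items()
    )


info = targetinfo_for(index, "1.0")
print(info["url"], verify(artifact, info))
```

With this layout the client would need one verified download for the index and one for the artifact, with no separate targets metadata lookup per artifact.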

jku avatar Oct 03 '22 10:10 jku

The issue is described in detail in https://docs.google.com/document/d/1rWHAM2qCUtnjWD4lOrGWE2EIDLoA7eSy4-jB66Wgh0o

The original suggestion in the issue description is roughly the "Metadata role (file) as search index" solution in the document.

jku avatar Oct 12 '22 09:10 jku

I guess I should update my current thinking on this.

I think exposing the metadata to clients as described has security implications that may mean this is not a good idea. The fact that a delegated role's metadata contains targetpaths does not mean that those targetpaths have been delegated to the role. So exposing the list as is seems wrong, even if this is documented as unsafe.

The only really safe way to do this would be to run the delegation lookup for each targetpath listed, and only expose it to the client if the targetpath really is delegated to the role in question. This sounds a bit wasteful but in practice might work just fine: in usual cases this would not lead to new metadata downloads and all required metadata would already be loaded in memory.
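A rough sketch of that filtering, again on a simplified model rather than python-tuf itself. `REPO`, `resolve` and `list_delegated_targets` are hypothetical names, and `fnmatch` stands in for TUF's path pattern matching. Here foo-team lists a path under "bar/" that was never delegated to it.

```python
from fnmatch import fnmatch
from typing import Dict, List, Optional

# Hypothetical repository: role name -> listed targets and ordered delegations.
REPO: Dict[str, dict] = {
    "targets": {
        "targets": {},
        "delegations": [
            {"name": "foo-team", "paths": ["foo/*"], "terminating": True},
            {"name": "bar-team", "paths": ["bar/*"], "terminating": True},
        ],
    },
    "foo-team": {
        # The second entry is listed here but is not delegated to foo-team.
        "targets": {"foo/foo-1.0.tar.gz": {}, "bar/evil-1.0.tar.gz": {}},
        "delegations": [],
    },
    "bar-team": {"targets": {"bar/bar-1.0.tar.gz": {}}, "delegations": []},
}


def resolve(repo: Dict[str, dict], target_path: str) -> Optional[str]:
    """Return the role that the normal delegation search trusts for the path."""
    to_visit = ["targets"]  # stack for a pre-order depth-first walk
    visited = set()
    while to_visit:
        name = to_visit.pop()
        if name in visited:
            continue
        visited.add(name)
        role = repo[name]
        if target_path in role["targets"]:
            return name
        children = []
        for delegation in role["delegations"]:
            if any(fnmatch(target_path, p) for p in delegation["paths"]):
                children.append(delegation["name"])
                if delegation["terminating"]:
                    to_visit = []
                    break
        to_visit.extend(reversed(children))
    return None


def list_delegated_targets(repo: Dict[str, dict], role_name: str) -> List[str]:
    """List only the paths in role_name's metadata that resolve back to it."""
    return [
        path
        for path in sorted(repo[role_name]["targets"])
        if resolve(repo, path) == role_name
    ]


print(list_delegated_targets(REPO, "foo-team"))
```

The lookup for "bar/evil-1.0.tar.gz" ends in bar-team, so the entry is dropped from foo-team's exposed list even though foo-team's metadata contains it.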

jku avatar Dec 02 '22 12:12 jku

Linking to my rough branch so it doesn't get lost: https://github.com/jku/python-tuf/commits/list-targets

  • needs tests
  • the delegated role's metadata (or even role name) is never exposed to client application in this approach
  • the original targetpath argument does not need to be an existing targetpath: the last handled delegated role is used in any case

jku avatar Dec 06 '22 09:12 jku