
Specify container description

Open csarven opened this issue 3 years ago • 126 comments

Background: to date, the Solid Protocol (including earlier drafts and issues) only required server-managed containment statements in the representation of a container. Additional information about the contained resources, such as last modification, size, or resource type, was deemed optional or considered a best practice as part of the container representation. Examples in the wild show that some servers make this additional information available, while others do not support it. Some applications use the information if it is available, or work around the limitation to get hold of it [Anecdotal Evidence].

General use case: Support navigation of the container and its contents.

Use cases:

  • Guinan is viewing a list of their social assets (eg. photos, blog posts) and wants to select and view a resource by its human-readable name.
  • Janeway is viewing their inbox and wants to respond to unread notifications from oldest to most recent.
  • Dax is viewing their collection of short-films and wants to delete the ones occupying a significant portion of the available storage quota.
  • Burnham is viewing a list of their crew's personal logs and wants to archive the ones that are created by certain individuals.

Related UCs:

  • https://www.w3.org/TR/ldp-ucr/#uc3
  • https://www.w3.org/TR/ldp-ucr/#uc5
  • https://www.w3.org/TR/ldp-ucr/#uc6

Scenarios to consider:

  • Resources with URIs having non-human-friendly path segments eg. https://example.org/{uuid}
  • Container including mixed resource types eg. containers and non-containers, resources with different formats or media types.
  • Container including public and access controlled resources.

General requirement: Include descriptions about contained resources in container's description to further support navigation and application interaction.

Specific requirements:

  1. Any information (eg. human-readable label of resources) that may be client or server-managed.
  2. Server-managed (controlled) information (eg. last-modified, resource size, resource types for controlled interaction models)

Considerations:

  • Some resources may require authentication and authorization, and in those cases information about those resources must not leak into the container description.
  • Is there empirical data on container response times when certain kinds of information about the contained resources are made available?
  • What would the application UX be like when device and network constraints are taken into account?
  • What's the cost for servers when this data is not used by applications?
  • What information must a server make available in container description (besides containment triples)?
  • Caching, caching, caching..

Related issues:

  • https://github.com/solid/specification/issues/63
  • https://github.com/solid/specification/issues/65
  • https://github.com/solid/specification/issues/116
  • https://github.com/solid/specification/issues/144
  • https://github.com/solid/specification/issues/177
  • ..

Notes:

  • Some servers may be read-only and do not require authentication or authorization, hence, the requirement to check access privileges per resource (in order to expose additional data about the resource) is inapplicable.
  • If additional information about a resource is made available in the container description, how does that affect write operations on the container, eg. does the server ignore statements with certain (server-managed) properties?
  • Instead of the container resource, the associated description resource of a container (ie. target of describedby) could include information about the contained resources. Doesn't violate best practice on self-describing documents per se but it is perhaps not the most intuitive place to look for additional information about the contained resources.
  • Would requiring the client to explicitly request additional information through the Prefer header be meaningful?

csarven avatar Feb 04 '21 14:02 csarven

I find the use cases to include "basic" information about contained resources in the container description compelling. Applications can immediately provide simple functionality by keeping the number of requests/connections minimal. It'd be reasonable to require this level of support on container read operations from servers in order to enable "smart" enough applications to get off the ground without having to resort to more advanced mechanisms.

I would consider last modification and size to be "basic" information. Ditto human-readable label if available. And possibly the creator of the resource. While knowing whether a resource is a container or not (by reading the container description) is very useful, that information can be derived as per shared slash semantics; hence it is not absolutely necessary that the container description includes resource types of contained resources.

csarven avatar Feb 04 '21 14:02 csarven

Can you add a reference to the HTTP/1.1 server specification for the information that is to be available on the server side?

bourgeoa avatar Feb 04 '21 15:02 bourgeoa

I would rephrase the question here to be something more like:

A client needs a mechanism for finding descriptions of contained resources to further support navigation and application interaction.

I disagree that container listing is the best way to do this. A query endpoint (e.g. triple pattern fragments) can achieve the very same end with (arguably) better scalability characteristics.

The basic problem with including this data in a container relates to authorization.

Consider, for example, a container with 100 child resources. A simple GET request to the container will require an access check at the container level. Then 100 subsequent checks would be needed for each child resource. What happens with 1,000 child resources? 10,000 child resources? This does not scale.

The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF.
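A rough sketch of the scaling argument above, assuming one policy evaluation per child resource. `CountingPolicy` and `visible_children` are hypothetical names used only for illustration, and the "URLs ending in 7 are private" rule is made up; nothing here is taken from an actual Solid server.

```python
from typing import Callable, List, Optional


class CountingPolicy:
    """Hypothetical access check that records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, agent: str, url: str) -> bool:
        self.calls += 1
        # Made-up rule for the demo: resources whose URL ends in 7 are private.
        return not url.endswith("7")


def visible_children(
    children: List[str],
    agent: str,
    can_read: Callable[[str, str], bool],
    offset: int = 0,
    page_size: Optional[int] = None,
) -> List[str]:
    """Return readable children, checking only the requested window."""
    end = None if page_size is None else offset + page_size
    return [child for child in children[offset:end] if can_read(agent, child)]


children = [f"https://example.org/photos/{i}" for i in range(10_000)]

unpaged_policy = CountingPolicy()
visible_children(children, "alice", unpaged_policy)
unpaged_checks = unpaged_policy.calls  # one check per child resource

paged_policy = CountingPolicy()
page = visible_children(children, "alice", paged_policy, offset=0, page_size=100)
paged_checks = paged_policy.calls  # bounded by the page size
```

Without a window, the number of checks grows with the container; with a window, it is capped by the page size regardless of how many children exist.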

acoburn avatar Feb 04 '21 15:02 acoburn

best way to do this

for whom? Agree from a server's point of view but not particularly attractive from an application's point of view. It is quite a burden for applications to fetch each resource to get a hold of what they need (along the lines of what's mentioned in the above use cases) in order to provide something usable.

I would consider having to collect the data through a query endpoint relatively more complex than getting it simply from the container representation. Moreover, servers are not required to provide a query endpoint - at the time of this writing - so the basic information wouldn't be consistently available to applications.

If your counter argument/proposal is to address the use cases above by querying, we need to introduce a query mechanism as a hard requirement. (Which would help to meet quite a few other needs, but that's all beside the point.)

This does not scale.

Generally agree but we need empirical data as mentioned. True that a container can theoretically hold an infinite number of resources (I think). Are applications - with the understanding of hierarchical organisation of Solid storage - organising data such that containers with many resources are common (in the wild)? If at all, how is resource organisation or management factored in?

Servers may want to limit the number of members a container can have to a number they are comfortable with. Implementation detail.

Agree on needing pagination as a way to control the cost of a request/response which would be an alternative to above - server fixing the max number of resources allowed per container. Implementation detail.

csarven avatar Feb 04 '21 17:02 csarven

It is quite a burden for applications to fetch each resource to get a hold of what they need

This is not what I am suggesting. I agree that such an interaction is a non-starter: there are way too many HTTP round-trips. A query endpoint allows a client to retrieve all the information it needs in a single request.

This does not scale.

Generally agree but we need empirical data as mentioned.

Here is empirical data for a system that implements the "check every child resource" approach, showing response times in the 60 second range for 10K child resources: https://wiki.lyrasis.org/display/FF/Many+Members+Performance+Testing

acoburn avatar Feb 04 '21 17:02 acoburn

Our definition of a container is this RDFS class called dh:Container.

As you can see, there's a related property dh:select that a container resource has. It points to a SPARQL SELECT query that the client can use to select the children resources of the container. Usually it's an entry point to further client-side query building that sets modifiers (LIMIT/OFFSET/ORDER BY), wraps into DESCRIBE etc.

So for example (prefixes missing):

<photos/> a dh:Container ;
  dh:select <queries/select-children/#this> .
  
<queries/select-children/#this> a sp:Select ;
  sp:text "SELECT ?child WHERE { { ?child sioc:has_parent ?this } UNION { ?child sioc:has_container ?this } }". # ?this is a magic variable which binds to the request URI
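A minimal sketch of the client-side query building described above, assuming the client simply appends solution modifiers to the query text it finds via dh:select. The helper name `with_modifiers` is hypothetical, and how ?this gets bound is left out.

```python
def with_modifiers(select_text: str, order_by: str, limit: int, offset: int = 0) -> str:
    """Append ORDER BY, LIMIT and OFFSET to a SELECT query string."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be positive and offset must not be negative")
    return f"{select_text.rstrip()} ORDER BY {order_by} LIMIT {limit} OFFSET {offset}"


# Query text copied from the example above.
select_children = (
    "SELECT ?child WHERE { { ?child sioc:has_parent ?this } "
    "UNION { ?child sioc:has_container ?this } }"
)

# Second page of 20 children, ordered by their URI.
second_page = with_modifiers(select_children, "?child", limit=20, offset=20)
```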

namedgraph avatar Feb 04 '21 17:02 namedgraph

Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents? For example:

* private resource in a private container
   * unauthorized user can not view anything about the private resource
* private resource in a public container
   * unauthorized user can view size/last-modified/etc. but not GET content of the private resource

This would mean that the server never has to do a mass check of the permissions on its contents but the user would still have the option to hide the server-managed information when that is their intention.

jeff-zucker avatar Feb 04 '21 17:02 jeff-zucker

@acoburn

This is not what I am suggesting.

I know. I said that as the current solution to meet the needs. Querying, pagination or something else is currently not possible (=unspecified).

Thanks re Fedora data, that is useful. It is not easy (for me) to break it down as there are a number of different dimensions with varying values. The test with ~60s is perhaps on the higher end ("perhaps postgres needs caching configured?") - if you can provide more insight on this, that'd be useful. There is a can of worms here re caching of access policies..

Is there something along those lines available for Trellis?

csarven avatar Feb 04 '21 17:02 csarven

@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read?

csarven avatar Feb 04 '21 17:02 csarven

@jeff-zucker

Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents?

No, because each resource (container or other) can have different access controls. The system must not leak any information about contained resources when an agent is unauthorized to read those resources - last modification, size etc. are indeed sensitive and should not be exposed. The most that read access on a container permits is the visibility of the containment statements (just references).

csarven avatar Feb 04 '21 17:02 csarven

@acoburn wrote

The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF.

The LDP group worked quite hard on a spec for paging. See: https://www.w3.org/TR/ldp-paging/

bblfish avatar Feb 04 '21 17:02 bblfish

@csarven

Re: Trellis, that code works as described by @jeff-zucker (authZ decisions are made based on container permissions, not based on access to the child resource). Trellis also does not include any information about the child resources, so it just sidesteps this issue. Consequently, container retrieval is measured in milliseconds.

For Fedora, there was a huge amount of work done related to this issue, and ultimately, many users began finding various work-arounds that just avoided using LDP containment, e.g.:

  • put everything in the root container, block access to that container (since requests would bring down the server) and manually manage all links in the child resources. This approach basically avoids using LDP on an LDP server.
  • create layers of intermediate containers (/container/af/03/21/b8/af0321b8-my-resource) so that no container ever has more than 256 child resources (this is a bit like a really basic paging mechanism though it still requires a lot of round trips)
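The intermediate-container workaround in the second bullet can be sketched as follows, assuming the path segments are taken from the leading characters of the resource identifier, as in the example path. `pairtree_path` is a hypothetical helper, not Fedora code.

```python
def pairtree_path(root: str, resource_id: str, levels: int = 4) -> str:
    """Place a resource under `levels` two-character intermediate containers."""
    prefix = resource_id[: 2 * levels]
    if len(prefix) < 2 * levels:
        raise ValueError("resource id is too short for the requested depth")
    # Split the prefix into pairs: "af0321b8" -> ["af", "03", "21", "b8"].
    segments = [prefix[i:i + 2] for i in range(0, len(prefix), 2)]
    return "/".join([root.rstrip("/"), *segments, resource_id])


path = pairtree_path("/container", "af0321b8-my-resource")
```

When the prefix is hexadecimal, each intermediate container has at most 256 possible child containers, at the price of one extra round trip per level when a client walks the tree.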

In my own experience, the Fedora server just got really, really slow once you had more than a thousand child resources in a single container. There were various attempts to resolve this, but those efforts never really went anywhere with that tech stack. I don't know where things stand these days, but it led to a lot of people abandoning the project.

Re: Query -- I see paging and query as two ways of describing a very similar feature, and they are both really useful.

acoburn avatar Feb 04 '21 19:02 acoburn

@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read?

No ACL for children resources, no (yes for containers themselves), since client-side containers are just UI for certain SPARQL queries, and we don't have ACL for plain SPARQL -- only for Linked Data resources. Once you have SPARQL access, you can pretty much see all the data, so it's a privilege to have.

namedgraph avatar Feb 05 '21 09:02 namedgraph

I recently noticed that ESS does not include the modified time because it's not part of the spec, and that makes apps unusable for large collections. So I'm very happy to see this :). I think my use-case has already been covered in previous comments, but I'll go over it briefly in case it's useful to see it from an app developer's perspective.

What I want to do in my app is reduce the quantity (and size) of network requests. Given that querying is not supported, the solution I've arrived at is caching everything in the client. This makes the first session slower, but makes subsequent sessions faster. It also improves the overall responsiveness of the app, because it doesn't have to make network requests for reading data. However, all of this depends on being able to read only the updates at the start of every session. So far, that's what I've been using the modified time for, and without it I can't think of a way to improve the application start up.
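A minimal sketch of the "read only the updates" step, assuming the container listing exposes a modified time per contained resource and the client keeps the modified time of each cached copy. The function name and the example URLs are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def stale_resources(listing: Dict[str, datetime], cached: Dict[str, datetime]) -> List[str]:
    """Return URLs that are new or have changed since they were cached."""
    return sorted(
        url
        for url, modified in listing.items()
        if url not in cached or modified > cached[url]
    )


def utc(day: int) -> datetime:
    """Shorthand for a UTC timestamp in February 2021 (demo data only)."""
    return datetime(2021, 2, day, tzinfo=timezone.utc)


cached = {
    "https://example.org/notes/a": utc(1),
    "https://example.org/notes/b": utc(1),
}
listing = {
    "https://example.org/notes/a": utc(1),  # unchanged
    "https://example.org/notes/b": utc(3),  # modified after it was cached
    "https://example.org/notes/c": utc(2),  # not cached yet
}

to_fetch = stale_resources(listing, cached)
```

Only the resources in `to_fetch` need a network request at the start of a session; without modified times in the listing, the client has no cheap way to compute this set.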

Something else that would be useful is knowing the types of resources included in the documents. For example, reading the type index I can find containers that include the types of resources I'm interested in. But that doesn't mean that a container doesn't have other types of resources, and I'd like to avoid reading documents that are not relevant to my app.

I understand that doing this can have an impact on server performance, so I don't have strong opinions as to how this information should be retrieved. I think it would make sense to return only containment triples by default, and use some mechanism like headers to indicate what other types of information are relevant.

Re: pagination, I suppose for really large amounts of data it would be necessary. With my current approach it's actually better to get everything in one request, given that I'll want to read all the documents that are relevant to my application (I was actually using globbing before it was deprecated). Pagination would be useful with query support - at that point I may be able to avoid caching everything - but given the current status this is the only viable solution I found.

NoelDeMartin avatar Feb 05 '21 10:02 NoelDeMartin

For the TrinPod server case in authenticating what RDF data to include in a container request:

We use a fully hierarchical authentication scheme that at the lowest level is a single statement, so our server first retrieves all the information that a request would have without authentication, then does an auth check on each statement and keeps those the authenticated user has access to in order to generate the final response. The hierarchical nature of the auth check in combination with the cached acls presents virtually no resource hit on the server side.

On the Application side, in creating our Files app which we are finishing now, we are arriving at the idea that a single request to a container should present enough information for the user to intelligently decide what they want to do next, such as expand a child branch of that container. So we would be very happy to support any proposed standards about what to include as part of a container request to improve the UX. I think the paging issue that @acoburn brings up is also very important, so a standard around that would be great too.

At the moment, as standards aren't yet in place, for TrinPod we are including in the response to a container request: all the child nodes of the container with ldp:contains, and then the ldp:contains of those child nodes as well as the last event triples around the content in the requested container (such as any schma:UpdateAction around that content), of course all filtered by user access permissions.

gibsonf1 avatar Feb 05 '21 13:02 gibsonf1

  • https://www.w3.org/TR/ldp-paging/
  • https://www.w3.org/TR/activitystreams-core/#paging

Created issue for resource paging: https://github.com/solid/specification/issues/230

csarven avatar Feb 05 '21 13:02 csarven

@csarven I vote to make those two specs part of the Solid standard - but I think also needed would be a recommendation for how many items to include in a given page

gibsonf1 avatar Feb 05 '21 14:02 gibsonf1

@gibsonf1 If paging is required, I can't see why more than one mechanism is needed. The number of items to include for a paged resource would either be a client preference included in the request, which a server may agree to, or simply the server's own choice (implementation detail).

csarven avatar Feb 07 '21 16:02 csarven

It would be worth having a comparison between both.

bblfish avatar Feb 07 '21 18:02 bblfish

I'm catching up here, and I appreciate that this is a summarization of several different things, and so I don't think it serves to pose this as a single question.

What I'm seeing here are at least these problems:

  1. Augment the data in the container with data to enable apps to present a summary view to the user.
  2. Augment the containment triples with minimal metadata that clients are likely to find useful to perform well.
  3. Ensure that the above data isn't exposed without authorization.

The first case is essentially a generalization of the Data Browser behavior where it looks for index.ttl to augment the view. I believe that this should be solved by having a predicate (e.g. rdfs:seeAlso or a subproperty thereof) in the container representation that points towards a resource that the client should get to do it. The applications will have to deal with authz so that no user gets data they shouldn't get, but I think that is the best solution anyway, as in many cases it may be OK to show a title and a thumbnail, but nothing more. We shouldn't place too many restrictions on this from the spec side.

Number 2 is essentially what we have referred to elsewhere as a File Scan operation. We haven't set down what a File Scan operation is, but in the context of Solid it is pretty clear that a File Scan operation is to read the contents of a container, and that it now requires read privileges on the container, which should be adequate for now.

It is very interesting to read that @gibsonf1 has an implementation that performs well when checking access control for a tree, but in the interest of having a spec that many can implement, at least in the initial versions, I think it is correct to assume that it is rather hard to achieve that performance, as @acoburn has experienced. Thus, at least initially, we should make sure that a File Scan operation can be done with read privileges on the container only. Anything beyond that is not a File Scan operation.

Then, the question becomes what information a File Scan operation can legitimately expose. I think the above discussion and @acoburn's comment in #116 make it very clear that at least the containment triples are a part of the container representation. If you need the hidden file case, then you need to make a child container and then have other permissions on that.

My opinion, at least right now, is that some other attributes, like mtime, type and size, could be a part of the container representation in a File Scan operation. Again, if you need to protect those, make a container with different permissions.

There's also some precedent for this: Apache has a default index that exposes mtime and size by default.

In conclusion, number 2 above is the File Scan operation, which maps to a read operation on the container in Solid, which exposes containment triples, size, type and mtime as well as other server managed and client managed metadata.

But, there's more! ;-)

It could be argued that computing mtime and size is too heavy for most users, so we shouldn't give that unless people ask for it. For that, I suggest we look into defining and registering a Prefer header preference. With this, clients could for example request the container with Prefer: return=full, which would give them the full representation, including the mtime, size and type. Effectively, this would make it optional for servers to support it, but that's OK.
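A minimal sketch of what such a client request could look like, using Python's standard library and sending nothing over the network. Note that return=full is the hypothetical preference suggested in this comment; the return values registered by RFC 7240 are return=representation and return=minimal. The function name and URL are assumptions.

```python
from urllib.request import Request


def container_request(url: str, full: bool = False) -> Request:
    """Build a GET request for a container, optionally asking for extra metadata."""
    request = Request(url, method="GET")
    request.add_header("Accept", "text/turtle")
    if full:
        # Hypothetical preference from this discussion, not a registered value.
        request.add_header("Prefer", "return=full")
    return request


default_request = container_request("https://example.org/photos/")
full_request = container_request("https://example.org/photos/", full=True)
```

A server that does not understand the preference can ignore it and return its default representation, which is what keeps support optional.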

kjetilk avatar Jun 24 '21 10:06 kjetilk

Alternative to Prefer: https://datatracker.ietf.org/doc/html/draft-svensson-profiled-representations-00 (https://www.w3.org/TR/dx-prof-conneg/)

RubenVerborgh avatar Aug 10 '21 12:08 RubenVerborgh

After discussion at the editorial session of 2021-08-24, I will summarize my points on this issue below.

Considerations

  • To decide what MUST be included in descriptions, we need to consider:
    • the desirability of that information to be included in a description depending on different access permissions.
      • Concretely, imagine the worst case: Read on container but no permissions on any of its children.
    • the cost of creating the information on the server side
    • the applicability of certain metadata
      • If we mandate size, how is this calculated if the backend is a SPARQL endpoint? Number of triples, or size when serialized in Turtle?
      • If we mandate last-modified, how is this determined if the backend is an IoT device/sensor that does not store this data?
    • The allowed margin of error
      • size: exact to the byte?
      • last-modified: exact to the (milli-)second? are estimates or heuristics allowed if we don't know?
  • We would need clear use cases to understand why certain fields are needed
    • Are size and last-modified being used as a proxy for etag? If so, why not expose etag?

My personal conclusion

Given the above considerations, I would propose:

  • we first standardize the absolute minimum set of information, and can extend later if there are use cases
  • concretely, we only mandate contains
  • we allow other metadata at the server's discretion

I understand that there are desires for more complex parts of the description, but I think process-wise, we can make progress by creating new issues for every additional part of metadata we would like to mandate beyond the minimum set of contains. Then we essentially break down this one large issue into specific issues, while already having the absolute minimum spec'ed.

RubenVerborgh avatar Aug 26 '21 11:08 RubenVerborgh

@RubenVerborgh I understand that but I think it is also going to be quite process heavy.

Could you not instead bundle all those issues together and then categorise those in different ways: ldp:contains as that closest to the core, then group the others into major application areas, and find out who supports them. It would be good if there were a document that would at least give some ideas as to which parts fit together, and how widespread they are, which servers have implemented them. Then one would know who to ask regarding their implementation experience.

bblfish avatar Aug 26 '21 11:08 bblfish

I would like to put in a word for mime-type support. Example use case : The databrowser looks at the triples of an NSS container and finds all contained resources whose media type matches iana/image and if some are found, presents a slideshow button to view them all. That's not possible on CSS which does not add such data to the container. It's hard to imagine that divulging that foo.png has a media-type containing iana/image would give away sensitive information.
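A minimal sketch of the slideshow check, assuming the container description exposes each child's media type as a type IRI under http://www.w3.org/ns/iana/media-types/ (the "iana/image" match mentioned above). The input is a plain dictionary rather than parsed RDF, and all names and URLs are hypothetical.

```python
from typing import Dict, List

IANA_MEDIA_TYPES = "http://www.w3.org/ns/iana/media-types/"


def image_children(types_by_child: Dict[str, List[str]]) -> List[str]:
    """Return children that carry at least one image/* media type IRI."""
    image_prefix = IANA_MEDIA_TYPES + "image/"
    return sorted(
        child
        for child, types in types_by_child.items()
        if any(type_iri.startswith(image_prefix) for type_iri in types)
    )


container = {
    "https://example.org/photos/foo.png": [IANA_MEDIA_TYPES + "image/png#Resource"],
    "https://example.org/photos/notes.ttl": [IANA_MEDIA_TYPES + "text/turtle#Resource"],
    "https://example.org/photos/bar.jpg": [IANA_MEDIA_TYPES + "image/jpeg#Resource"],
}

slides = image_children(container)
show_slideshow_button = len(slides) > 0
```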

jeff-zucker avatar Aug 28 '21 13:08 jeff-zucker

I have two concerns:

There could be some security or performance concerns around adding certain types of metadata, so leaving it entirely up to the server without public consultation like we do here could be problematic. Also, this kind of variability could cause interop problems. At present, the ecosystem is rather small, so I believe we can do it for now.

Secondly, my favorite field is query evaluation across Solid data, and I believe that Computer Science simply does not have the empirical or epistemological strength to create generalizable knowledge that will clearly guide us in this area. So, the fallback is then to add stuff and cross fingers ;-)

It just occurred to me that adding mtime will cause a cascade towards the root, as all containers up to the root will have to be updated with that mtime as a result. That's probably not behavior we'd like to encourage right now, so perhaps that wasn't such a great idea anyway.
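To make the cascade concrete, here is a small sketch that lists the containers that would be touched by one write, assuming slash-based containment and that every ancestor's representation changes when a descendant's mtime changes. The helper name is hypothetical.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit


def ancestor_containers(url: str) -> List[str]:
    """Return the containers from the resource's parent up to the storage root."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    ancestors = []
    while path:
        # Drop the last path segment to move one level up.
        path = path[: path.rfind("/")]
        ancestors.append(urlunsplit((parts.scheme, parts.netloc, path + "/", "", "")))
    return ancestors


touched = ancestor_containers("https://example.org/photos/2021/trip/img1.png")
```

For a resource four levels deep, a single write would imply four container updates under this assumption.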

In the interest of progress, I think we can go with containment triples as the only requirement for now, but that we say that servers MAY include other metadata. Then, we add a section in the security concerns about the metadata, and we start opening other issues about metadata that servers MUST include, and I think that @jeff-zucker is right that media type is a good candidate for that.

We still have the issue of data augmentation to deal with here though, i.e. the old index.ttl mechanism, which I believe should be dealt with using the rdfs:seeAlso predicate.

kjetilk avatar Aug 30 '21 10:08 kjetilk

I have found a way in Java to get access to the metadata at the same efficiency as getting the file name listings in a directory: https://stackoverflow.com/questions/66699379/how-to-get-streams-of-file-attributes-from-the-filesystem/66713743#66713743 (I think. It would be worth testing this out just to make absolutely sure that the speed is equivalent.)

This is the data you can get access to: https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/nio/file/attribute/BasicFileAttributes.html

bblfish avatar Aug 30 '21 10:08 bblfish

@jeff-zucker

I would like to put in a word for mime-type support.

+1, I think that mime-type would be incredibly useful. (So, specifically, containment + mime-type being the minimum mandated fields.)

dmitrizagidulin avatar Aug 31 '21 12:08 dmitrizagidulin

We did discuss this further in the Solid Editors meeting today, but we have noted that we haven't yet reached a rough consensus.

Two things seem very clear though:

  1. We can't require a server to look up access controls for child resources as that would complicate servers substantially.
  2. There seems to be quite broad agreement that metadata can leak information and so create a security concern.

We found that these two observations by themselves do not require any changes to the current specification, but they also do little to address the original concern of this issue.

This does suggest to me that the container resource and the metadata need to have potentially different authorizations, so that it is up to a client with Control privileges to decide whether an agent gets to see the metadata or not.

This again implies that it must either be configurable where the metadata goes, or it would need to go into a separate auxiliary resource. It makes sense to leave some variability here. For a server that does not share the security concern, it is OK to have the metadata in the container description itself, but it must then be aware that it can't be influenced by the client. I therefore think we should look in the direction of having an augmentation resource #144 that has its own access control, as suggested in #306.

By default, metadata should go to such an augmentation resource, but it could be configurable to allow all or some to be present in the container.

That's my current opinion.

kjetilk avatar Aug 31 '21 23:08 kjetilk

This does suggest to me that the container resource and the metadata need to have potentially different authorizations, so that it is up to a client with Control privileges to decide whether an agent gets to see the metadata or not.

That sounds like a very good idea.

When a security problem appears for which there are use cases where it does not matter and indeed where information wants to be shared and also use cases where it does matter and information must be tightly controlled, make the settings configurable. Then one can also develop guidelines for different situations.

bblfish avatar Sep 01 '21 06:09 bblfish

An important clarification to @jeff-zucker's point:

The databrowser looks at the triples of an NSS container and finds all contained resources whose media type matches iana/image and if some are found, presents a slideshow button to view them all. That's not possible on CSS which does not add such data to the container.

It is not impossible, just slower; you can always HEAD every single item. I know that it is not practical etc., but we need to distinguish between "impossible" and "performance optimization". For instance, not listing children makes them undiscoverable and thus in some cases literally impossible to access.

Premature optimization can lead to suboptimal designs. Let's get things to work first, and then optimize as needed.
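The "HEAD every single item" fallback mentioned above can be sketched like this, assuming the client already has the containment list. The requests are only built, not sent; in a real client the media type would be read from each response's Content-Type header. The function name and URLs are hypothetical.

```python
from typing import List
from urllib.request import Request


def head_requests(child_urls: List[str]) -> List[Request]:
    """Build one HEAD request per contained resource (nothing is sent here)."""
    return [Request(url, method="HEAD") for url in child_urls]


children = [f"https://example.org/photos/{i}.png" for i in range(3)]
requests_needed = head_requests(children)
```

The cost is one round trip per child, which is why this works but is a performance concern rather than a missing capability.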

RubenVerborgh avatar Sep 01 '21 07:09 RubenVerborgh

What I meant is that it isn't possible to get the information from reading the container in CSS, not that it is impossible in general.

jeff-zucker avatar Sep 01 '21 07:09 jeff-zucker

re efficiency optimization see my answer above. In Java I think one can get the following metadata as easily as the file listings. That should not be surprising, given that the OS will be storing the name of a file very close to where it has all that other information too. Also note I have seen it argued that solid state drives have completely transformed the relation between processor speed and disk speed to the point that disk speed is now faster than what processors can cope with. The point is optimization requirements are important, but they can also be evaluated empirically.

Modifier and Type    Method                Description
FileTime             creationTime()        Returns the creation time.
Object               fileKey()             Returns an object that uniquely identifies the given file, or null if a file key is not available.
boolean              isDirectory()         Tells whether the file is a directory.
boolean              isOther()             Tells whether the file is something other than a regular file, directory, or symbolic link.
boolean              isRegularFile()       Tells whether the file is a regular file with opaque content.
boolean              isSymbolicLink()      Tells whether the file is a symbolic link.
FileTime             lastAccessTime()      Returns the time of last access.
FileTime             lastModifiedTime()    Returns the time of last modification.
long                 size()                Returns the size of the file (in bytes).
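For comparison, a rough Python analogue of reading such attributes while listing a directory, using os.scandir from the standard library. This is not the Java approach linked above, and whether the attributes come at no extra cost depends on the platform and file system, so it would need measuring. The function name and the output field names are assumptions.

```python
import os
import tempfile
from datetime import datetime, timezone
from typing import Dict, List


def describe_directory(path: str) -> List[Dict[str, object]]:
    """List directory entries together with basic file attributes."""
    rows = []
    with os.scandir(path) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            rows.append({
                "name": entry.name,
                "is_container": entry.is_dir(follow_symlinks=False),
                "size": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
            })
    return sorted(rows, key=lambda row: row["name"])


# Demo on a throwaway directory with one sub-directory and one five-byte file.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "albums"))
    with open(os.path.join(root, "note.txt"), "w", encoding="utf-8") as handle:
        handle.write("hello")
    rows = describe_directory(root)
```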

bblfish avatar Sep 01 '21 08:09 bblfish

In Java I think one can get the following metadata as easily as the file listings.

Non-distributed filesystems are just one possible backend though; there are many others.

Perhaps we should list all of these, and work out which properties they can make available and at what cost. (I have just provided one data point, to argue that some metadata may not be that expensive on simple file systems that may be used in a very wide range of cases.)

Then one should consider which apps need such data, and why: what are their requirements? After all it is only good apps that will make the Solid ecosystem grow.

One should not forget that one could later make optimisations such as using SPARQL to query a container and that if the need is there one can optimise with indexes...

bblfish avatar Sep 01 '21 08:09 bblfish