
Paged and unpaged collections ought to be distinct and disjoint

Open trwnh opened this issue 9 months ago • 12 comments

Preface: this is pretty firmly next version stuff.

As a quick refresher:

  • Collection is the base class for representing collections, which are an indirection mechanism and container for some items considered to be part of the collection.
  • A Collection can have items, but it can also have first/last/current.
  • A Collection with items can be considered to be either an unpaged Collection in its entirety, or otherwise it is a CollectionPage that is conceptually part of some larger Collection.
  • A CollectionPage can have next/prev. Without these, the only way to differentiate between an "unpaged collection" and a "collection page" is to explicitly declare a type.
  • CollectionPage inherits from Collection, so more precisely the question is whether it is a "page" or not.

Taking a look at it from the perspective of properties and their domains/ranges:

  • Something with items is a Collection
  • Something with first/last/current is currently also considered to be a Collection, but it is specifically a "paged collection" and not simply a "collection".

We note that:

  • It makes no sense to have a collection that is both unpaged and paged, i.e., it makes no sense to have a collection with both items and first/last/current. This defeats the entire purpose of paging.

Given the conclusion above that a collection cannot or should not be both unpaged and paged at the same time, does it then make sense to say that we ought to reify this with a distinct PagedCollection type? At the very least, there's something that doesn't quite align, and we ought to think about if anything can be done to reconcile this.

I think that the intent is to have some URI resolve to an object that can be dynamically paged or unpaged, but the model doesn't fully support that. It could be made to support that if we recognized a distinction between a Collection and a PagedCollection, though. Clients could even request or negotiate whether they prefer or accept a paged or unpaged collection?

trwnh avatar Feb 09 '25 05:02 trwnh

Whether or not to respond to a request with a paged collection has to be 100% in the server's control, because it's an important availability goal of the server to prevent any one client from using up too many resources. So we shouldn't have any sort of way for clients to indicate or negotiate which one they prefer; all clients need to be able to handle both.

I think that the intent is to have some URI resolve to an object that can be dynamically paged or unpaged, but the model doesn't fully support that

Why not? Clients should be able to handle the presence/absence of the first property accordingly.

nightpool avatar Feb 10 '25 16:02 nightpool

Yes, the Client "should be able to handle [it]", but that's not the point of the issue. The point is that a paged collection and an unpaged collection are different things and it doesn't make sense to have a collection that is both paged and unpaged.

Think through the example more concretely. "Presence/absence of the first property" is only part of the picture. What does it mean if the dereferenced object has both first and items? Does the collection contain what's in items, or does it contain what's in first/(next*)/items? The former? The latter? Both? Saying that a Collection can have items and also first/last/current is confusing. The latter 3 properties indicate that the collection is actually a paged collection.
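
To make the ambiguity concrete, here is a minimal Python sketch that labels a dereferenced object by which properties it carries. It is not part of AS2 or of any existing library: the function name `classify_collection`, the labels it returns, and the `?page=1` URL are all invented for illustration.

```python
PAGING_PROPERTIES = ("first", "last", "current")
ITEM_PROPERTIES = ("items", "orderedItems")


def classify_collection(obj: dict) -> str:
    """Label a collection-like object by the properties it carries.

    The labels are illustrative only; AS2 itself draws no such distinction.
    """
    has_items = any(key in obj for key in ITEM_PROPERTIES)
    has_paging = any(key in obj for key in PAGING_PROPERTIES)
    if has_items and has_paging:
        return "ambiguous"  # which of the two is "the" contents?
    if has_paging:
        return "paged"
    if has_items:
        return "unpaged"
    return "unknown"  # e.g. only totalItems is present


# Hypothetical object that carries both items and a paging pointer.
both = {
    "type": "Collection",
    "items": ["https://alice.example/posts/1"],
    "first": "https://postparty.example/conversations/1/posts?page=1",
}
print(classify_collection(both))  # ambiguous
```

The "ambiguous" branch is exactly the case the current model allows without saying which answer is right.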


Whether or not to respond to a request with a paged collection has to be 100% in the server's control, because it's an important availability goal of the server to prevent any one client from using up too many resources.

This is fine, but

So we shouldn't have any sort of way for clients to indicate or negotiate which one they prefer

This does not follow from the above proposition. In server-driven content negotiation, the server does control what is returned, but that doesn't stop the client from signaling what they prefer. For example, if the client provides an Accept* header listing some preferred representation, the server can respond how it wants -- either return a default representation, or return a 406 Not Acceptable.
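
A rough Python sketch of that point, with the server keeping the final say. Everything here is hypothetical: AS2 defines no such preference signal, and `choose_representation`, the `"paged"`/`"unpaged"` preference values, and the `max_unpaged_items` limit are made up for illustration.

```python
from typing import Optional


def choose_representation(
    client_preference: Optional[str],
    total_items: int,
    max_unpaged_items: int = 1000,
) -> str:
    """Return "paged" or "unpaged"; the client's preference is only a hint."""
    # The server's resource limit always wins over what the client asked for.
    if total_items > max_unpaged_items:
        return "paged"
    if client_preference in ("paged", "unpaged"):
        return client_preference
    # No (or unrecognized) preference: fall back to the server's default.
    return "unpaged"


print(choose_representation("unpaged", total_items=5000))  # paged
print(choose_representation("paged", total_items=10))      # paged
print(choose_representation(None, total_items=10))         # unpaged
```

This sketch takes the "return a default representation" route when the preference cannot be honored; a stricter server could answer 406 Not Acceptable instead.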

I would say that the server should have a clean way of being able to offer both paged and unpaged representations. But at the very least, it should be clear when a collection is paged or unpaged. It should not be possible for a collection to be both paged and unpaged.

trwnh avatar Feb 11 '25 08:02 trwnh

Wouldn't a multi-modal collection just be one that has both items and first/last/current (as top-level Links)?

To me it makes a lot of sense, as clients that want a bulk download can prioritise items and clients that want to page through can prioritise first/last/current at no additional effort compared to supporting both types as-is. (Perhaps it should be documented that this is something servers may do and that the endpoint indicated by items should respond with all items at once if present.)

A good example for a collection that could be multi-modal are someone's followers, as a server may want to synchronise that in bulk sometimes while a client (or on-demand proxy) would likely prefer to page through.

Tamschi avatar Feb 11 '25 10:02 Tamschi

@Tamschi if items is present, then there's ostensibly no need to page through. It's not an endpoint, it's an array []. If the paging properties are there and you follow them, all you're doing is making completely useless and unnecessary HTTP requests, because the first HTTP request has all the information you need. On the other hand, if the intent is to page through using the paging properties, then items is a waste of bandwidth and size.

In fact, having both items and paging properties is something that opens the door to incoherence and inconsistency. What if the set of items is different than what you get by paging? What does the collection actually contain -- the unpaged items, or the paged items?

trwnh avatar Feb 11 '25 10:02 trwnh

@Tamschi if items is present, then there's ostensibly no need to page through. It's not an endpoint, it's an array []. If the paging properties are there and you follow them, all you're doing is making completely useless and unnecessary HTTP requests, because the first HTTP request has all the information you need. On the other hand, if the intent is to page through using the paging properties, then items is a waste of bandwidth and size.

In fact, having both items and paging properties is something that opens the door to incoherence and inconsistency. What if the set of items is different than what you get by paging? What does the collection actually contain -- the unpaged items, or the paged items?

No, that's incorrect if I'm not completely misreading the spec. The type of items is "Object | Link | Ordered List of [Object | Link ]" as of https://www.w3.org/TR/activitystreams-vocabulary/#dfn-items, so the ordered list could well be (in) another document.

Tamschi avatar Feb 11 '25 10:02 Tamschi

Range is the value. When something has a range of "ordered list", it is not another document, it is a JSON-LD @list, which in plain JSON is also an array []. Generally, JSON-LD has both @set and @list, with the default being unordered @set, but you can define certain terms to be ordered @list.

trwnh avatar Feb 11 '25 10:02 trwnh

In that case I agree with @trwnh, insofar as it would be very helpful to be able to cleanly present both access options unambiguously.

In practice there is a range of collections where it's by far most efficient in terms of server load to offer both and let the client choose, even if the client has to support either.

Tamschi avatar Feb 11 '25 10:02 Tamschi

The point is that a paged collection and an unpaged collection are different things and it doesn't make sense to have a collection that is both paged and unpaged.

I'm not sure I agree with this. I don't think a kind of "unpaged" collection should be a thing. The only reason I can think of using them is intentionally-limited collections like "pinned posts" but even in that case it's such a rare case that I don't think it's worth worrying about. Similarly, I'm not sure I understand why a collection would ever be both "paged and unpaged", and I don't see any motivating use case that would require excluding that from the spec.

nightpool avatar Feb 13 '25 00:02 nightpool

I don't think a kind of "unpaged" collection should be a thing. The only reason I can think of using them is intentionally-limited collections like "pinned posts" but even in that case it's such a rare case that I don't think it's worth worrying about.

Unpaged collections make a lot of sense when you want a single resource, e.g. to represent and easily synchronize members of an indirect set but without directly exposing it as a @set ([]).

For example, consider a collection representing posts in a conversation as moderated by some authority:

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://postparty.example/conversations/1/posts",
  "type": "Collection",
  "summary": "Posts included in conversation 1.",
  "items": [  // an array representing a @set of objects to be included by their id
    "https://alice.example/posts/1",
    "https://bob.example/posts/2",
    // ...
  ],
  "totalItems": 973
}

You can GET this single resource and cache it for some TTL. You can apply access control if you don't want this information to be fully public. And so on. All of this happens outside the boundary of the posts themselves, and all it takes is 1 HTTP GET instead of almost 100 requests (at a page size of 10) or worst-case 974 requests (at a page size of 1). It can be served by a static file that gets mutated whenever the collection is modified.
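
The request counts above can be checked with a small Python sketch. It assumes one GET for the collection document itself plus one GET per page, with no page embedded in the collection; `requests_needed` is a name made up for this example.

```python
import math


def requests_needed(total_items: int, page_size: int) -> int:
    """One GET for the collection itself plus one GET per page."""
    return 1 + math.ceil(total_items / page_size)


print(requests_needed(973, 10))  # 99
print(requests_needed(973, 1))   # 974
```

An unpaged collection is always 1 request, regardless of `totalItems`.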

In general, any collection that serves small-scale synchronization duties is going to benefit from being unpaged. Paging is extra overhead that not every server is going to want to do, or even necessarily be capable of doing. There is value in having a single resource that requires only 1 HTTP GET and minimal processing.


Arguably, the paging mechanism we have is more generally a weakness compared to the alternative way of splitting a set into RDF statements that together describe a set and what it contains. In a more RDF-y model, you could stream an arbitrary number of statements line-by-line:

<users/0> as:IsFollowedBy <users/1>.
<users/0> as:IsFollowedBy <users/2>.
<users/0> as:IsFollowedBy <users/3>.
<users/0> as:IsFollowedBy <users/4>.
# and so on...

In JSON, we could stream this, say, 2 at a time:

{
  "id": "users/0",
  "IsFollowedBy": [
    "users/1",
    "users/2"
  ]
}
{
  "id": "users/0",
  "IsFollowedBy": [
    "users/3",
    "users/4"
  ]
}

The change in mindset is that instead of replacing the prior data, we instead merge:

{
  "id": "users/0",
  "IsFollowedBy": [
    "users/1",
    "users/2",
    "users/3",
    "users/4"
  ]
}
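
A minimal Python sketch of that merge step, assuming every chunk is a JSON object with an `id` and list-valued properties as in the example above. The function name `merge_chunks` is hypothetical.

```python
def merge_chunks(chunks: list[dict]) -> dict[str, dict]:
    """Merge streamed chunks by "id", unioning list values instead of replacing them."""
    merged: dict[str, dict] = {}
    for chunk in chunks:
        node = merged.setdefault(chunk["id"], {"id": chunk["id"]})
        for key, values in chunk.items():
            if key == "id":
                continue
            existing = node.setdefault(key, [])
            for value in values:
                if value not in existing:  # set semantics: no duplicates
                    existing.append(value)
    return merged


stream = [
    {"id": "users/0", "IsFollowedBy": ["users/1", "users/2"]},
    {"id": "users/0", "IsFollowedBy": ["users/3", "users/4"]},
]
print(merge_chunks(stream)["users/0"]["IsFollowedBy"])
# ['users/1', 'users/2', 'users/3', 'users/4']
```

Because each chunk only adds statements about `users/0`, the consumer never needs to know where one "page" ends and the next begins.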

I leave it as an exercise for the reader to fully work out the particulars of how such a streaming system could replace a paging system, but suffice to say that the act of reifying a set into a Collection is a kind of indirection, which allows us to describe that set... but if you go so far as to page that Collection, you can then reify the CollectionPage in a second-level indirection. If we're doing all that, then we ostensibly don't have a simple set anymore. We've got two layers of indirection between the set and what it contains -- a Collection has CollectionPage, and then a CollectionPage has items, and we are supposed to somehow know (out-of-band) that the Collection "contains" the items of its linked pages? It would be more correct and straightforward to say that we no longer have a direct set, but rather, a linked list of indirect sets. The equivalence between the paged collection and its equivalent unpaged collection is not an identity equivalence. They're different things. We've given identity to each of those pages, and the sum total of the pages is what we could call a PagedCollection. We are dealing with that linked list separately from how we deal with the accumulated items within the conceptual "set of all items".
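
The "linked list of indirect sets" can be sketched in Python as a traversal over `first`/`next`. This is only an illustration under stated assumptions: the `documents` dict stands in for HTTP GETs, `first` and `next` are assumed to be plain ids rather than embedded objects, and the ids and the name `collect_items` are hypothetical.

```python
def collect_items(collection_id: str, documents: dict[str, dict]) -> list[str]:
    """Follow first/next links and accumulate the items of every page."""
    collection = documents[collection_id]
    items: list[str] = []
    seen: set[str] = set()
    page_id = collection.get("first")
    while page_id is not None and page_id not in seen:
        seen.add(page_id)  # guard against cycles in the linked list
        page = documents[page_id]
        items.extend(page.get("items", []))
        page_id = page.get("next")
    return items


# In-memory stand-in for three separately dereferenced resources.
DOCUMENTS = {
    "users/0/followers": {
        "type": "Collection",
        "totalItems": 4,
        "first": "users/0/followers?page=1",
    },
    "users/0/followers?page=1": {
        "type": "CollectionPage",
        "partOf": "users/0/followers",
        "items": ["users/1", "users/2"],
        "next": "users/0/followers?page=2",
    },
    "users/0/followers?page=2": {
        "type": "CollectionPage",
        "partOf": "users/0/followers",
        "items": ["users/3", "users/4"],
    },
}
print(collect_items("users/0/followers", DOCUMENTS))
# ['users/1', 'users/2', 'users/3', 'users/4']
```

Note that the rule "the collection contains whatever its pages contain" lives only in `collect_items`; nothing in the three documents states it.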


So if we want to retain the data model of AS2 Collections and CollectionPages, we need to make a slight modification to the hierarchies:

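@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix as: <https://www.w3.org/ns/activitystreams#> .
# Placeholder namespace for the proposed terms; it is an assumption, not a real vocabulary.
@prefix : <https://example.org/proposed-collections#> .

# The base model of a `Collection` is that all collections can have `totalItems`.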
as:Collection a rdfs:Class.
as:totalItems rdfs:domain as:Collection.

# Here we introduce an `UnpagedCollection` which can have `items`
:UnpagedCollection rdfs:subClassOf as:Collection.
as:items rdfs:domain :UnpagedCollection.

# Here we introduce a `PagedCollection` which can have pointers to pages
:PagedCollection rdfs:subClassOf as:Collection.
as:first rdfs:domain :PagedCollection.
as:last rdfs:domain :PagedCollection.
as:current rdfs:domain :PagedCollection.

# We explicitly declare that a collection cannot be both paged and unpaged.
:PagedCollection owl:disjointWith :UnpagedCollection.
:UnpagedCollection owl:disjointWith :PagedCollection.

# `CollectionPage` is mostly the same but now inherits from `UnpagedCollection`.
# This is because pages can have items, but ideally cannot themselves be paged.
# (Do we really want to deal with hierarchical pages of unbounded depth?)
as:CollectionPage rdfs:subClassOf as:Collection, :UnpagedCollection.
as:partOf rdfs:domain as:CollectionPage.
as:next rdfs:domain as:CollectionPage.
as:prev rdfs:domain as:CollectionPage.

# Finally, the Ordered classes can be left mostly unchanged, although
# I would say that they should utilize multityping instead of inheritance.
# That is to say, a collection can be both `UnpagedCollection` and `OrderedCollection`,
# but it would be awkward to define and use a class of `OrderedUnpagedCollection` and so on.
as:OrderedCollection a rdfs:Class.
as:OrderedCollectionPage rdfs:subClassOf as:OrderedCollection, as:CollectionPage.
as:startIndex rdfs:domain as:OrderedCollectionPage.

For anyone who can't read Turtle, this is basically saying that:

  • Collection retains totalItems
  • items is split off into UnpagedCollection
  • first/last/current is split off into PagedCollection
  • CollectionPage now also inherits from UnpagedCollection (because a page needs items, and it's a bad idea to allow for pages to be recursively paged)

Everything else is basically unchanged. We hold off on defining OrderedUnpagedCollection because it should be possible to just multitype/compose ["OrderedCollection", "UnpagedCollection"]. In other words, let OrderedCollection be a mixin or interface that declares the items to be ordered strictly.
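
As a rough illustration of what these declarations would let a consumer check, here is a Python sketch that infers types from property domains and flags the disjointness violation. It is not a real RDFS/OWL reasoner; the tables simply transcribe the proposal above, and the helper names `inferred_types` and `violations` are made up.

```python
PROPERTY_DOMAINS = {
    "totalItems": "Collection",
    "items": "UnpagedCollection",
    "first": "PagedCollection",
    "last": "PagedCollection",
    "current": "PagedCollection",
    "partOf": "CollectionPage",
    "next": "CollectionPage",
    "prev": "CollectionPage",
    "startIndex": "OrderedCollectionPage",
}
SUPERCLASSES = {
    "UnpagedCollection": {"Collection"},
    "PagedCollection": {"Collection"},
    "CollectionPage": {"Collection", "UnpagedCollection"},
    "OrderedCollectionPage": {"OrderedCollection", "CollectionPage"},
}
DISJOINT = [("PagedCollection", "UnpagedCollection")]


def inferred_types(obj: dict) -> set[str]:
    """Collect declared types, property-domain types, and all their superclasses."""
    declared = obj.get("type", [])
    if isinstance(declared, str):
        declared = [declared]
    pending = list(declared)
    pending += [PROPERTY_DOMAINS[key] for key in obj if key in PROPERTY_DOMAINS]
    types: set[str] = set()
    while pending:
        current = pending.pop()
        if current not in types:
            types.add(current)
            pending.extend(SUPERCLASSES.get(current, ()))
    return types


def violations(obj: dict) -> list[tuple[str, str]]:
    """Return every disjoint pair of classes that the object falls into at once."""
    types = inferred_types(obj)
    return [pair for pair in DISJOINT if pair[0] in types and pair[1] in types]


print(violations({"items": [], "first": "page-1"}))
# [('PagedCollection', 'UnpagedCollection')]
print(violations({"type": ["OrderedCollection", "UnpagedCollection"], "items": []}))
# []
```

A `CollectionPage` that carries `first` is flagged too, because pages inherit from `UnpagedCollection`; that is the "pages cannot themselves be paged" rule.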

Side note: It's pretty messy that orderedItems is the same predicate IRI as items, because conceptually they are two different things. The former is a list, the latter is not. They should be separate properties, with items having a range of anything, but orderedItems having a range of rdf:List... But that's kind of a separate issue...

trwnh avatar Feb 13 '25 08:02 trwnh

We talked about this a lot today in the Forum/Threaded Discussions Task Force meeting. Short form: there should be no conception of a "paged collection" nor any other "paged resource". Whether delivery of any resource is paged or not should be negotiated between the consumer (who knows how much they can take, e.g., a Raspberry Pi can't take much) and the server (who may not know how big the total is, but should be able to throttle or page in various ways). HTTP (and LDP and Solid and ...) provides all the tools needed to make this work.

TallTed avatar Feb 13 '25 20:02 TallTed

Whether or not the conflation of paged and unpaged collections was a good idea, it is already standardized and we will need to deal with the backwards compatibility for this in any new version.

It makes no sense to have a collection that is both unpaged and paged,

I'm not entirely convinced of this argument, although I don't have a counterexample ready at hand. Making a big change in the object model because we can't think of an example might not be a good strategy.

However, I think it's possible to make a new set of types, like PagedCollection and UnpagedCollection, which have restricted properties, and use those instead of the Collection type, maybe with multi-typing. Making this case in the description of those types could lead to deprecation of Collection in a later version.

I'd like to mark this as a Next Version issue, since re-visiting the Collection hierarchy is probably a good idea for that kind of exercise.

evanp avatar Feb 14 '25 17:02 evanp

This issue has been labelled as potentially needing a FEP, and contributors are welcome to submit a FEP on the topic. Note that issues may be closed without the FEP being created; that does not mean that the FEP is no longer needed.

github-actions[bot] avatar Feb 14 '25 17:02 github-actions[bot]