nips Add NIP-45 for COUNT


This is the beginning of a sort of second-layer indexing scheme. I came across the need for it when implementing follower counts. Currently I'm downloading ~4000 events in the case of jack and throwing them away just to get the count.

Jan 04 '23 04:01 staab

This is interesting, but I think it might be better to make a new message, separate from REQ, to just ask for counts. It could take the same filter format and just return the count of events instead of the events themselves.

Jan 04 '23 10:01 fiatjaf

A REQC command could be trivial for relays to implement using the existing REQ filter format; a response EVENTC message can return the count for each filter:

Example:

["REQC", { filter1 }, { filter2 }]
["EVENTC", n_filter1, n_filter2, ...]

Some relays may also artificially limit all result sets to avoid abuse. Counting rows can be very expensive, so relays should also implement abuse prevention for this type of request.

Jan 04 '23 11:01 v0l

I also think that it would be better to simply have a new command 'COUNT' that works exactly like 'REQ' but returns just the events.length:

request:

['COUNT', 'sub_id', {filter}]

response:

['COUNT', 'sub_id', 'num']

And what is the expected behaviour if you don't close the req? Should the relay keep sending new counts (i.e. num++), or is this a single message, like a NOTICE?
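The request/response pair proposed above can be sketched on the client side as follows. This is a minimal illustration, assuming the single-message behaviour and a numeric count in the third position; the function names `build_count_request` and `parse_count_response` are hypothetical and not part of any NIP.

```python
import json


def build_count_request(sub_id, *filters):
    """Serialize a COUNT request: ["COUNT", <sub_id>, <filter>, ...]."""
    return json.dumps(["COUNT", sub_id, *filters])


def parse_count_response(raw, sub_id):
    """Return the count from a ["COUNT", <sub_id>, <num>] reply.

    Returns None when the message is not a COUNT reply for this sub_id.
    Assumes the relay sends exactly one reply per request (no streaming).
    """
    msg = json.loads(raw)
    if not isinstance(msg, list) or len(msg) != 3:
        return None
    if msg[0] != "COUNT" or msg[1] != sub_id:
        return None
    return int(msg[2])
```

With a one-shot reply like this, the client can drop the subscription id as soon as the answer arrives, so no CLOSE handling is needed in the sketch.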

Jan 04 '23 15:01 eskema

I think I prefer @eskema's version as well. I don't think streaming count is useful enough for the complexity, so I would make it just a single message. I'll revise the NIP when I have a chance.

Jan 04 '23 16:01 staab

Updated the PR, let me know what you think. I'm slightly inclined to make this a little more open ended in case someone wants something other than a count, for example a unique list of pubkeys, or rounded time series data. But I suppose we can re-use the GROUP verb with a different request verb.

Jan 06 '23 01:01 staab

Not sure I like this approach. The groups seem unnecessary: why not send a new req for the counts you need instead of lumping them together like that? Or just return a count for each filter if you want multiples in a req; subgroups in the reqs seem complicated for no reason. Why would you lump a bunch of requests together and ask for separate counts?

Jan 06 '23 01:01 eskema

In order to get certain groups, you would have to know what filters to use, which would require retrieving all events in the first place. For example, notes by pubkey:

["COUNT", "", {filters: {kinds: [1], since: <timestamp>}, group_by: ['pubkey']}]
["GROUP", "", {group: [<pubkey1>], count: 238}]
["GROUP", "", {group: [<pubkey2>], count: 21}]

Jan 06 '23 01:01 staab

Isn't this trying to replace SQL with a JSON query language?

I don't have a clear reason for this, but it doesn't seem right to treat relays as general-purpose databases.

Jan 06 '23 01:01 fiatjaf

It does cover the GROUP BY case popularized by SQL, yes. Just like limit = limit, since/until = order by, offset, kind/#e/#p/author = where. Grouping is useful.

Jan 06 '23 01:01 staab

I think it should simply be:

["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey1']}]
["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey2']}]

i.e, you do the grouping yourself before asking the relay, or

["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey1']},  {kinds: [1], since: <timestamp>, authors: ['pubkey2']}]

and the response be count1, count2
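A relay-side sketch of "one count per filter", assuming events are held in an in-memory list of dicts with the usual `kind`, `pubkey` and `created_at` fields. Only the three filter keys used in the example above are handled, and authors are matched exactly; both are simplifications made for this illustration, and the function names are hypothetical.

```python
def matches(event, flt):
    """Check one event against the kinds/authors/since keys of a filter."""
    if "kinds" in flt and event["kind"] not in flt["kinds"]:
        return False
    if "authors" in flt and event["pubkey"] not in flt["authors"]:
        return False
    if "since" in flt and event["created_at"] < flt["since"]:
        return False
    return True


def count_per_filter(events, filters):
    """Return [count1, count2, ...], one entry per filter, in request order."""
    return [sum(1 for ev in events if matches(ev, flt)) for flt in filters]
```

Note that this differs from REQ semantics, where multiple filters are OR-ed into one result stream; here each filter is answered separately so the client can match counts to filters by position.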

Jan 06 '23 02:01 eskema

@fiatjaf I see what you mean though, this opens the door to more and more complexity, while second-layer protocols or centralized services can perform this function successfully. I'm just not sure where to draw the line. Probably this side of optimizations.

Jan 06 '23 02:01 staab

Alright, since groups seem to be unpopular I've removed them, please take another look.

Jan 11 '23 16:01 staab

Reactions are the one thing I see that would profit most from a "count" NIP, but in its current form they don't profit at all. As this example was not brought up in the discussion, please pardon me for beating this dead horse:

I think the most useful approach for light clients would be to stay close to the REQ syntax but optionally allow group_by. To still allow matching queries to results, the easiest approach is to have one array per query, with those using group_by also returning the fields they were grouped by.

['COUNT', '',
    {kinds: [7], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [1717171], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [3]}
    ]
['COUNT', '', 
    [{'#e': <eid1>, content: '', count: 12},
        {'#e': <eid1>, content: '+', count: 5},
        {'#e': <eid1>, content: '+ ', count: 1},
        {'#e': <eid1>, content: 'banana', count: 1},
        {'#e': <eid1>, content: '🤢', count: 3},
        {'#e': <eid1>, content: '🍆', count: 2},
        {'#e': <eid2>, content: '', count: 1}
    ],
    [],
    [{count: 5745}]]
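The grouped response above could be produced on the relay side roughly as follows. This is only a sketch under stated assumptions: the events are already restricted to the filter's kinds, a `#x` entry in group_by refers to the values of tag `x`, and tag values not listed in the filter are ignored. The function name `grouped_counts` is hypothetical.

```python
from collections import Counter
from itertools import product


def grouped_counts(events, group_by, flt):
    """Count events per combination of group_by fields.

    Returns a list of dicts such as {"#e": ..., "content": ..., "count": n}.
    """
    counts = Counter()
    for ev in events:
        per_field = []
        for field in group_by:
            if field.startswith("#"):
                # Tag field: collect the values of that tag on the event.
                values = {t[1] for t in ev["tags"] if len(t) > 1 and t[0] == field[1:]}
                wanted = set(flt.get(field, []))
                if wanted:
                    values &= wanted  # only group by values the client asked for
                per_field.append(sorted(values))
            else:
                # Plain event attribute such as "content" or "pubkey".
                per_field.append([ev[field]])
        # An event tagging several requested ids counts once per id.
        for combo in product(*per_field):
            counts[combo] += 1
    return [dict(zip(group_by, combo), count=n) for combo, n in sorted(counts.items())]
```

A filter without group_by would simply return `[{"count": len(events)}]`, which keeps every entry of the response an array, as proposed above.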

While I think the above is preferable, it could of course also be pruned by some implicitly given information:

['COUNT', '',
    {kinds: [7], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [1717171], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [3]}
    ]
['COUNT', '', 
    [[<eid1>,
        ['', 12],
        ['+', 5],
        ['+ ', 1],
        ['banana', 1],
        ['🤢', 3],
        ['🍆', 2]],
    [<eid2>,
        ['', 1]]],
    [],
    5745]

Jan 11 '23 18:01 Giszmo

Also, while I do feel pity for my server that has to process tons of data to return a simple "12432", we have to see what the most efficient overall way of achieving certain results is. If the server does 3x the work so that the client can do just 1/5 of the work, that is acceptable for some client devs and server operators.

I assume that at scale, users will pay one way or another for resources used on the servers and I'm totally fine with a free tier that does not see likes beyond likes from their follows for example. The idea here is to standardize what's useful for some and I totally see the usefulness for some in the group_by.

Jan 11 '23 19:01 Giszmo

Ack. I implemented this in wss://relay.nostr.band more than a month ago, but somehow the discussion got lost in the Telegram group. You can try ["COUNT","1",{"#p":["84dee6e676e5bb67b4ad4e042cf70cbd8681155db535942fcc6a0533858a7240"], "kinds":[3]}]

Jan 24 '23 17:01 brugeman

Concept ACK. I'd like to see this functionality; I will test relays with this capability and implement it in my client.

Jan 31 '23 21:01 pseudozach

How about adding that clients should check NIP-11 first to see if the relay supports NIP-45 before issuing COUNT requests?
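A sketch of that check, assuming the relay serves a NIP-11 relay information document (fetched over HTTP with the `Accept: application/nostr+json` header) whose `supported_nips` field is a list of integers. The helper names are hypothetical, and the URL rewriting is a simplification for illustration.

```python
import json
import urllib.request


def fetch_relay_info(relay_url):
    """Fetch the NIP-11 document from the relay's HTTP(S) endpoint."""
    http_url = relay_url.replace("wss://", "https://", 1).replace("ws://", "http://", 1)
    req = urllib.request.Request(http_url, headers={"Accept": "application/nostr+json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def supports_nip(relay_info, nip):
    """True if the relay information document advertises the given NIP number."""
    supported = relay_info.get("supported_nips", [])
    return isinstance(supported, list) and nip in supported
```

A client could call `supports_nip(fetch_relay_info(url), 45)` once per relay and fall back to downloading events when the result is false.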

Jan 31 '23 22:01 mikedilger

We should probably merge this.

Feb 01 '23 00:02 fiatjaf

Different relays have different events and counts, so how would you get to the final count? You can't just add them up. Right now you have to get events from multiple relays and filter out the duplicates, and that's the count from the client's perspective.

Feb 02 '23 06:02 fabianfabian

That's a super good point, I hadn't thought of that. I do think COUNT could still be useful as an approximation (since it is anyway), you could take the max of multiple results, or select which relay to request a count from.
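The two options discussed here can be contrasted in a short sketch. The function names and the dict-of-relays input shape are assumptions made for illustration. Taking the max gives a lower bound on the deduplicated total, while an exact client-side count still requires downloading the events and taking the union of their ids.

```python
def approximate_count(relay_counts):
    """Lower bound on the true total: the largest count any one relay reported.

    relay_counts maps relay URL -> number returned by that relay's COUNT reply.
    """
    return max(relay_counts.values(), default=0)


def exact_count(events_by_relay):
    """Exact count from the client's perspective: unique event ids across relays.

    events_by_relay maps relay URL -> list of downloaded events (dicts with "id").
    """
    seen = set()
    for events in events_by_relay.values():
        seen.update(ev["id"] for ev in events)
    return len(seen)
```

Summing the per-relay counts would instead give an upper bound, since any event stored on more than one relay is counted repeatedly.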

Feb 02 '23 22:02 staab

Aaand I've changed my mind, I've decided that COUNT is not useful. If we add relay extensions we could put something together that would be more complete.

Feb 15 '23 17:02 staab

Hi,

I'm adding COUNT to my metadata / contact indexing nodes (wss://us.rbr.bio and wss://eu.rbr.bio) for follower counts, and it was easy to implement, but I have a few questions:

Why {count: 30} when ["COUNT","subname",30] would already contain the information and is simpler? For group_by extensions it makes sense to have an array, but for simple counts I would prefer just a simple number (this is just a nitpick though).

The more interesting question: I would just like to support a few types of COUNT and group_by queries. I think the best would be if I could specify the types of queries my relay supports in the relay information document (although that should be handled in another NIP).

And the main question: how should the server reply if it doesn't handle the specific COUNT query? Should the relay just reply with a NOTICE, or EOSE, or do nothing? I guess NOTICE would be the most backwards compatible, but NOTICE doesn't contain the subscription name, as it just supports 1 message.

Mar 12 '23 20:03 adamritter

One more question: the NIP contains just "" as the second parameter, but I guess it's a label (maybe not a subscription) that helps identify the reply. This should be made clear in the documentation.

Mar 12 '23 20:03 adamritter

> I'm adding COUNT to my metadata / contact indexing nodes

This is exactly the place that COUNT makes sense. It's not good for direct use with multiple relays, but if you're using a relay multiplexer or indexer, COUNT can work.

To answer @adamritter's questions:

For simple counts I'm not using a number so that additions to the NIP can be made later without breaking backwards compatibility. One rule of thumb I've learned is to always return an associative data type from APIs that aren't easily refactored.
I would guess NOTICE if not supported, or just ignore the request, but I don't really know.
The "" parameter is the subscription id, standard with REQ etc. Probably a good idea to clarify though.

I'm not sure about selectively supporting COUNT attributes/groups, can you share the reason for that?

Mar 13 '23 17:03 staab

> I'm not sure about selectively supporting COUNT attributes/groups, can you share the reason for that?

I'm running 2 relay servers that hold all metadata and contacts from all relays in RAM and serve them (wss://us.rbr.bio and wss://eu.rbr.bio).

I implemented it with hash maps in RAM, I don't use any query engine.

Already added follower counts support, but it's also a specific hash map just for followers:

https://iris.to/npub1dcl4zejwr8sg9h6jzl75fy4mj6g8gpdqkfczseca6lef0d5gvzxqvux5ey
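To illustrate the idea of a dedicated in-RAM hash map for follower counts, here is a minimal sketch. It is not the actual rbr.bio implementation; the class and method names are hypothetical, and it assumes kind-3 contact lists arrive oldest first, so a later list from the same author replaces the earlier one.

```python
from collections import defaultdict


class FollowerIndex:
    """Keeps pubkey -> set of followers, updated from kind-3 contact lists."""

    def __init__(self):
        self.follows = {}                   # author -> set of pubkeys they follow
        self.followers = defaultdict(set)   # pubkey -> set of authors following it

    def apply_contact_list(self, event):
        """Replace the author's previous contact list with the one in `event`."""
        author = event["pubkey"]
        new = {t[1] for t in event["tags"] if len(t) > 1 and t[0] == "p"}
        old = self.follows.get(author, set())
        for gone in old - new:
            self.followers[gone].discard(author)
        for added in new - old:
            self.followers[added].add(author)
        self.follows[author] = new

    def follower_count(self, pubkey):
        """Answer a COUNT for {"#p": [pubkey], "kinds": [3]} without scanning events."""
        return len(self.followers.get(pubkey, ()))
```

Because the same map also stores who the followers are, a group_by on pubkey for this query shape can be answered from it directly, which is the kind of selective support described in this comment.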

I'm also planning to implement group_by authors/pubkey just for getting the list of all followers, as it's important and supported by the data structures in my server.

I was thinking of selective support because most relay implementations have secondary indices that efficiently support count for certain types of queries, but it may not be important to require those indices for group_by operations (for example, there are not that many reactions, so group_by is easy for reactions on the server even if there is no index for it).

To tell you the truth, I'm also not sure about this selective support (I'm trying to conform to the standard as much as possible), but what's more important is that even if we do it, it shouldn't be part of this NIP, so there's no need to take a decision.

There may be millions of group_by results though, so the group_by NIP should maybe specify a limit and return a limited number of results (relays usually have a default max limit anyway).
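A sketch of how a relay might cap grouped results. Nothing here is specified by the NIP: the cap of 1000, the choice to return the largest groups first, and the function name are all assumptions made for illustration.

```python
from collections import Counter

DEFAULT_MAX_GROUPS = 1000  # hypothetical relay-side cap, not defined in any NIP


def limited_groups(keys, limit=None, max_limit=DEFAULT_MAX_GROUPS):
    """Count occurrences of each group key and return at most `limit` groups.

    The client's limit is clamped to the relay's own maximum. Groups are
    returned largest first, as (key, count) pairs.
    """
    effective = max_limit if limit is None else min(limit, max_limit)
    return Counter(keys).most_common(effective)
```

Ordering by count means a truncated answer still contains the most significant groups, but a NIP would need to state the ordering explicitly for clients to rely on it.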

Mar 13 '23 20:03 adamritter

The current specification doesn't specify if multiple counts are supported or not in 1 query (I think they should be supported).

Right now there are both {count: 3} and [{count:3}] as suggestions for simple counts. I prefer {count:3}, with arrays used for group by when we extend count to support it (which is needed).

Mar 13 '23 22:03 adamritter

I implemented basic group_by support for getting followers on wss://us.rbr.bio:

["COUNT","hello",
      {"#p":["85080d3bad70ccdcd7f74c29a44f55bb85cbcd3dd0cbb957da1d215bdb931204"],"kinds":[3]},
      {"#p":["85080d3bad70ccdcd7f74c29a44f55bb85cbcd3dd0cbb957da1d215bdb931204"],"kinds":[3],"group_by":["pubkey"]}]

["COUNT","hello",{"count":50538}, 
  [{"pubkey":"82341f882b6eabcd2ba7f1ef90aad961cf074af15b9ef44a09f9d2a8fbfbe6a2","count":1},...

(returns top 1000 followers by popularity)

Mar 14 '23 01:03 adamritter

I've re-introduced group_by to fit @adamritter's implementation. If relays prefer not to implement it, they can raise a NOTICE or ignore the group_by key. I think we should go ahead and merge this, as there are multiple implementations in the wild, and despite the problems with merging results from a relay, COUNT can be done more reliably by either indexers or multiplexers.

Mar 31 '23 14:03 staab

This still looks like it is not really solving any problems, especially the group_by clause, which makes no sense to me.

Mar 31 '23 15:03 fiatjaf

Can you elaborate? Currently coracle only uses COUNT in one place, but it allows me to avoid downloading many megabytes of data to populate a single number. A couple use cases for group_by are enumerated above. Your comment that it might be burdensome to relays is valid, but it's opt-in as written, and can be useful for analytics as well as more common uses.

Mar 31 '23 15:03 staab
