nips icon indicating copy to clipboard operation
nips copied to clipboard

Add NIP-45 for COUNT

Open staab opened this issue 2 years ago • 21 comments
trafficstars

This is the beginning of sort of a second-layer indexing scheme, I came across the need for it when implementing follower count. Currently I'm downloading ~4000 events in the case of jack, and throwing them away just to get the count.

staab avatar Jan 04 '23 04:01 staab

This is interesting, but I think it might be better to make a new message, separate from REQ, to just ask for counts. It could take the same filter format and just return the count of events instead of the events themselves.

fiatjaf avatar Jan 04 '23 10:01 fiatjaf

A REQC command could be trivial to implement for relays using the existing REQ filter format, a response EVENTC message can return the count for each filter:

Example: ["REQC", { filter1 }, { filter2 }] ["EVENTC", n_filter1, n_filter2, ...]

Some relays may also artificially limit all results sets to avoid abuse, counting rows can be very expensive so relays should also implement abuse prevention for this type

v0l avatar Jan 04 '23 11:01 v0l

I also think that it would be better to simply have a new command 'COUNT' that works exactly like 'REQ' but returns just the events.length:

request:

['COUNT', 'sub_id', {filter}]

response:

['COUNT', 'sub_id', 'num']

and what is the expected behaviour if you don't close the req? should the relay keep sending new counts (i.e. num++ ) or is this like a Notice single message?

eskema avatar Jan 04 '23 15:01 eskema

I think I prefer @eskema's version as well. I don't think streaming count is useful enough for the complexity, so I would make it just a single message. I'll revise the NIP when I have a chance.

staab avatar Jan 04 '23 16:01 staab

Updated the PR, let me know what you think. I'm slightly inclined to make this a little more open ended in case someone wants something other than a count, for example a unique list of pubkeys, or rounded time series data. But I suppose we can re-use the GROUP verb with a different request verb.

staab avatar Jan 06 '23 01:01 staab

not sure I like this approach, the groups seem unnecessary, why not send a new req for the counts you need instead of lumping it like that? or, just return a count for each filter if you want multiples in a req, but subgroups in the reqs seem complicated for no reason. why woold you lump a bunch of requests and ask separate counts?

eskema avatar Jan 06 '23 01:01 eskema

In order to get certain groups, you would have to know what filters to use, which would require retrieving all events in the first place. For example, notes by pubkey:

["COUNT", "", {filters: {kinds: [1], since: <timestamp>}, group_by: ['pubkey']}]
["GROUP", "", {group: [<pubkey1>], count: 238}]
["GROUP", "", {group: [<pubkey1>], count: 21}]

staab avatar Jan 06 '23 01:01 staab

Isn't this trying to replace SQL with a JSON query language?

I don't have a clear reason for this, but it doesn't seem right to treat relays as general-purpose databases.

fiatjaf avatar Jan 06 '23 01:01 fiatjaf

It does cover the GROUP BY case popularized by sql, yes. Just like limit = limit, since/until = order by, offset, kind/#e/#p/author = where. Grouping is useful.

staab avatar Jan 06 '23 01:01 staab

I think it should simply be:

["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey1']}]
["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey2']}]

i.e, you do the grouping yourself before asking the relay, or

["COUNT", "", {kinds: [1], since: <timestamp>, authors: ['pubkey1']},  {kinds: [1], since: <timestamp>, authors: ['pubkey2']}]

and the response be count1, count2

eskema avatar Jan 06 '23 02:01 eskema

@fiatjaf I see what you mean though, this opens the door to more and more complexity, while second-layer protocols or centralized services can perform this function successfully. I'm just not sure where to draw the line. Probably this side of optimizations.

staab avatar Jan 06 '23 02:01 staab

Alright, since groups seem to be unpopular I've removed them, please take another look.

staab avatar Jan 11 '23 16:01 staab

Reactions are the one thing I see that would most profit from a "count" nip but in its current form it doesn't profit at all and as the example was not brought up in the discussion, please pardon for beating this dead horse:

I think the most useful for light clients would be to stay close to the REQ syntax but optionally allow to group_by. To still allow matching queries to results, the easiest approach is to have an array per query, with those using group_by returning the fields grouped by, too.

['COUNT', '',
    {kinds: [7], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [1717171], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [3]}
    ]
['COUNT', '', 
    [{'#e': <eid1>, content: '', count: 12},
        {'#e': <eid1>, content: '+', count: 5},
        {'#e': <eid1>, content: '+ ', count: 1},
        {'#e': <eid1>, content: 'banana', count: 1},
        {'#e': <eid1>, content: '🤢', count: 3},
        {'#e': <eid1>, content: '🍆', count: 2},
        {'#e': <eid2>, content: '', count: 1}
    ],
    [],
    [{count: 5745}]]

While I think the above is preferable, it could of course also be pruned by some implicitly given information:

['COUNT', '',
    {kinds: [7], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [1717171], '#e': [<eid1>, <eid2>], group_by: ['#e', 'content']},
    {kinds: [3]}
    ]
['COUNT', '', 
    [[<eid1>,
        ['', 12],
        ['+', 5],
        ['+ ', 1],
        ['banana', 1],
        ['🤢', 3],
        ['🍆', 2]],
    [[<eid2>,
        ['', 1]],
    [],
    5745]

Giszmo avatar Jan 11 '23 18:01 Giszmo

Also while I do feel pity for my server that has to process tons of data to return some simple "12432", we have to see what is the overall most efficient way of achieving certain results and if the server doing x3 the work so the client can do just 1/5 of the work then that is acceptable for some client devs and server operators.

I assume that at scale, users will pay one way or another for resources used on the servers and I'm totally fine with a free tier that does not see likes beyond likes from their follows for example. The idea here is to standardize what's useful for some and I totally see the usefulness for some in the group_by.

Giszmo avatar Jan 11 '23 19:01 Giszmo

Ack, I've had this implemented in wss://relay.nostr.band more than a month ago, somehow discussion got lost in the telegram group, you can try ["COUNT","1",{"#p":["84dee6e676e5bb67b4ad4e042cf70cbd8681155db535942fcc6a0533858a7240"], "kinds":[3]}]

brugeman avatar Jan 24 '23 17:01 brugeman

Concept ACK I'd like to see this functionality, will test relays with this capability and implement it in my client.

pseudozach avatar Jan 31 '23 21:01 pseudozach

How about adding that clients should check NIP-11 first to see if they support NIP-45 before issuing COUNT requests.

mikedilger avatar Jan 31 '23 22:01 mikedilger

We should probably merge this.

fiatjaf avatar Feb 01 '23 00:02 fiatjaf

Different relays have different events and counts, so how would you get to the final count? You can't just add them up. Right now you have to get different events from multiple relays, filter the duplicates and thats the count from the client perspective.

fabianfabian avatar Feb 02 '23 06:02 fabianfabian

That's a super good point, I hadn't thought of that. I do think COUNT could still be useful as an approximation (since it is anyway), you could take the max of multiple results, or select which relay to request a count from.

staab avatar Feb 02 '23 22:02 staab

Aaand I've changed my mind, I've decided that COUNT is not useful. If we add relay extensions we could put something together that would be more complete.

staab avatar Feb 15 '23 17:02 staab

Hi,

I'm adding COUNT to my metadata / contact indexing nodes (wss://us.rbr.bio and wss://eu.rbr.bio) for follower counts, and it was easy to implement, but I have a few questions:

Why {count: 30} when ["COUNT","subname",30] would already contain the information and is simpler? For group_by extensions it makes sense to have an array, but for simple counts I would prefer just a simple number (this is just a nitpick though).

The more interesting question: I would just like to support a few types of COUNT and group_by queries. I think the best would be if I could specify the types of queries my relay supports in the relay information document (although that should be handled in another NIP).

And the main question: how should the server reply if it doesn't handle the specific COUNT query? Should the relay just reply with a NOTICE, or EOSE, or do nothing? I guess NOTICE would be the most backwards compatible, but NOTICE doesn't contain the subscription name, as it just supports 1 message.

adamritter avatar Mar 12 '23 20:03 adamritter

One more extra question: the NIP contains just "" as the second parameter, but I guess it's a label (maybe not subscription) that helps identifying the reply. It should be clear in the documentation.

adamritter avatar Mar 12 '23 20:03 adamritter

I'm adding COUNT to my metadata / contact indexing nodes

This is exactly the place that COUNT makes sense. It's not good for direct use with multiple relays, but if you're using a relay multiplexer or indexer, COUNT can work.

To answer @adamritter's questions:

  • For simple counts I'm not using a number so that additions to the NIP can be made later without breaking backwards compatibility. One rule of thumb I've learned is always return an associative data type from APIs that aren't easily refactored.
  • I would guess NOTICE if not supported, or just ignore the request, but I don't really know.
  • The "" parameter is the subscription id, standard with REQ etc. Probably a good idea to clarify though

I'm not sure about selectively supporting COUNT attributes/groups, can you share the reason for that?

staab avatar Mar 13 '23 17:03 staab

I'm not sure about selectively supporting COUNT attributes/groups, can you share the reason for that?

I'm running 2 relay servers that hold all metadata and contacts of all relays in RAM and serves them (wss://us.rbr.bio and wss://eu.rbr.bio).

I implemented it with hash maps in RAM, I don't use any query engine.

Already added follower counts support, but it's also a specific hash map just for followers:

https://iris.to/npub1dcl4zejwr8sg9h6jzl75fy4mj6g8gpdqkfczseca6lef0d5gvzxqvux5ey

I'm also planning to implement group_by authors/pubkey just for getting the list of all followers, as it's important and supported by the data structures in my server.

I was thinking of the selective supporting because most relay implementations have secondary indices that efficiently support count for certain types of queries, but it maybe not important to require those indices to be used for group_by operations (for example the reactions are not that many so group_by is easy for reactions on the server even if there is no index for it).

To tell you the truth, I'm also not sure about this selective support (I'm trying to conform to the standard as much as possible), but what's more important is that even if we do it, it shouldn't be part of this NIP, so there's no need to take a decision.

There may be millions of group by results though, so the group_by NIP should maybe specify a limit: and returning limited number of results (relays usually have a default max limit anyways).

adamritter avatar Mar 13 '23 20:03 adamritter

The current specification doesn't specify if multiple counts are supported or not in 1 query (I think they should be supported).

Right now there's both {count: 3} and [{count:3}] as suggestions for simple counts, I prefer {count:3}, and using arrays for group by when we extend count to support group by (which is needed)

adamritter avatar Mar 13 '23 22:03 adamritter

I implemented basic group_by support for getting followers on wss://us.rbr.bio:

["COUNT","hello",
      {"#p":["85080d3bad70ccdcd7f74c29a44f55bb85cbcd3dd0cbb957da1d215bdb931204"],"kinds":[3]},
      {"#p":["85080d3bad70ccdcd7f74c29a44f55bb85cbcd3dd0cbb957da1d215bdb931204"],"kinds":[3],"group_by":["pubkey"]}]

["COUNT","hello",{"count":50538}, 
  [{"pubkey":"82341f882b6eabcd2ba7f1ef90aad961cf074af15b9ef44a09f9d2a8fbfbe6a2","count":1},...

(returns top 1000 followers by popularity)

adamritter avatar Mar 14 '23 01:03 adamritter

I've re-introduced group_by to fit @adamritter's implementation. If relays prefer not to implement it, they can raise a NOTICE or ignore the group_by key. I think we should go ahead and merge this, as there are multiple implementations in the wild, and despite the problems with merging results from a relay, COUNT can be done more reliably by either indexers or multiplexers.

staab avatar Mar 31 '23 14:03 staab

This still looks like it is not really solving any problems, specially the group_by clause, it makes no sense to me.

fiatjaf avatar Mar 31 '23 15:03 fiatjaf

Can you elaborate? Currently coracle only uses COUNT in one place, but it allows me to avoid downloading many megabytes of data to populate a single number. A couple use cases for group_by are enumerated above. Your comment that it might be burdensome to relays is valid, but it's opt-in as written, and can be useful for analytics as well as more common uses.

staab avatar Mar 31 '23 15:03 staab