
Nostr apps quickly bring users' data subscriptions to their monthly limit

Open Saibato opened this issue 2 years ago • 6 comments

Has anyone thought about smart filters, or is there already a NIP for this? I.e. clients could ask relays, e.g. using Bloom filters, not to double-fetch events they already have. I saw @mmalmi suggest that to reduce excessive data flow.

Saibato avatar Jan 06 '23 19:01 Saibato

I'd suggest a filter like { not: { ids: bloomFilter }, ... } where the bloom filter contains the event ids you already have.

I'll use it in Iris if someone has time to implement it on the relay side.
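The `not`/`ids` shape is only a proposal, but the client side can be sketched in a few lines of Python. Everything here is illustrative: the filter key names, the salted-hash construction, and the hex encoding of the bit array are my assumptions, not part of any NIP.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over hex event ids (illustrative sketch only)."""

    def __init__(self, m_bits=1024, k=3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, event_id):
        # k bit positions derived from salted SHA-256 of the id
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{event_id}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, event_id):
        for p in self._positions(event_id):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, event_id):
        # False means definitely absent; True means probably present
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(event_id))

known = BloomFilter()
known.add("a" * 64)  # an event id we already have

# hypothetical REQ filter using the proposed `not` extension
req_filter = {"kinds": [1], "not": {"ids": known.bits.hex()}}
```

A relay supporting this would test each matching event against the bit array and drop the probable duplicates before sending.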

mmalmi avatar Jan 07 '23 10:01 mmalmi

Another filter suggestion: { fields: ['id'], ... } so you get only the ids of the events that match your query.

Then you can 1) ask for only the events that you don't already have and 2) ask only one relay at a time. That's useful when you're connected to 10 relays and they might all send you the same events. You could save up to 90% bandwidth.

Bitcoin does something similar by advertising new transaction and block hashes with inv messages and only sending the full item in response to a getdata request.
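The inv/getdata pattern translates directly: first ask relays for ids only (the hypothetical `fields: ['id']` filter above), then fetch each full event once, from one relay. A minimal sketch with made-up ids:

```python
def events_to_fetch(advertised_ids, have_ids):
    """inv/getdata-style phase 2: from the ids one relay advertised
    (a hypothetical ids-only reply), keep only what we still lack."""
    return [i for i in advertised_ids if i not in have_ids]

have = {"e2"}                    # ids already received elsewhere
advertised = ["e1", "e2", "e3"]  # ids-only answer from one relay
missing = events_to_fetch(advertised, have)
# now send a normal REQ with {'ids': missing} to that relay only
```

The saving comes from id announcements being tiny compared to full signed events.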

mmalmi avatar Jan 07 '23 10:01 mmalmi

Bloom filters can have false positives, so they might contain IDs that they are not intended to contain.

zzo38 avatar Jan 07 '23 20:01 zzo38

> Bloom filters can have false positives, so they might contain IDs that they are not intended to contain.

Yes, but you can set the false positive rate low enough that it doesn't matter in "send me a history of everything" type queries. Whereas in DMs for example you might not want to miss even one in a million. And for future message subscriptions it would be counterproductive.
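For sizing, the standard Bloom filter formulas apply: m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. A quick sketch of what "low enough" costs (the numbers in the comment are approximate):

```python
import math

def bloom_size(n_items, fp_rate):
    """Bits (m) and hash count (k) for n items at a target
    false positive rate, using the classic sizing formulas."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k

# 10,000 known ids at a one-in-a-million false positive rate:
m, k = bloom_size(10_000, 1e-6)  # roughly 288k bits (~35 KB), k = 20
```

So "history of everything" queries can afford a generous filter, while the one-in-a-million DM case would already cost ~35 KB per 10k ids.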

mmalmi avatar Jan 07 '23 20:01 mmalmi

Yeah, there's no perfect compromise. I'm also testing a Cuckoo filter right now, but Bloom filters are more commonly understood, so that's the preferred way; I will post something simple in Python, quick and dirty.

With so much data on relays in the future, the filter size might become a problem? Outbound data counts too.

I also thought about a set of keys that pre-sorts the relay database. It's highly unlikely to find fast collisions when using just 8 random or fixed bits from an id hash as an additional key; the same, I guess, as using a fast trapdoor function with a shorter bit length than SHA-256 for the Bloom bits. That might reduce the filter length and processing time. But such tweaks are probably not suitable for a widely accepted general sketch of how to do such filters.

> { not: { ids: bloomFilter }, ... }

I guess I'll soon have a simple relay ready that understands that tag, just for testing. Just for fun, and to learn some nostr and Iris.

> You could save up to 90% bandwidth

E.g. I guess bandwidth can also be greatly reduced if metadata, like profile picture links and the pictures themselves, is prioritized when filtering out duplicates, thereby preventing duplicate external network fetches on the client side.

Some measures could also be taken on the client side? To prevent spam inbound on the open websocket after the initial filter post, when successive posts are expected, one can assume there will be spammy relays. Just brainstorming here :shrug:, no idea how much of that is already done.

Saibato avatar Jan 07 '23 21:01 Saibato

Since we do not want to miss a single id, I might have found a fast and working sketch. We might not need classic Bloom or Cuckoo filters, since the id is already a good hash with enough entropy, and by changing the primes on the fly we can quickly build new filters to mitigate initial short-id collisions. So after a short time, ids that may be missing from the lead relay are found on the sub relays via new REQ calls.

In a real scenario those primes, generated on the fly, could vary in length per relay to save more, or be longer to mitigate collisions, so no relay gets the same filter. In essence this filter just builds short ids from the 32-byte id hash to save space, since we only save real bandwidth if the outbound data is small enough times n relay calls. With 25 bits as in the sample patch, we can probably call up to 40 relays before it makes no difference, and with just 5 or 10 we should save roughly 90%.

So if the first relay is the main one we call for bulk messages, we treat the others only as backups that need to give us only the events we don't already have. The sketch works just as easily the other way around, where relays themselves would signal a new prime filter.
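The scheme can be shown without the nostrpy plumbing. This is my own stdlib-only restatement of the patch's idea (the patch itself uses sympy.isprime); trial division is plenty at 25 bits:

```python
import random

def random_prime(bits=25):
    """Random prime of `bits` bits; each relay would get a
    different modulus so no two see the same filter."""
    while True:
        p = random.getrandbits(bits) | (1 << (bits - 1)) | 1  # odd, top bit set
        if all(p % d for d in range(3, int(p ** 0.5) + 1, 2)):
            return p

def short_id(event_id_hex, prime):
    # the 32-byte id is already uniformly distributed, so taking it
    # mod a ~25-bit prime gives a cheap short fingerprint
    return int(event_id_hex, 16) % prime

prime = random_prime()
have = {short_id("ab" * 32, prime)}  # fingerprints of events we hold
# relay side: send an event only if its fingerprint is not in `have`
```

Collisions just mean an occasional event is withheld by a backup relay; a fresh prime on the next REQ changes which ids collide.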

I guess if this is tested a bit, a new NIP-XX could be created to advertise such filters to relays, as an option for people who care about bandwidth usage.

Here is a diff against the nostrpy server/client source, which has to be patched to test the fast short prime filter: https://github.com/monty888/nostrpy

When patched, it exposes a relay that understands this tag in the REQ connect: { 'not': [{'filter': ['<IDS_mod_prime_in_hex>', ...], 'mod': '<random_primenumber_in_hex>'}] }

on localhost:8081

The filter could also later be a bitfield, but for now it's a list.

git clone https://github.com/monty888/nostrpy
cd nostrpy
python3 -m pip install -r requirements.txt

Copy the prime filter patch (see "Click here to see the patch" below) into a new file, patch.txt.

patch -p1 < patch.txt

Then run python3 relay_mirror.py to get some data, then python3 run_relay.py to serve it locally, and in a second console python3 cmd_event_view.py to see the filter building up. On the next start of the relay, it will only send events not matching that filter.

Click here :+1: to see the patch
diff --git a/cmd_event_view.py b/cmd_event_view.py
index 9eec93a..c620dc8 100644
--- a/cmd_event_view.py
+++ b/cmd_event_view.py
@@ -23,6 +23,8 @@ from nostr.event.event import Event
 from nostr.encrypt import Keys
 from app.post import PostApp
 from cmd_line.util import FormattedEventPrinter
+import random
+from sympy import isprime

 # TODO: also postgres
 WORK_DIR = '/home/%s/.nostrpy/' % Path.home().name
@@ -31,9 +33,9 @@ DB_FILE = '%s/tmp.db' % WORK_DIR
 # RELAYS = ['wss://rsslay.fiatjaf.com','wss://nostr-pub.wellorder.net']
 # RELAYS = ['wss://rsslay.fiatjaf.com']
 # RELAYS = ['wss://relay.damus.io']
-RELAYS = ['wss://relay.damus.io','ws://localhost:8081']
+# RELAYS = ['wss://relay.damus.io','ws://localhost:8081']
 # RELAYS = ['wss://nostr-pub.wellorder.net']
-# RELAYS = ['ws://localhost:8081']
+RELAYS = ['ws://localhost:8081']
 # AS_PROFILE = None
 # VIEW_PROFILE = None
 # INBOX = None
@@ -55,6 +57,9 @@ usage:
     sys.exit(2)


+FILTER = [{'filter': [], 'mod' : 0x1fffffff }]
+
+
 class ConfigException(Exception):
     pass

@@ -307,6 +312,8 @@ def run_watch(config):
             'since': util_funcs.date_as_ticks(since),
             'kinds': [Event.KIND_TEXT_NOTE, Event.KIND_ENCRYPT]
         }
+        e_filter['not'] = FILTER
+        # [{'filter': ['290253260',' 507307490','357358348'], 'mod' : 0x1fffffff }]
         if until:
             e_filter['until'] = until
         # note in the case of wss://rsslay.fiatjaf.com it looks like author is required to receive anything
@@ -332,6 +339,16 @@ def run_watch(config):
                                      share_keys=share_keys)

     def my_display(sub_id, evt: Event, relay):
+        c_evt = evt
+        mod_v = 0x1fffffff
+        for mod in FILTER:
+            mod_v = mod['mod']
+        h = int(c_evt.id,16) % mod_v
+        print ('data testdebug %s %d' % (c_evt.id, h))
+
+        for filter in FILTER:
+           filter['filter'].append(str(hex(h)))
+
         my_print.print_event(evt)

     my_printer.display_func = my_display
@@ -383,6 +400,15 @@ if __name__ == "__main__":
     # logging.getLogger().setLevel(logging.DEBUG)
     util_funcs.create_work_dir(WORK_DIR)
     util_funcs.create_sqlite_store(DB_FILE)
+    #gen a filter prime of n bits
+    n = 25
+    prime = 4 #:-)
+    while not isprime(prime):
+        prime = random.getrandbits(n)
+        prime |= (1 << n)
+
+    for mod in FILTER:
+      mod['mod'] = prime
     run_event_view()
     # client = Client('ws://localhost:8081').start()
     # client.query([{'kinds': [0], 'authors': []}])
diff --git a/nostr/event/event.py b/nostr/event/event.py
index 6ddb143..a063f52 100644
--- a/nostr/event/event.py
+++ b/nostr/event/event.py
@@ -211,6 +211,7 @@ class Event:

         return ret

+
     def __init__(self, id=None, sig=None, kind=None, content=None, tags=None, pub_key=None, created_at=None):
         self._id = id
         self._sig = sig
diff --git a/nostr/relay/relay.py b/nostr/relay/relay.py
index 5030857..83b96a5 100644
--- a/nostr/relay/relay.py
+++ b/nostr/relay/relay.py
@@ -52,6 +52,7 @@ class Relay:

     """
     VALID_CMDS = ['EVENT', 'REQ', 'CLOSE']
+    not_ar = None

     def __init__(self, store: RelayEventStoreInterface,
                  accept_req_handler=None,
@@ -60,7 +61,8 @@ class Relay:
                  description: str = None,
                  pubkey: str = None,
                  contact: str = None,
-                 enable_nip15=False):
+                 enable_nip15=False,
+                 not_ar = None):
         self._app = Bottle()
         # self._web_sockets = {}

@@ -313,8 +315,13 @@ class Relay:
             'id': sub_id,
             'filter': filter
         }
-
+
         logging.info('Relay::_do_sub subscription added %s (%s)' % (sub_id, filter))
+        for c_myarr in filter:
+         if 'not' in c_myarr:
+           self.not_ar = c_myarr['not']
+           logging.info('Relay::_do_sub not primef %s' % (self.not_ar))
+

         # post back the pre existing
         evts = self._store.get_filter(filter)
@@ -354,12 +361,31 @@ class Relay:
     def _do_send(self, ws: WebSocket, data, lock: BoundedSemaphore):
         try:
             with lock:
-                ws.send(json.dumps(data))
+                 ws.send(json.dumps(data))
         except Exception as e:
             logging.info('Relay::_do_send error: %s' % e)

     def _send_event(self, ws: WebSocket, sub_id, evt, lock: BoundedSemaphore):
-        self._do_send(ws=ws,
+        c_evt=evt
+        send_it = True
+        pf = self.not_ar[0]
+        for test in self.not_ar:
+            if 'mod' in test:
+                mod_n = test['mod']
+
+        h = int(c_evt['id'],16) % mod_n
+        for da in pf['filter']:
+            b = int(da,16)
+            if h == b :
+                send_it = False
+                logging.info('Relay::_do_send data match %s %s' % ( self.not_ar, hex(h)))
+            else:
+                logging.info('Relay::_do_send data nomatch %s %s %s' % ( self.not_ar, hex(h), hex(b) ))
+        #logging.info('Relay::_do_send data testdebug %s %s %d' % (c_evt['id'], self.not_ar, h))
+        #input('return')
+
+        if send_it:
+             self._do_send(ws=ws,
                       data=[
                           'EVENT',
                           sub_id,
diff --git a/requirements.txt b/requirements.txt
index b525244..1f28da8 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -50,6 +50,7 @@ six==1.16.0
 soupsieve==2.3.2.post1
 spake2==0.8
 stem==1.8.0
+sympy==1.10.1
 toml==0.10.2
 tqdm==4.64.0
 Twisted==22.4.0


As of now, the easy way to save bandwidth is to just use few relays, but for mobile users that will quickly centralize the choices.

Saibato avatar Jan 08 '23 22:01 Saibato

I've been thinking a lot about bloom filters too. This should address some of the motivation behind #515. It might even have way more impact than another encoding ... I would guess a lot of the events clients receive are duplicates, yet to determine they are duplicates the JSON still needs to be parsed.

If nostr is working well, duplicates are expected: the more duplicates, the more censorship resistance. If on average I receive each event I query for 5 times, bloom filters would be an 80% bandwidth improvement.

huumn avatar May 13 '23 15:05 huumn

I have a feeling that clients only get so many duplicates because they ask all relays for all the things all the time, instead of treating relays differently and asking each relay for different sets of events per public key.
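fiatjaf's suggestion can be sketched concretely: build a different REQ filter per relay from each author's known relays (e.g. from NIP-65 relay lists). The pubkeys and relay URLs below are made-up examples, and the mapping itself is assumed to come from some discovery step outside this sketch.

```python
def per_relay_filters(author_relays, kinds=(1,)):
    """Ask each relay only for the authors known to publish there,
    instead of asking every relay for everything."""
    filters = {}
    for author, relays in author_relays.items():
        for relay in relays:
            f = filters.setdefault(relay, {"kinds": list(kinds), "authors": []})
            f["authors"].append(author)
    return filters

# hypothetical pubkeys and relay lists
reqs = per_relay_filters({
    "alice_pub": ["wss://relay-a"],
    "bob_pub": ["wss://relay-a", "wss://relay-b"],
})
# relay-a is asked for alice and bob; relay-b only for bob
```

With this approach most duplicates never leave the relay, at the cost of the client having to maintain per-author relay hints.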

fiatjaf avatar May 13 '23 16:05 fiatjaf

> instead of treating relays differently and asking each relay for different sets of events per public key

True but this sounds difficult to communicate/encourage.

huumn avatar May 13 '23 18:05 huumn

Before jumping to solutions, the following questions come to mind. What kind of data usage have you seen, with which apps, and for how many followers? What's the split between notes and referenced images/video?

I've seen Amethyst pulling 64GB of data since May 1st with one paid relay plus the default ~20 relays and settings it ships with.

Would be a great topic for a survey.

weex avatar May 19 '23 21:05 weex