
Discovered nodes filtering isn't strict enough

Open abitofevrything opened this issue 6 months ago • 9 comments

  • [x] I have checked the existing issues to avoid duplicates
  • [x] I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue

Describe the bug

The DHT crawler filters newly discovered nodes before forwarding them to the rest of the crawler pipeline. Unfortunately, this filtering is based on the table.addrs address map, which doesn't contain all nodes seen by the crawler recently.

This results in the crawler attempting to scrape the same nodes 50+ times in a few minutes from my testing. In the long run, only 70% of nodes that the discovered_nodes component forwards to the rest of the pipeline are actually new nodes.

To Reproduce

Steps to reproduce the behavior:

N/A

Expected behavior

The filtering should be a bit more advanced. I implemented a very naive replacement that just maintains a set of all discovered nodes and skips nodes that have already been seen. This resulted in an ~80% increase in discovered torrents per unit of time.

This implementation is far from ideal. First, it leads to ever-growing memory consumption, eventually resulting in OOMs in my setup. Second, it doesn't discriminate based on what the node was eventually forwarded for - it might be interesting to sample infohashes from a node that was previously only pinged. Finally, nodes should not be excluded forever just because they were visited once; a real implementation should have a timeout after which a node can be visited again.
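For reference, the naive version I ran was along these lines (a simplified sketch with made-up names, not the exact code from my fork):

package nodefilter

import "sync"

// seenNodes is the naive filter: an unbounded set of node addresses.
// Memory grows forever, which is what eventually causes the OOMs.
type seenNodes struct {
    mu   sync.Mutex
    seen map[string]struct{} // keyed by "ip:port"
}

func newSeenNodes() *seenNodes {
    return &seenNodes{seen: make(map[string]struct{})}
}

// firstSighting reports whether addr has never been seen before and records it.
func (s *seenNodes) firstSighting(addr string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.seen[addr]; ok {
        return false
    }
    s.seen[addr] = struct{}{}
    return true
}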

Environment Information (Required)

  • Bitmagnet version: latest git
  • OS and version: Linux 6.14.10-arch1-1
  • Please specify any config values for which you have overridden the defaults: the longer running tests were run with dht_crawler.scaling_factor = 20, but I observed similar results with other scaling factors during shorter runs too.

abitofevrything avatar Jun 22 '25 15:06 abitofevrything

I am open to contributing a fix for this issue; I would just appreciate some guidance on what a more advanced filtering system could look like.

abitofevrything avatar Jun 22 '25 15:06 abitofevrything

  • For a cache with expiry I have been using this ttlcache.
  • For managing the size of the counts map and allowing a node to be visited again after a timeout, it could be used as a drop-in replacement for the map (see the sketch below).
  • I think the logging (fmt.Printf()) should go through the logger.
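A rough sketch of what I have in mind, assuming the jellydator/ttlcache v3 generics API (the exact package and API may differ from the one linked above):

package nodefilter

import (
    "time"

    "github.com/jellydator/ttlcache/v3"
)

// recentlySeen skips a node only for a bounded window after it was last
// forwarded; once the TTL expires it becomes eligible again, and the cache
// keeps memory bounded by evicting expired entries.
type recentlySeen struct {
    cache *ttlcache.Cache[string, struct{}]
}

func newRecentlySeen(ttl time.Duration) *recentlySeen {
    c := ttlcache.New[string, struct{}](
        ttlcache.WithTTL[string, struct{}](ttl),
        // Don't extend the TTL every time the node is rediscovered, otherwise
        // a frequently rediscovered node would be skipped indefinitely.
        ttlcache.WithDisableTouchOnHit[string, struct{}](),
    )
    go c.Start() // background loop that removes expired entries
    return &recentlySeen{cache: c}
}

// shouldForward reports whether addr was not forwarded within the TTL window,
// and records it if so.
func (r *recentlySeen) shouldForward(addr string) bool {
    if r.cache.Get(addr) != nil {
        return false // still within the TTL window
    }
    r.cache.Set(addr, struct{}{}, ttlcache.DefaultTTL)
    return true
}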

I always run a custom build of bitmagnet. Happy to work with you on this improvement - from your analysis it could be quite significant.

rraymondgh avatar Jun 24 '25 11:06 rraymondgh

I believe this issue actually stems from a very small omission: the kTable factory fails to set the peer ID for a given address when the node is added to the table for the first time (currently it only does so when a node updates its address). Running some tests at the moment.

Fixing this wouldn't address the issue of reaching out multiple times to failing nodes, but it would prevent reaching out to the same node and getting the same results multiple times.

abitofevrything avatar Jun 24 '25 11:06 abitofevrything

As for your comments, all of my code on my fork was mostly just thrown together for debugging - it was never anything serious. An actual implementation similar to what I was doing would probably be based on something like a lossy hash table (also known as an "inverse bloom filter", only gives false negatives as opposed to false positives) that stores the last time at which a node was visited. Unfortunately that would require a custom implementation as I can't find one already made for Go, so I'm trying to avoid that.
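To sketch what I mean (hand-rolled, since I can't point at an existing Go library): each node address hashes to a single slot, and a colliding node simply overwrites the previous entry, so lookups can return false negatives but never false positives.

package nodefilter

import (
    "hash/fnv"
    "sync"
    "time"
)

// lossySeen is a fixed-size lossy hash table ("inverse bloom filter") that
// stores the last visit time per slot. Memory is constant; collisions simply
// overwrite older entries, producing false negatives (revisits) only.
type lossySeen struct {
    mu    sync.Mutex
    slots []lossyEntry
    ttl   time.Duration
}

type lossyEntry struct {
    addr     string // "ip:port", empty if the slot is unused
    lastSeen time.Time
}

func newLossySeen(size int, ttl time.Duration) *lossySeen {
    return &lossySeen{slots: make([]lossyEntry, size), ttl: ttl}
}

// shouldVisit reports whether addr should be forwarded, and records the visit.
func (s *lossySeen) shouldVisit(addr string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    h := fnv.New64a()
    h.Write([]byte(addr))
    e := &s.slots[h.Sum64()%uint64(len(s.slots))]
    if e.addr == addr && time.Since(e.lastSeen) < s.ttl {
        return false // definitely visited recently
    }
    // Empty slot, expired entry, or a different node collided here:
    // overwrite and allow the visit.
    e.addr, e.lastSeen = addr, time.Now()
    return true
}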

abitofevrything avatar Jun 24 '25 11:06 abitofevrything

If you'd like to help debug, you can run https://github.com/abitofevrything/bitmagnet/tree/temp/track-new-node-counts for a while and let us know what the ratio tends to over time :)

abitofevrything avatar Jun 24 '25 11:06 abitofevrything

Thanks, you're right that we shouldn't be using the ktable to keep track of seen nodes for the crawler. There's a fairly simple fix: I'd probably use a stable bloom filter like the one already used for tracking seen hashes. This maintains constant memory usage and has the desirable property that nodes eventually fall out of it, so they aren't "banned" forever. This isn't a big job, but I'm in the middle of a refactor, so I suggest coming back to this later.
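Roughly along these lines, purely as an illustration - this assumes the stable bloom filter from github.com/tylertreat/BoomFilters, which may or may not be the same filter already used for seen hashes, and the parameters are placeholders:

package nodefilter

import (
    boom "github.com/tylertreat/BoomFilters"
)

// nodeFilter drops nodes that are probably already in the filter. A stable
// bloom filter continuously evicts stale state, so memory stays constant and
// nodes that haven't been seen for a while eventually test negative again.
type nodeFilter struct {
    sbf *boom.StableBloomFilter
}

func newNodeFilter() *nodeFilter {
    // ~1M cells with a 1% target false-positive rate; numbers are placeholders.
    return &nodeFilter{sbf: boom.NewDefaultStableBloomFilter(1_000_000, 0.01)}
}

// shouldForward returns true for (probably) unseen nodes and records them.
// A small fraction of genuinely new nodes will test positive and be skipped,
// which is the trade-off discussed above. Note: the filter isn't safe for
// concurrent use, so the crawler would need a mutex around it.
func (f *nodeFilter) shouldForward(addr string) bool {
    key := []byte(addr)
    if f.sbf.Test(key) {
        return false
    }
    f.sbf.Add(key)
    return true
}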

mgdigital avatar Jun 24 '25 17:06 mgdigital

The issue with a bloom filter is that we risk skipping nodes we haven't processed before. I'd rather err on the other side, potentially revisiting nodes but not skipping any nodes we shouldn't.

I have an implementation using LRU caches to achieve this; I'm just not getting the drastic improvements in torrent counts I saw before, so I'm not certain whether to PR it yet.

[Graph: discovered torrent count over time, comparing the LRU cache skipping with the current implementation]

You can see that the count does have an upward trend with the LRU cache skipping, which does end up pushing it to higher numbers than the current implementation, but it's not quite as drastic as it was previously.
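For reference, the skipping logic on my branch is along these lines (a simplified sketch assuming hashicorp/golang-lru/v2; the actual code may differ):

package nodefilter

import (
    "time"

    lru "github.com/hashicorp/golang-lru/v2"
)

// lruSeen keeps a bounded window of recently forwarded node addresses with
// the time they were last forwarded. Memory is capped by the LRU size, and
// entries older than ttl count as unseen, so nodes become eligible again.
type lruSeen struct {
    cache *lru.Cache[string, time.Time]
    ttl   time.Duration
}

func newLRUSeen(size int, ttl time.Duration) (*lruSeen, error) {
    c, err := lru.New[string, time.Time](size)
    if err != nil {
        return nil, err
    }
    return &lruSeen{cache: c, ttl: ttl}, nil
}

// shouldForward reports whether addr should go to the pipeline and records
// the visit. Unlike a bloom filter this never skips a node it hasn't actually
// recorded, at the cost of occasionally revisiting nodes evicted from the LRU.
func (l *lruSeen) shouldForward(addr string) bool {
    if last, ok := l.cache.Get(addr); ok && time.Since(last) < l.ttl {
        return false
    }
    l.cache.Add(addr, time.Now())
    return true
}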

abitofevrything avatar Jun 24 '25 17:06 abitofevrything

We risk skipping nodes we haven't processed before

The risk is so small as to be negligible, and they will likely get a second chance anyway.

mgdigital avatar Jun 24 '25 17:06 mgdigital

If you'd like to help debug, you can run https://github.com/abitofevrything/bitmagnet/tree/temp/track-new-node-counts for a while and let us know what the ratio tends to over time :)

20 hours of running:

$ docker ps -f id=830084e858d7
CONTAINER ID   IMAGE             COMMAND                  CREATED        STATUS        PORTS                                         NAMES
830084e858d7   bitmagnet:local   "bitmagnet worker ru…"   20 hours ago   Up 20 hours   0.0.0.0:3333->3333/tcp, [::]:3333->3333/tcp   bitmagnet
$ docker container logs bitmagnet  | sed -n '0~100000p'
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 60499/99705 (0.607)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 111810/199685 (0.560)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 159050/299685 (0.531)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 201949/399685 (0.505)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 241843/499685 (0.484)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 279485/599685 (0.466)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 315078/699685 (0.450)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 348821/799685 (0.436)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 380937/899685 (0.423)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 412124/999685 (0.412)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 441481/1099685 (0.401)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 470315/1199685 (0.392)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 498017/1299684 (0.383)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 525071/1399684 (0.375)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 551267/1499684 (0.368)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 576852/1599684 (0.361)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 602174/1699684 (0.354)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 627719/1799684 (0.349)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 653702/1899684 (0.344)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 679600/1999684 (0.340)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 705139/2099684 (0.336)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 730561/2199684 (0.332)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 755026/2299684 (0.328)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 779717/2399684 (0.325)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 804248/2499684 (0.322)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 828699/2599684 (0.319)
INFO	dht_crawler	dhtcrawler/discovered_nodes.go:65	New node ratio: 852823/2699684 (0.316)

rraymondgh avatar Jun 25 '25 14:06 rraymondgh