
Enhancement: Decentralized index?

ProphetZarquon opened this issue 6 years ago • 15 comments

Could it be feasible for magnetico* to split its index into multiple buckets & then share half those buckets with each neighboring magnetico* instance?

If I'm running 3 crawlers on one LAN, it'd be nice if they could distribute their indexes between each other, so no one crawler has to retain all the results!

This is essentially what the BitTorrent mainline Distributed Hash Table does with peer IDs, right? Keeps a table of other known peers responding to a given hash, & shares part of that table with other peers, so that each peer needs only part of that table from each neighboring peer?

I'm proposing that the magnetico* db itself could be implemented as a second DHT, indexing hashes-->filemetadata rather than just the old DHT's peers-->hashes index.

Obviously legacy clients without magneticod's crawling ability wouldn't share this additional type of table, but how hard would it be to have magnetico* instances share a sort of secondary DHT index layer with each other?

How big can the index get, anyway? (Big enough to be worth splitting up the work, I might wager!) Any feedback on this concept is welcome, as this project is still entirely new to me & it seems very interesting!
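To make the bucket idea concrete, here is a minimal sketch in Python: each infohash maps to a bucket by its leading bits, and each crawler instance keeps only its share of the buckets. The bucket count, function names, and round-robin assignment are all hypothetical illustrations, not anything magnetico implements.

```python
import hashlib

NUM_BUCKETS = 16  # illustrative: split the infohash keyspace by its top 4 bits

def bucket_for(infohash_hex: str) -> int:
    """Map a 40-char hex infohash to a bucket by its first hex digit."""
    return int(infohash_hex[0], 16)

def buckets_for_instance(instance_id: int, num_instances: int) -> set:
    """Assign buckets round-robin so each crawler keeps ~1/N of the index."""
    return {b for b in range(NUM_BUCKETS) if b % num_instances == instance_id}

# Example: 3 crawlers on one LAN, each retaining roughly a third of the keyspace.
example_hash = hashlib.sha1(b"some torrent metadata").hexdigest()
responsible = [i for i in range(3)
               if bucket_for(example_hash) in buckets_for_instance(i, 3)]
```

Exactly one instance ends up responsible for any given hash, and together the instances cover the whole keyspace.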

ProphetZarquon avatar Jan 10 '18 10:01 ProphetZarquon

Maybe advertise the SQL DB through the DHT as a torrent with a standard name, so that when it is discovered it could be downloaded and merged into your local copy?
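Such a merge could be sketched as follows, assuming a simplified single-table stand-in for magnetico's actual schema; `INSERT OR IGNORE` keyed on the infohash deduplicates rows that both databases already contain.

```python
import sqlite3

def merge_remote_db(local_path: str, remote_path: str) -> None:
    """Merge rows from a downloaded peer database into the local one,
    skipping infohashes we already have. The single-table schema here is
    a simplified stand-in for magnetico's real tables."""
    con = sqlite3.connect(local_path)
    con.execute("ATTACH DATABASE ? AS remote", (remote_path,))
    con.execute(
        "INSERT OR IGNORE INTO torrents (info_hash, name) "
        "SELECT info_hash, name FROM remote.torrents"
    )
    con.commit()
    con.close()
```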

donaldsteele avatar Jan 17 '18 17:01 donaldsteele

That sounds at least feasible. Couldn't the db be split in some sensible fashion, so that each crawler doesn't have to retain the entire thing? Seems like having various magneticod instances query each other should produce better results than any crawler can achieve on its own. What about sorting the db contents by each .torrent filename & creating torrents of each fractional db for filenames A, B, C, & so on? (There's probably a better way to split them for quick results... Hash is probably better than filename.) If the db pieces can be stored in some consistent way, a given .torrent of filenames/hashes could be retained & shared by many crawler instances, reducing duplication of effort. Surely there's some logical way all these crawlers could be answering each other's queries in a dynamic fashion?

ProphetZarquon avatar Jan 19 '18 08:01 ProphetZarquon

It could be a good feature that covers these use cases:

  • A new magnetico host has been started: instead of starting from a blank torrent database, it could download torrents from a few other magnetico instances
  • A magnetico instance has aggregated a lot of torrents, and with that comes the risk that the instance shuts down, losing all the torrents. Being able to download its torrents from time to time would limit the loss.
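Both use cases suggest delta exports: a new or backing-up instance asks only for torrents discovered after a given timestamp rather than the full database. A rough sketch, with a simplified stand-in schema (the column names are hypothetical):

```python
import sqlite3

def export_since(db_path: str, since: int) -> list:
    """Return torrents discovered after `since` (a Unix timestamp), so a
    peer instance can fetch only the delta instead of the full database.
    The schema is a simplified stand-in for magnetico's actual tables."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT info_hash, name, discovered_on FROM torrents "
        "WHERE discovered_on > ? ORDER BY discovered_on",
        (since,),
    ).fetchall()
    con.close()
    return rows
```

A fresh instance would call this with `since=0` to bootstrap, then periodically with its last sync time.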

alcalyn avatar Apr 16 '18 11:04 alcalyn

Maybe it's silly, but I think it would be great if the database of .torrents could be exported to a .torrent containing scraped .torrent files & an index of the files within them. Ideally, I think it'd be a Merkle hash torrent, not some 2MB-per-piece monster.

ProphetZarquon avatar Jul 30 '18 12:07 ProphetZarquon

I'm not sure how a torrent could contain some dynamic content.

I'm thinking about, like Peertube can do, follow another instance of magnetico, and let both instances synchronize themselves. In the case of magnetico, torrents discovered by an instance could be forwarded to following instances.

alcalyn avatar Jul 30 '18 13:07 alcalyn

@ProphetZarquon one other thing you could do is share the database through nfs or cifs and just point all 3 instances to that 1 db location.

donaldsteele avatar Aug 13 '18 14:08 donaldsteele

@donaldsteele , see https://stackoverflow.com/questions/9907429/locking-sqlite-file-on-nfs-filesystem-possible#9962003 about using SQLite3 over NFS.

ad-m avatar Aug 13 '18 18:08 ad-m

@alcalyn I'm not sure dynamic content would be necessary: Torrents do have a relatively short half-life, but not that short. I think a series of list-torrents appending the previous list(s) could include new listings & listing deprecations as well.

Since the keyspace is rarely if ever reused (& reuse would cause collisions anyway) I don't think there's any reason to de-list a torrent unless it is "nuked" (deprecated by the initial seeder in favor of a new torrent), or the indexer is running out of storage space; which is one of the things a decentralized index could help alleviate.

Certainly a means of "following" a given indexer to acquire the new list-torrents would be necessary, but there are many ways to implement that; existing torrent clients already follow each other by client-ID anyway... Perhaps a signed listing would enable following across client-ID changes?

At any rate, I don't think mutable torrent contents are actually necessary for the function of sharing indexes from one indexer instance to another.

That said, I'm also not sure such a torrent-based transaction is necessary, given the implications of BEP 51.

ProphetZarquon avatar Sep 19 '18 14:09 ProphetZarquon

I'm interested to hear @boramalper 's take on this, especially considering that BEP 51 functionality is on the list of planned improvements for magnetico.

Bora, do you think a distributed index shared among magnetico instances could be practical or desirable? It seems like a great way to reduce redundant trawling work, to me. (I don't want to add it as a suggested feature unless someone has a reasonable suggestion for how to implement it.)

ProphetZarquon avatar Jan 16 '19 22:01 ProphetZarquon

Provisos:

  1. I'm still learning my way around magnetico
  2. I'm well aware you can't solve everything with a blockchain

BUT (feel free to eviscerate this idea)

What if collected hashes were written to a blockchain? It seems to me that a good deal of time is spent just discovering the hashes, before the .torrent metadata is ever fetched. If hashes were stored on a blockchain, their record would be right there for the taking, at least accelerating that step. It might also be possible or desirable to store hashes plus metadata.

Trust is an issue

I haven't come up with a concrete way of fighting spam/flooding, which would seem like the most obvious flaw here.

Well, you could add an economy to create trust, but how would that work?

I've thought of an upvote/downvote system based on hashes, but all this gets pretty complex, and you run into nasty problems like proving unique users, and so on. User reputation would probably matter (again, for spam prevention).

Watching the terminal output of magneticod scroll by, it really strikes me just how massive a repository of human knowledge is out there in the form of torrents. In my view, anything that helps people access this knowledge (like magnetico!) is a good thing.

How could it work?

  • The blockchain stores hashes, and possibly metadata (metadata storage would speed searches but cause bloat and may pose issues to those running validators)
  • Trawlers submit unique content to the network. Validators validate that it's not just some empty string. They (or some percentage of them) all need to agree that it's not just an empty string.
  • User up/down votes inform users of content, and could pay out to the trawlers who added the content. Really it's not about payment though, it's about signaling quality vs crap.
  • Votes are somehow limited to prevent bot abuse. (See: Steemit. No, it wouldn't be good to use exactly their method, but it's a good starting ground for ideas, and of course any method will be flawed in some way.)
  • Those whose content is upvoted most, get stronger votes than others, because they've been proven to support quality.
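A toy model of the voting scheme sketched above, purely to make the mechanics concrete; the starting weight of 1.0 and the 0.1 reputation step are arbitrary illustrative choices, not a worked-out design.

```python
from collections import defaultdict

class VoteLedger:
    """Toy model of the reputation idea: a voter's weight grows with the
    net upvotes their own submissions receive. All constants are
    illustrative placeholders."""

    def __init__(self):
        self.submitter = {}                          # infohash -> submitter
        self.score = defaultdict(float)              # infohash -> weighted score
        self.reputation = defaultdict(lambda: 1.0)   # user -> vote weight

    def submit(self, infohash, user):
        self.submitter[infohash] = user

    def vote(self, infohash, voter, up=True):
        delta = self.reputation[voter] * (1.0 if up else -1.0)
        self.score[infohash] += delta
        owner = self.submitter.get(infohash)
        if owner is not None and owner != voter:
            # reward (or penalize) the submitter's reputation, never below zero
            self.reputation[owner] = max(0.0, self.reputation[owner] + 0.1 * delta)
```

Users whose submissions accumulate upvotes gain heavier votes, which is the "proven to support quality" mechanism; the open problems (unique users, bot resistance) are untouched here.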

Once all existing hashes are added to the chain, this is no longer an automated process for trawlers.

Creators of new content add their hash, and others can vote on it.

This can all be streamlined by taking advantage of webtorrent and markdown.

Markdown pages can contain the hashes of photos, videos, music, etc.: all the content that makes up a multimedia experience. These markdown pages can of course themselves be content identified by hashes that can be voted on.

I think that the "storage" blockchain part can be pretty minimal. Actually it's just the anti-bot and anti-spam stuff that makes it challenging (and at that point it is indeed very formidable.)

Please criticize all of this brutally, it's the only way it'll ever improve.

I'm open to any and all ideas, except seignorage/premine/founders reward/ICO.

If you need seignorage/premine/founders reward/ICO there's more than enough BTT to go around.

faddat avatar Feb 06 '19 00:02 faddat

As an alternative to building trust models for trusting other nodes' richer data, I think a simpler model is to collect only the hashes, using a dedicated network between trawlers. Your node is then required to query each hash in the actual DHT: if metadata can be retrieved, you have verified the hash. While this does not share measured info like popularity, it's as minimal in trust as you can go.

This will still give you considerable warm-up time for a new node. However, it would mean all trawlers essentially collaborate on discovering which hashes are out there. And your warm-up time would be more focused on querying the DHT for 100,000s of unverified hashes and filling up your search indexes, rather than saturating your network with your trawling whiskers out just to learn those hashes exist.

Aside from trust being almost a non-issue here, I believe another benefit is that this combats stale data and makes submitting new hashes easier. Nobody is required to trust you, so if you have new hashes, just share them and make sure a DHT query will work. As for stale data, if a hash is in such a zombie state that a DHT query does not work, eventually no node will ever verify it and it'll stop being shared, whereas on a blockchain it would persist forever.

This may not be what you want for archiving purposes, but for an end user looking to download something, it would be great if a DHT query for the magnet you're offering can actually be resolved. So I believe for that use case it's sufficient.

One possible technology for creating a trawler network would be IPFS / libp2p. Using its pub/sub to share Merkle tree root hashes with participants, plus some clever schemes to group data, you should be able to keep up with other nodes using relatively low bandwidth.
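The verify-before-index rule at the heart of this proposal is simple to express. In this sketch, `fetch_metadata` is a hypothetical stand-in for a real BEP 9 metadata exchange against the mainline DHT:

```python
def ingest_hashes(candidates, fetch_metadata):
    """Accept infohashes gossiped by other trawlers, but index one only if
    its metadata can actually be fetched from the real DHT. Unverifiable
    (stale or spam) hashes are simply dropped, so nothing needs to be
    trusted and zombie entries stop propagating on their own."""
    index = {}
    for infohash in candidates:
        meta = fetch_metadata(infohash)  # returns None if unresolvable
        if meta is not None:
            index[infohash] = meta       # verified: safe to index and re-share
    return index
```

Spam costs the spammer a live DHT presence, and stale hashes fail verification everywhere and age out of the gossip network.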

Beanow avatar Mar 03 '19 13:03 Beanow

I think a simpler model is to collect the hashes only using a dedicated network between trawlers.

Exactly this! I feel certain that a secondary magnetico DHT network containing hashes-->filemetadata (complementary to the existing DHT of peers-->hashes) could dramatically reduce duplicated effort & improve database resiliency, leveraging the power of distributed tables with minimal trust requirements.

I hope somebody implements this, especially given all the recent interest in running concurrent instances, database abstraction, & import/export! Wouldn't a magnetico* DHT network fulfill an awful lot of prospective use cases? @Glandos any thoughts?

ProphetZarquon avatar Sep 18 '19 06:09 ProphetZarquon

In my humble opinion this is a great avenue for a not-yet core contributor to explore. See if you can:

  • Set up a dedicated network for hashes (perhaps using libp2p)
  • Hack in some IPC call to send hashes to magnetico

And submit that as a proof of concept?
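As a sketch of the second bullet, the IPC handoff could be as simple as newline-delimited JSON messages over a local socket or pipe. The framing and field names here are entirely hypothetical, not an existing magnetico interface:

```python
import json

def encode_hash_msg(infohash: str, source: str) -> bytes:
    """Frame a discovered infohash as one JSON line, suitable for writing
    to a Unix socket or pipe read by a local magnetico process."""
    if len(infohash) != 40 or any(c not in "0123456789abcdef" for c in infohash):
        raise ValueError("expected a 40-char lowercase hex infohash")
    return (json.dumps({"info_hash": infohash, "source": source}) + "\n").encode()

def decode_hash_msg(line: bytes) -> dict:
    """Parse one framed message back into a dict on the receiving side."""
    return json.loads(line.decode())
```

Line-delimited JSON keeps the proof of concept debuggable with nothing more than `nc` and a text editor; a real implementation might prefer a binary framing.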

Beanow avatar Sep 18 '19 10:09 Beanow

I agree that this would be a great jumping-off point for a non-core contributor; unfortunately I can't even run magnetico* right now, because 1) I am on Android 99% of the time, with one networkless Windows PC doing media-playback duty several days a month, & 2) I have no reliable internet access, over-using a T-Mobile cellular plan on a tower so overloaded that all requests time out for 4-to-10-hour periods on half of all weekdays.

My interest is largely academic at this point, as there are no plans to bring affordable broadband (less than $245/mo) to my area in central Denver, Colorado. Since my only WAN-connected device is the Android tablet (/phone) I'm currently replying from, I am unlikely to be running magnetico* again any time soon. (I do torrent via cellular on the tablet, since my connection is not typically stable enough for streaming; sadly, no one's client needs my paltry upload speeds, so my torrent usage is essentially all leech. It sucks & it makes me sad.)

I do have a few friends who are proficient coders (which I am not) but they don't torrent due to unfamiliarity. If I can get better internet, or if I can interest one of those friends in taking up this project I will certainly do so, but until some personal miracle occurs for me, anyone capable of DHT-related coding should probably make their own attempt at kludging together a distributed trawler.

I am just a lowly technician wishing such a thing existed. The code I've found makes sense to me theoretically, but I have no experience with anything newer than C++. I've managed to get a magnetico* instance installed a few times, but never accumulated a DB; it looks like perhaps I'd actually been trying a broken build but I wouldn't really know.

Sorry I'm not more help. I know asking for things that I wouldn't know how to do myself is quite presumptuous; I felt it was worth asking since I've been expecting such a thing for over a decade, & the closest I've seen is this project and Tribler (which uses a non-standard expanded DHT lacking the content found in the mainline BitTorrent-compatible public swarms, but does accomplish a searchable distributed index of the comparatively paltry content it has). It seems clear to me that a secondary hashes-->filemetadata DHT is possible, & I think it's even a worthwhile approach; unfortunately I have few of the necessary tools to work on such an implementation myself. :/

ProphetZarquon avatar Sep 21 '19 00:09 ProphetZarquon

To be as clear as possible here, I don't know that an extensive trust system (blockchain/DAG) is strictly necessary for decentralizing magneticod's index as it stands now... My suggestion remains potentially simple by comparison: Store magneticod's DB as a secondary DHT among each magneticod instance. (Rating, tagging, & otherwise managing each torrent can be a task for another tool. For the life of me I don't see a DHT poisoning challenge in this that isn't already faced & surmounted by the existing peers-->hashes DHT...) Sadly I can't be the one to develop this as I'm rarely internet connected by anything but one humble Android device. Can anyone else offer insight as to the DHT coding that would be required for a secondary hashes-->filemetadata distributed database?

ProphetZarquon avatar Mar 29 '20 09:03 ProphetZarquon