
scaling to indefinite and changing groups of files

Open okdistribute opened this issue 6 years ago • 49 comments

Hey all,

I've been in conversations with Internet Archive to integrate Dat into their dweb portal. One of their requirements is to serve data from Dat over the peer-to-peer network without having to scan all of the files ahead of time (upwards of 50 petabytes of archives...). Opening this on behalf of @mitra42 who is project lead on the Archive's dweb portal.

The way they have approached this with other protocols such as IPFS, Gun, and Webtorrent is to have a peer that pretends that it contains everything, and then once an archive is requested, to go fetch it from Internet Archive storage and serve it on the network.

For example, they want to be able to handle requests for the string /foo/bar/baz.mp3 over the network. Quoting @mitra42:

Looking at that doc [How dat works by vtduncan], I'm struggling a bit with the route from a path, dat://12334/foo/bar/baz.mp3, to the data.

It seems, if I read it correctly, that this requires all the files to have previously been put into the metadata "file", so it doesn't scale to indefinite and changing groups of files, i.e. there doesn't appear to be a point (unlike in GUN, for example) where the string /foo/bar/baz.mp3 is sent from one peer to the other, for that peer to then figure out what to send back.

Wondering if this is a use case that this WG is interested in supporting directly in the protocol (e.g., file-aware transfers vs. only block transfers)?

Thanks,

okdistribute avatar Jan 31 '19 23:01 okdistribute

Well let's see...

The wire protocol exchanges hypercore blocks, not files. As a result, peers don't explain what files they're trying to fetch when they communicate with each other. They only explain which data-chunks they want. That doesn't help.

One option might be to update the hyperdrive data structure so that it can have "placeholder" entries. These would be empty files with an "is_placeholder" flag set. A wire-protocol extension message could ask the owning peer to "hydrate" the file, in which case it would overwrite the placeholder entry with a real entry. Then replication could happen as usual.
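For concreteness, a rough sketch of that flow (the entry shape, the message format, and the helper functions are hypothetical stand-ins, not existing hyperdrive APIs):

```ts
// Hypothetical shapes -- hyperdrive has no is_placeholder flag today.
interface DriveEntry {
  path: string
  size: number
  is_placeholder?: boolean
}

interface HydrateRequest {
  type: 'hydrate'
  path: string // e.g. '/foo/bar/baz.mp3'
}

// Reading peer: if the entry is a placeholder, ask the owning peer to fill it in.
function maybeRequestHydration (entry: DriveEntry, send: (msg: HydrateRequest) => void): void {
  if (entry.is_placeholder) {
    send({ type: 'hydrate', path: entry.path })
  }
}

// Owning peer: overwrite the placeholder with a real entry, then let
// normal block replication take over.
async function onHydrateRequest (
  msg: HydrateRequest,
  readFromBackingStore: (path: string) => Promise<Uint8Array>,
  writeRealEntry: (path: string, data: Uint8Array) => Promise<void>
): Promise<void> {
  const data = await readFromBackingStore(msg.path)
  await writeRealEntry(msg.path, data) // replaces the placeholder entry
}
```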

Another option might be to leverage one-way mounts, once they're implemented. In that scenario, the owning peer would be sharding their file-set into a tree of dats. They would optimistically create and share the dats, but not populate them -- when a peer first connects, they would then populate the dats.

Let me step through the one-way mount solution. Let's say we have the following file tree:

/alice/fun/stuff
/alice/work/emails
/alice/work/backups
/bob/fun/games
/bob/fun/pics
/bob/work/
/carla/

The top level has three folders:

/alice
/bob
/carla

At initialization, a top-level dat would be created, and then 3 more dats: one for each folder. They would remain empty. The top-level dat would mount each of them.

# toplevel dat:
/alice -> dat://alice
/bob -> dat://bob
/carla -> dat://carla

# dat://alice
(empty)

# dat://bob
(empty)

# dat://carla
(empty)

All 4 dats are swarmed. Let's suppose a reading client wants to pull down the 'bob' folder. It would request a block from the 'bob' hyperdrive. When this happens, the owning peer would immediately mint 2 more dats, the 'bob-fun' dat and the 'bob-work' dat, and mount them to the bob dat:

# dat://bob
/fun -> dat://bob-fun
/work -> dat://bob-work

# dat://bob-fun
(empty)

# dat://bob-work
(empty)

These 2 new dats would then also be swarmed. This continues recursively. Basically, we're using entire dats as placeholders, and using the request of any block as a trigger to "hydrate" it.
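A minimal sketch of that trigger, assuming hypothetical createDat / mount / swarm / listSubfolders / onFirstBlockRequest helpers (one-way mounts aren't implemented yet, so there's no real API to show):

```ts
// Sketch only: dats as placeholders, hydrated one level at a time.
interface Dat { key: string }

async function hydrateFolder (
  folderPath: string, // e.g. '/bob'
  parent: Dat,
  deps: {
    listSubfolders: (path: string) => Promise<string[]>
    createDat: () => Promise<Dat>
    mount: (parent: Dat, name: string, child: Dat) => Promise<void>
    swarm: (dat: Dat) => void
    onFirstBlockRequest: (dat: Dat, cb: () => void) => void
  }
): Promise<void> {
  // On the first block request against this folder's dat: mint one empty
  // dat per subfolder, mount it, swarm it, and arm the same trigger one
  // level down.
  for (const name of await deps.listSubfolders(folderPath)) {
    const child = await deps.createDat() // stays empty for now
    await deps.mount(parent, name, child)
    deps.swarm(child)
    deps.onFirstBlockRequest(child, () =>
      hydrateFolder(`${folderPath}/${name}`, child, deps)
    )
  }
}
```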

The good news is that one-way mounts will not require additional hits to the DHT. The replication connection for the top-level dat can multiplex all of the child mounts, making this design relatively efficient over the network. It also would not require any unplanned changes to the protocol.

pfrazee avatar Feb 01 '19 00:02 pfrazee

This is now theory land: another way to go about this would be more "low-level". Every DAT consists of 2 hypercores: metadata & content. What if it were to contain X hypercores, one per folder and one per file? The special hypercore dir/ would contain a list of operations.

For a file, it might contain an entry

{ "op": "add", "file": "Readme.md", "time":12312313, "hash": "YI8Qx5b/Tpbu5Hiw72RoSduT3qL3ZgYWZEMx5QwTPZ4=", "type": "file" }

... which means that for the latest version of Readme.md you would need to ask the server for the hypercore file/YI8Qx5b/Tpbu5Hiw72RoSduT3qL3ZgYWZEMx5QwTPZ4= to receive the file.

For a folder, it might contain an entry:

{ "op": "add", "file": "css", "time":12312313, "hash": "85FydCC60+APHJsgWHJBOISXnITmtx9dwRUC4wi+pak=", "type": "dir" }

... which means that for the history of this folder you would need to ask the server for the hypercore dir/85FydCC60+APHJsgWHJBOISXnITmtx9dwRUC4wi+pak=, which would contain the version history for this folder.

The server would need to look up all files in a folder and just generate the hashes, and upon request of a hypercore it would start creating that hypercore (like a stream placeholder).
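To make the lookup concrete, here is a sketch of how a reader might resolve a path like /css/style.css against that layout (the op shape follows the JSON entries above; openCore and readOps are hypothetical helpers):

```ts
// Operations stored in a dir/ hypercore, matching the JSON entries above.
interface Op {
  op: 'add' | 'del'
  file: string
  time: number
  hash: string
  type: 'file' | 'dir'
}

async function resolve (
  path: string, // e.g. '/css/style.css'
  openCore: (name: string) => Promise<{ readOps: () => Promise<Op[]> }>
): Promise<string> {
  let coreName = 'dir/' // the special root directory core
  const parts = path.split('/').filter(Boolean)
  for (const part of parts) {
    const ops = await (await openCore(coreName)).readOps()
    // Latest op for this name wins: an 'add' resolves, a 'del' means gone.
    const latest = [...ops].reverse().find(op => op.file === part)
    if (!latest || latest.op !== 'add') throw new Error(`not found: ${part}`)
    coreName = `${latest.type}/${latest.hash}`
  }
  return coreName // e.g. 'file/<hash>' -- ask the server for this hypercore
}
```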

The crux of the issue, and the reason why the web archive can not really do what they mean to, is that you can't quite know how big the entire archive is (something that is prominently displayed in DAT right now).

martinheidegger avatar Feb 01 '19 00:02 martinheidegger

@martinheidegger Correct me if I'm wrong, but I think, for the most part, that's how my one-way-mounts solution works.

pfrazee avatar Feb 01 '19 01:02 pfrazee

@pfrazee It is similar, but on a lower level. Hyperdrive → Hypercore. Also it uses hashes instead of direct file lookups for deletions.

martinheidegger avatar Feb 01 '19 01:02 martinheidegger

@martinheidegger What's the upside of introducing a different hypercore datastructure than hyperdrive? If you stick with hyperdrive, you avoid having to create a niche solution.

pfrazee avatar Feb 01 '19 01:02 pfrazee

Codename for this thing: hyperfs. If you build hyperfs, the way to look up files will be vastly different from how it currently works in hyperdrive (pre multi-writer); you need to add some structure on top of hyperdrive to do that. It seems to me that having the hyperdrive structure (and overhead) doesn't give any benefit here: what use is the metadata core if it's always the same structure? (One node in metadata declares whether it's a file or directory, with the content stream only in use when it's a file.) Since the linked DATs could theoretically be owned by different owners, we need to create separate replication streams for all those DATs. If they are instead "subcores", we can assume (guarantee?) that they have been generated with the same write key.

martinheidegger avatar Feb 01 '19 01:02 martinheidegger

Thanks for the thinking ... I'm hoping we can use this exercise as an example of how DAT might work with a large and changing site like the Archive. IMHO (and from prior experience with how the Web built on legacy Gopher and FTP data), how well a protocol gets adopted depends in part on how well it integrates with legacy data systems, and it's great to be able to use the Archive to test out some of these ideas.

I could be wrong, but I don't think either of those scenarios works for large data sets with an unknown (and changeable) set of items. For example, the obvious structure to me would be dat://<ArchivePubKey>/<ItemID>/<FileName>. In @martinheidegger's suggestion - one DAT at the top level pointing at the others, plus one dat for each directory (what we call an <ItemID>) - this will fail because: i) files are added all the time (approx once per second); ii) there would be roughly 50 million item ids to keep track of; and iii) practically, we can't run any process ahead of time over all the ids.

Of course, we could do this in a different way - similar to how we work around IPFS's inherent weaknesses - where we create a DAT for each item the first time a user requests it, and then share the address of the DAT out of band (via GUN or HTTP). But with that solution, Wort, GUN or WebTorrent are always going to work better than DAT.

mitra42 avatar Feb 01 '19 01:02 mitra42

@martinheidegger You're free to build any data structure you want on hypercore, but you're going to have to get that structure adopted widely if you want it to solve the problem.

@mitra42 Sharding the dataset into multiple mounted dats is going to give you a couple benefits:

  1. You can do just-in-time construction of each folder's dat, and avoid constructing dats which have no demand
  2. You can retire & swap out a folder's dat if the folder's history ever gets too large (effectively garbage collecting the history)

Are you certain that sharding the dataset to one-dat-per-folder, and only updating folder-dats that have peers, is not enough for you to scale?

pfrazee avatar Feb 01 '19 01:02 pfrazee

I wonder how we might use something like kappa-db/multifeed for this 'hyperfs' implementation.. any thoughts @noffle?

okdistribute avatar Feb 01 '19 01:02 okdistribute

@pfrazee That issue persists with both a hyperdrive and a hypercore solution. Pulling it one abstraction level higher doesn't make it quicker to be used.

@mitra42 I am seconding @pfrazee here, but I have some issues comprehending all you wrote (some formatting issues there). The only bottleneck that I see with my example is that / can not be entirely cleared quickly (which is something worth considering adding). Since each directory/core only keeps track of its children, any change operation would only influence a disconnected part of the structure. Like IPFS, I used a "hash" to specify the content hypercore, but this could also "just" be a random key to behave more like DAT does today (more in the direction of the hyperdrive approach). The protocol of dat is built to be quicker and more efficient than HTTP for larger files (lower overhead).

martinheidegger avatar Feb 01 '19 01:02 martinheidegger

@pfrazee I disagree that it has to be adopted widely for the Archive's case, because most of the peers would be served from client-side WebRTC (watching the same video at the same time, for example)

okdistribute avatar Feb 01 '19 01:02 okdistribute

@karissa you're just losing wider ecosystem compatibility. If the situation really is niche, then you can do that, but I think you're losing some of the benefits of dat. In the case of cabal the custom data structure makes total sense, but in this case hyperdrive + the upcoming one-way-mounts feature has a good chance to solve the issue.

pfrazee avatar Feb 01 '19 01:02 pfrazee

Yeah, this is why I was sort of hoping it could be something at the hypercore level -- being able to expose the filename (if known, optionally) in the request

okdistribute avatar Feb 01 '19 01:02 okdistribute

or rather, any payload for extra request content needed

okdistribute avatar Feb 01 '19 01:02 okdistribute

@karissa You could probably do that via a wire-protocol extension message. It's got some extra overhead but if it was only turned on for dats doing this style of on-demand hydration, it's not super crazy. You'll just need the peers to support the extension.

pfrazee avatar Feb 01 '19 01:02 pfrazee

@pfrazee I don't think we can come up with a solution that doesn't require the entire ecosystem to update. I believe a case could be made that the next update (multiwriter, cough) could be that thing. But also, once more, these solutions are not backwards compatible! One important aspect of DAT is that we know its entire size at a given version. No matter whether it's mounts or different cores: we wouldn't be able to tell how big the data space is.

martinheidegger avatar Feb 01 '19 01:02 martinheidegger

@martinheidegger The one-way mounts are a planned update for the entire ecosystem, though.

pfrazee avatar Feb 01 '19 01:02 pfrazee

Planned doesn't mean done ;)

martinheidegger avatar Feb 01 '19 02:02 martinheidegger

@pfrazee asked "Are you certain that sharding the dataset to one-dat-per-folder, and only updating folder-dats that have peers, is not enough for you to scale?"

Sure - this can be done, but then you have the issue that you can't do it on 50m directories ahead of time, so you have to do this creation of one dat-per-folder between the time the user navigates to the page and the time the metadata is returned. Since you might have to build a DAT with a few GB of data (the raw version of the file), it's very slow on first access to an item, and in fact slows it down for all users, not just those who will use DAT to retrieve the files, so in practice it won't happen. (For IPFS we only put the actual files requested into IPFS because this step is so slow.) The Archive might be a good place to try an extension, and then roll it out to the ecosystem if, and only if, it works well - we have the advantage (during the experiment phase) of controlling both the client and the peer at the Archive.

@martinheidegger: You are definitely going to lose the "size of the entire DAT" for any large changing data set; accumulating size back up the chain isn't practical.

Updating the top-level "/" DAT is possible but much harder, since a) I can't crawl 50m items and b) even if I could, doing that update at item creation time requires hacking a large code-base with substantial (and reasonable) institutional resistance to change, while anything I can do on top of the existing data doesn't face that barrier and is more open to experimentation.

I truly believe that name lookup on large (in number of items) changing data sets is a crucial bottleneck; it's a massive weakness in IPFS, and a big advantage in GUN and Wort. Both GUN and Wort allow for "hijacking" during that name-resolution process, i.e. mapping a name from their ecosystem to an HTTP request. I think it would be great for DAT if it could think this through. For the Archive we can use GUN/Wort/HTTP to discover the DAT archive id, but that wouldn't be as useful for testing DAT on its own scaling.

mitra42 avatar Feb 01 '19 02:02 mitra42

@mitra42 If that's your read then I'd agree that you should try the wire-protocol extension.

pfrazee avatar Feb 01 '19 02:02 pfrazee

can’t do this on 50m directories ahead of time

That is not what we are suggesting here. In pfrazee's approach you'd only serve a DAT once you have it, but link it ahead of time. My approach would do a similar thing at the hypercore-protocol level (if the protocol asks for a key, a hook is called to create this item - or return 0 if it's not replicated/existing).

definitely going to lose the “size of the entire DAT”

This sounds to me like a more important issue than you might think. The size lets the client predict how much needs to be downloaded and gives an indication of how much would need to be shared. (You also gain some optimization options by preserving this.)

@mitra42 Asking a different question here: why one data set, and not split it up and serve a lot of smaller DATs? A DAT - particularly in the next version - can easily hold a few gigabytes of small files and perform well. Without breaking any infrastructure: if a central indexing DAT contains a list of DATs for sections of the web, then, when a link is clicked (in Beaker browser), the service could index the content of that section.

@martinheidegger The one-way mounts are a planned update for the entire ecosystem, though.

I haven't seen the DEP for this yet. What was the consensus on the "lose the size of the entire DAT" issue (which persists with mounts, from a user perspective)?


Thinking about a protocol-level approach here:

In the protocol there is no distinction between files and other data. The mirror-folder system indexes the file system and adds files one by one to the DAT. There is currently no metadata that tells the client whether a mirror-folder process is running on a folder (just as there is no information about file streaming). If this information were given, a reader could let the writer know "please prioritize this folder", and the writer could simply tell the mirror-folder process to prioritize that folder first. This seems like a hack of a system though - very fragile.
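For illustration only, the hint could be as small as this (a made-up payload; nothing like it exists in the protocol today):

```ts
// Hypothetical "please prioritize this folder" hint from reader to writer;
// the writer's mirror-folder process would bump this path to the front of its queue.
interface PrioritizeFolder {
  type: 'prioritize-folder'
  path: string // e.g. '/bob/fun'
}
```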

Note: I have other tasks to look after from now until tomorrow. Will answer/give ideas with delay.

martinheidegger avatar Feb 01 '19 02:02 martinheidegger

@martinheidegger - there are two possible approaches here: one DAT for the whole of the IA, or one DAT per item with a root-level DAT as an index. Both of them appear to have (solvable or unsolvable) scaling problems.

For the one-DAT-per-item approach, maybe I'm missing something - where do I find the link for a specific DAT, given that I can't run a process on 50m items to create the links for 50m (not yet created) DATs ahead of time in order to put them into the index? The scaling problem has just been pushed to a different (maybe better) point. If we could calculate the pointers for the DATs without processing the content (e.g. by a hash function on the name) then it would be easier, and wouldn't create a slowdown at the point where we are requesting Archive (not DAT) metadata, though it's still non-trivial to generate a list of all item ids, and even more non-trivial to map those DAT ids back to not-yet-created DAT archives.

If it's one DAT per item, then certainly we can get sizes on each DAT, but obviously not on the index, which will change size approximately once per second as items are created... How does DAT currently handle data sets which get appended to and so don't know the size ahead of time?

mitra42 avatar Feb 01 '19 03:02 mitra42

I second the idea of using a protocol extension, but I don't think you need to use mounts.

Think of it like this:

  • You have the entire internet archive on an FS of some sort
  • You start with an empty Dat representing it
  • Peers replicate the archive (sparse) and attempt to load the file they want
  • If they 404, send a message down the replication protocol using an extension to say "hey, add this file"
  • The archive peer will then add the file to the archive, and it'll become part of the swarm

This way you don't need to index ahead of time, and content is lazy-loaded into the archive. And once the file is loaded, it'll be spread across the P2P network as needed.
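Roughly, the two sides could look like this sketch (the extension transport and the storage/drive accessors are stand-ins, and the 'add-file' message name is made up here, not an existing part of hypercore-protocol):

```ts
// Hypothetical extension payload for "hey, add this file".
interface AddFileRequest {
  type: 'add-file'
  path: string // e.g. '/<ItemID>/<FileName>'
}

// Reading peer: on a 404 against the sparse archive, ask for the file.
async function readOr404<T> (
  readFile: (path: string) => Promise<T>,
  sendExtension: (msg: AddFileRequest) => void,
  path: string
): Promise<T | null> {
  try {
    return await readFile(path)
  } catch (err) {
    sendExtension({ type: 'add-file', path })
    return null // retry once the archive peer has added it
  }
}

// Archive peer: fetch the file from backing storage and write it into the
// drive, at which point normal replication spreads it across the swarm.
async function onAddFileRequest (
  msg: AddFileRequest,
  fetchFromArchiveStorage: (path: string) => Promise<Uint8Array>,
  writeFile: (path: string, data: Uint8Array) => Promise<void>
): Promise<void> {
  const data = await fetchFromArchiveStorage(msg.path)
  await writeFile(msg.path, data)
}
```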

Some concerns:

  • You need to hope that you're connected to the archive peer when finding a new file or this extension stuff won't help (Maybe we could work around that by broadcasting?)
  • This method assumes the files are static and changes are created in new files rather than extending existing ones. If the files change you'll need some sort of process to watch those changes
  • The internet archive will slowly be duplicated within the content feed. Might need to do some spooky magic at the hypercore level to have it load those sections of the feed from the actual FS on the internet archive peer.

I believe this method would be similar to what the Internet Archive is doing for other protocols and requires a minimal amount of change to the existing hyperdrive code.

RangerMauve avatar Feb 01 '19 14:02 RangerMauve

I think this is the right kind of approach.

A couple of detail points/questions.

Changes: wouldn't this work the same way DAT does now? I.e., if a change happens after a file is added, then presumably the same mechanism (and I don't know how DAT does this) to handle an update would be used.

Connectivity: could some variant of the mechanism (again, I don't know how DAT does this) that is used to find which peers have the metadata/file now be used to forward the query (with the file-name extension), with the peer connected to the file system eventually getting the request and loading it?

Http: instead of connecting to a file system, the peer should map the DAT path to a URL request, e.g. DAT/1234/<ITEMID>/<FILENAME> maps to http://dweb.me/arc/archive.org/download/<ITEMID>/<FILENAME>. (For a file system this would just be a file: URL, file:/something/something.)
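That mapping is trivial to sketch (gateway URL taken from above; purely illustrative):

```ts
// Map a requested dat path dat://<key>/<ItemID>/<FileName> to the
// Archive's HTTP gateway, so the peer can fetch the bytes on demand.
function datPathToUrl (itemId: string, fileName: string): string {
  return `http://dweb.me/arc/archive.org/download/${itemId}/${fileName}`
}
```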

Even better: note that this could (in some cases) also work at multiple peers not physically at the Archive, where any peer could fetch a missing file via HTTP. This is how WebTorrent does it now - there is an HTTP field in the magnet link that allows any peer to get missing files directly from the Archive and start seeding them itself. (This is wonderfully fast in WebTorrent, which is why it is currently our preferred decentralized video player.)

mitra42 avatar Feb 01 '19 22:02 mitra42

Note that for the anti-censorship case, you wouldn't want to rely on the HTTP fallback direct from each peer: ideally, if you are behind a censorwall, you want to be connected to some peer that is able to connect via HTTP.

mitra42 avatar Feb 01 '19 22:02 mitra42

Changes: It depends on how the raw data is stored. If possible you could hook into FS events, check if the file is in the metadata feed, and add the change.

Connectivity: Kind of. Basically, you discover peers for the archive and open up a hypercore-protocol stream with them to start replicating append-only logs called hypercores. The protocol is set up in such a way that you could pass custom messages in addition to the ones used for hypercores. That's where you could send the request out.

HTTP: Not sure what you mean, is the data not going to be stored directly in the archive?

Even Better: Yeah, I get what you mean. However, peers are unable to add data into a Dat archive, only the internet archive could add files to the archive due to the way security works in Dat. Now that I think of it, I don't think it'd be very easy because Dat doesn't use content addressing the way WebTorrent and IPFS do.

RangerMauve avatar Feb 02 '19 02:02 RangerMauve

Changes: Might have to do that later, probably from the other direction - having something periodically crawl the Archive comparing SHAs against those in the current metadata.

Http: Sure, the data is directly in the Archive, but for most really large sites I've seen (including the Archive's 50 petabytes), that isn't a file system, it's servers responding to HTTP.

Even better: yes - understood about the security.

mitra42 avatar Feb 02 '19 05:02 mitra42

i.e. if a change happens after a file is added then presumably the same mechanism (and I don't know how DAT does this) to handle an update would be used.

In the hypercore protocol you need to specify ranges of interest. If the range doesn't have a clearly defined end, the client will always receive the newest available data (append-only protocol).

where do I find the link for a specific DAT

Well, one thing that has been part of this discussion is "mounts":

Mounts merge dats together by causing one dat to act as a folder within another dat.

(source: datproject/planning)

This means that for IA you would have one DAT that contains the links to all other DATs, and you would add one for every folder you have. As soon as someone becomes a peer of a particular DAT (and asks for additional data), you could start/resume the mirroring of that particular DAT.

martinheidegger avatar Feb 04 '19 02:02 martinheidegger

Let's be precise... When you say "contains the links to all other DATs and you would add one for every folder you have", do you mean all the folders you have, or all the ones you make? This is important because, as said above, you simply can't create one file with links to all the possible (50 million and changing all the time) folders/DATs.

mitra42 avatar Feb 04 '19 03:02 mitra42

@mitra it is stackable. You could have a root set of 512 keys, each containing 512 more, each with 512 in them, which would mean 3 more look-ups to get to the actual folder.
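For concreteness: 512 × 512 × 512 = 134,217,728 keys, which comfortably covers ~50 million item ids at three levels of fan-out, so resolving an item costs three extra mount look-ups from the root DAT.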

martinheidegger avatar Feb 04 '19 05:02 martinheidegger