hyperdrive
slow to get file lists on archives with many files
On archives with a lot of files, hyperdrive can be very slow to perform many operations. For example, this program hangs for a long time before it prints the contents of meta.json:
var hyperdrive = require('hyperdrive')
var hyperdiscovery = require('hyperdiscovery')
var memdb = require('memdb')

var drive = hyperdrive(memdb())
// the (peermaps) archive key, ~220k files
var link = Buffer.from('04ed0b08ff595a992a594ad1ab624072646467ec7eda2dc40e4aa512e49cb196', 'hex')
// sparse + live: only download the blocks we actually ask for
var archive = drive.createArchive(link, { sparse: true, live: true })
// join the swarm so there are peers to download from
var sw = hyperdiscovery(archive)
// this hangs for a long time before any bytes arrive
archive.createFileReadStream('meta.json').pipe(process.stdout)
Running dat-ls on this archive is also very, very slow (roughly 30 minutes). Granted, 04ed0b08ff595a992a594ad1ab624072646467ec7eda2dc40e4aa512e49cb196 has over 220k files, but it would be very nice if dat could match the performance of ipfs here, where ipfs cat QmXJ8KkgKyjRxTrEDvmZWZMNGq1dk3t97AVhF1Xeov3kB4/meta.json and ipfs ls QmXJ8KkgKyjRxTrEDvmZWZMNGq1dk3t97AVhF1Xeov3kB4 finish pretty much immediately. Another nice thing about the ipfs approach is that you can walk the directory tree iteratively without pulling down the entire directory structure first. That would make browsing archives with many files nicer and should speed up applications that mount an entire file system with hundreds of thousands or millions of files.
For my own purposes, the archive 04ed0b08ff595a992a594ad1ab624072646467ec7eda2dc40e4aa512e49cb196 (peermaps) is a nested directory structure of gzipped o5m data and meta.json files, with a recursive routine to determine which directories to list. archive.list() is not well suited to this problem: all of the metadata gets synced before hyperdrive can resolve named files, and archive.list() returns every file in the archive, where something more akin to fs.readdir() would be a whole lot more useful for the peermaps use case (a sketch of that shape is below).
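For illustration, here's a rough sketch of the readdir-ish call I mean, built on archive.list() (assuming the callback form collects entries into an array and that each entry exposes its full '/'-separated path as entry.name); it still syncs and scans all of the metadata, so it only shows the desired API shape, not the performance:

var path = require('path')

// Rough sketch: list the immediate children of `dir` in `archive`.
// Assumes entry.name and dir are both '/'-separated relative paths.
function readdir (archive, dir, cb) {
  archive.list({ live: false }, function (err, entries) {
    if (err) return cb(err)
    var seen = {}
    entries.forEach(function (entry) {
      var rel = path.posix.relative(dir, entry.name)
      if (rel === '' || rel.slice(0, 2) === '..') return // not under dir
      seen[rel.split('/')[0]] = true // keep only the first path component
    })
    cb(null, Object.keys(seen))
  })
}

// e.g. readdir(archive, 'planet', console.log)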
As for how to implement this, I have fewer ideas. Maybe treat each directory more like its own archive? Store the contents as an updateable merkle tree? Something else?
Yeah, there's some history to this. I started work on this PR to make the list() API work more like you'd expect (just in terms of behavior, not performance). We ran into some challenges that all pointed toward the need for some internal rewrites. @mafintosh had a plan to fix that and also to create an internal index that stores the file listing for fast lookups -- right now it does a full linear scan, which is why your performance is subpar. His internal rewrites are, afaik, still in progress, just delayed by other tasks.
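Roughly, the internal index boils down to something like this (illustrative only, not the actual planned implementation; it assumes the callback form of archive.list() and an entry.name field): replay the listing into a map keyed by path once, and later name lookups become O(1) instead of a fresh linear scan each time.

// Illustrative only: build a path -> entry map so repeated lookups
// don't re-scan the whole listing. Still requires syncing the metadata once.
function buildIndex (archive, cb) {
  archive.list({ live: false }, function (err, entries) {
    if (err) return cb(err)
    var index = {}
    entries.forEach(function (entry) {
      index[entry.name] = entry // later entries win, like replaying a log
    })
    cb(null, function lookup (name) { return index[name] })
  })
}

// buildIndex(archive, function (err, lookup) { console.log(lookup('meta.json')) })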
"Another nice thing about the ipfs approach is that you can walk the directory tree iteratively without pulling down the entire directory structure first."

That won't be solved by an internal index, obviously.
Also @substack, it's an interesting idea to do recursive archives instead of an internal folder concept, like IPFS (basically) did. I wonder what the downsides would be.
@pfrazee roundtrips, harder to digest into a bitfield
not saying it's a bad idea at all, just tradeoffs
@mafintosh yeah, so even with optimizations around network lookup (e.g. assuming a peer serving the parent archive will also have the subarchives, rather than hitting the discovery network each time), you'll still need one round trip on an open connection for every subarchive you request, whereas the current system just streams the full listing.
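To illustrate why that matters (purely hypothetical API, with the round trip simulated by a timer; this is not how hyperdrive works today): resolving a nested path under recursive archives needs one sequential lookup per path segment, so latency grows with directory depth, whereas a streamed listing pays the round trip roughly once.

var RTT = 50 // assumed ms round trip on an already-open connection

// Hypothetical: ask the peer which sub-archive backs `name` inside `dir`.
function lookupChild (dir, name, cb) {
  setTimeout(function () { cb(null, dir + '/' + name) }, RTT)
}

// Walk the path one segment at a time, one simulated round trip each.
function resolve (p, cb) {
  var parts = p.split('/')
  var current = '(root archive)'
  var i = 0
  function next (err, child) {
    if (err) return cb(err)
    if (child) current = child
    if (i === parts.length) return cb(null, current)
    lookupChild(current, parts[i++], next)
  }
  next()
}

// resolve('planet/europe/meta.json', console.log) // ~3 round trips before the first byte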
Yea. After talking to @substack on IRC, there are a bunch of very obvious bugs in the current code when dealing with semi-large metadata. I'd want to just fix those first and see what kind of speed we get then. Even substack's metadata is only ~10mb, so it should be nowhere near this slow.
@mafintosh that makes sense, but fwiw 10mb of metadata is brutal for beaker's browsing use case
@pfrazee I'm not saying we should fetch 10mb of metadata. I'm saying fetching 10mb should never take 30 minutes like substack is describing, so clearly there's a regression, is all.
@mafintosh yeah definitely
Hyperdrive 8 is way faster at this now since it supports lookups in sparse metadata. Gonna keep this open so you can verify at some point.