
dat-index

Open max-mapper opened this issue 10 years ago • 2 comments

Given that so much scientific data lives in open FTP/HTTP directories, it would be nice to have a tool that takes a list of files (or a root URL plus spidering/traversal) and indexes them in dat, such that the blobs appear to exist in dat but are actually only stored in their original location.

pros:

  • will allow a lot of data to become available without much effort
  • you don't have to copy the blobs into the dat object store, so it requires minimal resources to deploy

cons:

  • blobs won't be versioned
  • 404s could happen (until we have IPFS, that is)
  • limited by throughput of original data source
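As a rough sketch of the spidering/traversal step described above (not part of any dat API, and assuming the server returns a plain HTML autoindex listing), the crawl could look something like this:

```typescript
// Hypothetical sketch: walk an HTTP open directory and collect file URLs.
// The href regex is crude and purely illustrative; a real tool would use a
// proper HTML parser and handle FTP listings separately.
async function crawlOpenDirectory(root: string, found: string[] = []): Promise<string[]> {
  const res = await fetch(root);            // Node 18+ global fetch assumed
  const html = await res.text();

  for (const [, href] of html.matchAll(/href="([^"?]+)"/g)) {
    if (href.startsWith("..") || href.startsWith("/")) continue; // skip parent/absolute links
    const url = new URL(href, root).toString();
    if (href.endsWith("/")) {
      await crawlOpenDirectory(url, found);  // recurse into subdirectories
    } else {
      found.push(url);                       // plain file: remember its URL only
    }
  }
  return found;
}
```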

When blobs are indexed, dat basically acts as a proxy. When you index a file, dat should probably hash it and store the hash in its metadata. When you replicate indexed data, there should be an option to replicate with or without blobs.
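A minimal sketch of what such an index entry could hold, assuming the tool fetches each remote file once and hashes it as it streams through (field names here are made up for illustration; dat's actual metadata format may differ):

```typescript
import { createHash } from "node:crypto";

// Hypothetical index record: only metadata is kept, the blob stays at its
// original http/ftp location.
interface IndexEntry {
  url: string;    // where the blob actually lives
  sha256: string; // content hash, so a later replication can verify the bytes
  size: number;   // byte length observed while hashing
}

async function indexRemoteFile(url: string): Promise<IndexEntry> {
  const res = await fetch(url);              // Node 18+ global fetch assumed
  if (!res.ok || !res.body) throw new Error(`fetch failed: ${url}`);

  const hash = createHash("sha256");
  let size = 0;
  // Stream the body through the hash without keeping the blob around.
  for await (const chunk of res.body) {
    hash.update(chunk);
    size += chunk.length;
  }
  return { url, sha256: hash.digest("hex"), size };
}
```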

This could definitely be written as a standalone CLI tool outside of dat, for experimentation purposes.

We could do cool stuff like `dat-index --watch`, which would update a folder's index metadata whenever its files change.
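A sketch of what that `--watch` mode might do for a local folder: re-hash a file whenever it changes and overwrite its index entry. This assumes Node's built-in `fs.watch` (recursive watching is platform-dependent), and `updateEntry` is a hypothetical callback, not a real dat function:

```typescript
import { watch } from "node:fs";
import { readFile } from "node:fs/promises";
import { createHash } from "node:crypto";
import { join } from "node:path";

function watchFolder(dir: string, updateEntry: (file: string, sha256: string) => void) {
  watch(dir, { recursive: true }, async (_event, filename) => {
    if (!filename) return;
    const path = join(dir, filename);
    try {
      const data = await readFile(path);
      const sha256 = createHash("sha256").update(data).digest("hex");
      updateEntry(path, sha256); // refresh the metadata for the changed file
    } catch {
      // file may have been deleted mid-event; a real tool would drop its entry
    }
  });
}
```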

max-mapper · Jun 27 '14 20:06

@maxogden Hi Max, just wondering -- did this idea ever get developed further? Currently we're doing a lot of fetching/syncing from ftp://ftp.ncbi.nlm.nih.gov, and something like this would be incredibly useful.

transcranial · Mar 04 '16 14:03

@transcranial Hi! I've been making slow progress. The design of Dat itself has evolved a lot since I opened this issue, but I think the idea is still very relevant, and it would actually be a lot easier to do today than with the Dat of 1.5 years ago.

I've been working on a crawler: https://github.com/maxogden/electron-microscope

I should also mention that @bmpvieira has some NCBI-specific tools under the https://github.com/bionode project.

Right now Dat can take a static snapshot of a set of files, but for file sets that change we are still working on a "dynamic" mode where you get a single Dat link and can subscribe to data changes. Currently a Dat link only describes the exact files that existed when you created it.

You should definitely hang out in our Gitter room, or the Code for Science room as well; there have been some good discussions related to this topic in there lately:

https://gitter.im/codeforscience/community https://gitter.im/datproject/discussions

max-mapper · Mar 04 '16 20:03