datproject-discussions
dat-index
given that so much scientific data is on ftp/http open directories, it would be nice if we had a tool that could take a list of files, or a root + do the spidering/traversal, and index them in dat such that the blobs appear to exist in dat but they are actually only stored in the original location
pros:
- will allow a lot of data to become available without much effort
- you don't have to copy the blobs into the dat object store, so it requires minimal resources to deploy
cons:
- blobs won't be versioned
- 404s could happen (until we have IPFS, that is)
- limited by throughput of original data source
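A rough sketch of the traversal step for an http open directory, for experimentation only. It assumes the server returns a plain HTML listing of `<a href>` links (as Apache-style autoindex pages do); `LinkCollector` and `list_directory` are hypothetical names, not part of dat, and ftp listings would need a different parser.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every href found on <a> tags in a directory listing page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def list_directory(base_url: str, html: str) -> Tuple[List[str], List[str]]:
    """Split a listing page into (file URLs, subdirectory URLs).

    Assumption: entries are relative links, and subdirectories end in "/".
    Sort links ("?C=N;O=D"), parent links ("../"), absolute paths and
    links to other hosts are skipped so the crawl stays under the root.
    """
    parser = LinkCollector()
    parser.feed(html)
    files: List[str] = []
    dirs: List[str] = []
    for href in parser.hrefs:
        if href.startswith(("?", "/", "#", "../")) or "://" in href:
            continue
        target = urljoin(base_url, href)
        if href.endswith("/"):
            dirs.append(target)
        else:
            files.append(target)
    return files, dirs
```

a crawler would fetch each URL in the second list and call `list_directory` again until no subdirectories are left, then hand the file URLs to the indexer.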
when blobs are indexed dat basically acts as a proxy. when you index a file dat should probably hash the file and store the hash in its metadata. when you replicate indexed data there should be an option to do a replication with blobs or without blobs
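To make the hash-and-proxy idea concrete, here is a minimal sketch, not dat's actual metadata format. The entry fields (`url`, `size`, `sha256`), the choice of SHA-256, and the `index_blob` / `replicate` functions are all assumptions made for illustration; `fetch` is a caller-supplied function that downloads a URL.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Optional


def index_blob(url: str, chunks: Iterable[bytes]) -> Dict[str, object]:
    """Hash a blob while streaming it and keep only metadata, not the bytes."""
    digest = hashlib.sha256()
    size = 0
    for chunk in chunks:
        digest.update(chunk)
        size += len(chunk)
    # The blob itself stays at its original location; we only record where
    # it lives and what it hashed to at index time.
    return {"url": url, "size": size, "sha256": digest.hexdigest()}


def replicate(
    index: List[Dict[str, object]],
    with_blobs: bool = False,
    fetch: Optional[Callable[[str], bytes]] = None,
) -> List[Dict[str, object]]:
    """Copy index entries, optionally pulling the blobs from the origin."""
    if with_blobs and fetch is None:
        raise ValueError("with_blobs=True needs a fetch function")
    copied = []
    for entry in index:
        record = dict(entry)
        if with_blobs:
            data = fetch(str(entry["url"]))
            # The origin may have changed since indexing (blobs are not
            # versioned), so check the bytes against the stored hash.
            if hashlib.sha256(data).hexdigest() != entry["sha256"]:
                raise ValueError("hash mismatch for %s" % entry["url"])
            record["blob"] = data
        copied.append(record)
    return copied
```

replicating without blobs copies only the small metadata records, while replicating with blobs also verifies each download against the hash recorded at index time.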
this can definitely be written as a standalone CLI tool outside of dat for experimentation purposes
we can do cool stuff like dat-index --watch, which would update the index metadata of a folder whenever files are changed
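One way a `--watch` mode could detect changes in a local folder is plain polling: take a snapshot of file sizes and modification times, then compare it to the previous one. This is only a sketch under that assumption; `snapshot` and `diff` are hypothetical helpers, and a real tool might use OS file-change notifications instead.

```python
import os
from typing import Dict, List, Tuple

# relative path -> (size in bytes, modification time in nanoseconds)
Snapshot = Dict[str, Tuple[int, int]]


def snapshot(root: str) -> Snapshot:
    """Record size and mtime for every file under root."""
    state: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            state[os.path.relpath(path, root)] = (info.st_size, info.st_mtime_ns)
    return state


def diff(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Report which paths were added, changed, or removed between snapshots."""
    return {
        "added": sorted(p for p in new if p not in old),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        "removed": sorted(p for p in old if p not in new),
    }
```

a watch loop would call `snapshot` on a timer and only re-hash the paths listed under `added` and `changed`, dropping the index entries listed under `removed`.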
@maxogden Hi Max, just wondering -- did this idea ever get further developed at all? Currently we're doing a lot of fetching/syncing with ftp://ftp.ncbi.nlm.nih.gov, and something like this would be incredibly useful.
@transcranial Hi! I've been making slow progress. The design of Dat itself has evolved a lot since I opened this issue, but I think the idea in this issue is still very accurate and would actually be a lot easier to do today than with the Dat of 1.5 years ago.
I've been working on a crawler: https://github.com/maxogden/electron-microscope
Also I should mention @bmpvieira has some NCBI specific tools under the https://github.com/bionode project.
Right now Dat can do a static snapshot of a version of a set of files, but for file sets that change we are still working on a "dynamic" mode for Dat where you get a single Dat link but can subscribe to data changes. Currently Dat links only describe the exact files at the time you create the link.
You should definitely hang out in our Gitter room, or the Code for Science room as well. There have been some good discussions related to this topic happening in there lately:
https://gitter.im/codeforscience/community https://gitter.im/datproject/discussions