Overlapping dat repositories and data deduplication

Open • mitar opened this issue 7 years ago • 1 comment

I have a few hundred datasets with the following structure (which I cannot change):

full_dataset/
train_dataset/
test_dataset/
extra_files

Currently I am creating four dat repositories for each dataset: one for each of the full, train, and test directories, and one for the main directory as a whole. The reason is that users need to be able to fetch just a particular view of the dataset, or to fetch everything.

There is some overlap in data between those views. They contain the same raw files: the train and test datasets are just a split of the full dataset. So the raw files are in fact stored twice.

From my understanding, each dat repository is separate from the others. In this case that means the data is duplicated, and no deduplication is done for a user who wants to download both the whole dataset and the separate directories.
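
To give a sense of how much data is affected, here is a rough Python sketch that measures the overlap on disk. It is not part of dat and does not use any dat API; it only hashes file contents with SHA-256 and counts the bytes that appear more than once under a dataset directory. The function names `file_digest` and `duplicated_bytes` are my own, made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicated_bytes(root):
    """Walk root, group files by content hash, and return
    (total_bytes, redundant_bytes), where redundant_bytes counts
    every copy of a file beyond the first one."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)

    total = 0
    redundant = 0
    for paths in by_digest.values():
        # Files with the same digest have the same content, hence the same size.
        size = os.path.getsize(paths[0])
        total += size * len(paths)
        redundant += size * (len(paths) - 1)
    return total, redundant
```

Calling `duplicated_bytes` on the main directory of one dataset returns the total size and the number of bytes that a content-addressed store shared across the views would not need to keep or transfer a second time. This is whole-file deduplication only; it says nothing about how dat chunks data internally.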

So my question is: could there be some way to define bundles of dat repositories which would work together and share the same underlying swarm, so that both sharing and deduplication would happen across the whole bundle? In our case the same users use all of our datasets, so this would work great.

Feb 21 '18 21:02 mitar

This is actually being discussed right now at https://github.com/mafintosh/hyperdrive/issues/203#issuecomment-367436804

Feb 21 '18 21:02 pfrazee