
Consider command for splitting dataset

Open kyleam opened this issue 5 years ago • 10 comments

Today on the call the idea of a command for splitting a dataset into multiple came up. Off the top of my head, I don't have a description of what the interface should look like or what capabilities should be supported, so this issue is mostly a placeholder to be fleshed out.

I mentioned that @jbpoline had previously discussed splitting datasets with me and @yarikoptic and that I had sent him a follow-up email with a rough sketch of an approach. Here it is in case it's of any use.

email
Subject: splitting dataset by cloning
Date: Fri, 19 Apr 2019 23:03:37 -0400

Hi JB,

Stemming from the discussion today, here's an attempt to split a
repository with a cloning-based method.  It preserves things like
metadata and URLs, but, as I note at the end, the git-annex branch
isn't entirely pruned.

First, here's the setup for a toy repository.  The important thing is
that we're registering URLs and adding metadata that should be preserved
in the split.

  $ cd $(mktemp -dt dl-split-XXXX)
  $ datalad create ds
  $ cd ds && mkdir dir-a dir-b

  $ cd dir-a
  $ datalad download-url -m'dir-a url' \
    https://raw.githubusercontent.com/datalad/artwork/master/brochure/git-annex-logo.svg
  $ git annex metadata --set somekey=val-a git-annex-logo.svg

  $ cd ../dir-b
  $ datalad download-url -m'dir-b url' \
    https://raw.githubusercontent.com/datalad/artwork/master/logos/logo_solo.svg
  $ git annex metadata --set somekey=val-b logo_solo.svg

That gives us a dataset with the following structure:

  ds
  |-- dir-a
  |   `-- git-annex-logo.svg -> ../.git/annex/objects/[...]
  `-- dir-b
      `-- logo_solo.svg -> ../.git/annex/objects/[...]

Now clone that repository into a new one that we will prune.

  $ cd ../..
  $ datalad install -s ds subset && cd subset

At this point, we need to get rid of the files and git history that we
don't want and then create a new dataset (i.e. generate a new dataset
UUID).  One way to do this would be to check out an orphan branch
(e.g., `git checkout --orphan=new`), remove any files that we don't
want, and drop 'datalad.dataset.id' from .datalad/config.  But if the
files we want to keep are limited to a subdirectory, it's simpler to
use filter-branch to create a new, restricted branch.

  $ git filter-branch --subdirectory-filter dir-a

After this, we'll be left with the single commit from above that
touched dir-a.  The working tree no longer contains a .datalad/config
file, so we can simply create a new dataset without worrying about the
previous dataset ID.

  $ datalad create --force

We can disconnect from the source repository by declaring the remote
dead to annex (important for the next step) and removing the git
remote:

  $ git annex dead origin && git remote rm origin

We can now squash history on the git-annex branch, pruning information
about the dead remote.

  $ git branch git-annex-old git-annex  # backup branch for comparison
  $ git annex forget --force --drop-dead

If you diff the new git-annex branch with the old one, you'll see that
the information about the previous remote was dropped, as intended.

  $ git diff git-annex-old git-annex

As expected, but sadly for this use case, information about other keys
for files that we filtered out (in this case the key that was in dir-b,
which we removed) is still available in the git-annex branch.  For
example, here is the metadata information for the file that was in
dir-b:

  $ git grep val-b git-annex
  git-annex:b05/e5b/MD5E-s4327--c9d3989d28cce3495a9b0f679e69f143.svg.log.met

And here's that non-existent file's URL:

  $ git grep --name-only logo_solo git-annex
  git-annex:b05/e5b/MD5E-s4327--c9d3989d28cce3495a9b0f679e69f143.svg.log.web

So, if you're splitting a large repository with this method, you'll
end up with a lot of unneeded content in the git-annex branch, which
is kind of a waste.

Not sure if that's helpful, but figured I'd send it along in case.
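
The orphan-branch alternative that the email mentions only in passing could look roughly like the sketch below. It is an illustration under assumptions, not a tested recipe for real datasets: the function name `orphan_prune` and the commit message are made up here, the branch name `new` is taken from the email's example, the function is expected to run at the top of the clone with the unwanted paths as arguments, and it does nothing about the git-annex branch.

```sh
# Hypothetical helper (not a datalad or git-annex command): start a branch
# without history, drop the unwanted paths, and forget the inherited dataset ID.
orphan_prune() {
    # Usage: orphan_prune UNWANTED_PATH...   (run from the top of the clone)
    [ "$#" -gt 0 ] || { echo "usage: orphan_prune PATH..." >&2; return 2; }

    # New parentless branch; the index and working tree stay as they were.
    git checkout -q --orphan new || return 1

    # Remove the unwanted paths from the index and the working tree.
    # -f is needed because, before the first commit on the new branch,
    # git treats every staged file as a not-yet-committed change.
    git rm -r -f -q -- "$@" || return 1

    # Drop the old dataset ID so that a new one can be assigned afterwards.
    if [ -f .datalad/config ]; then
        git config --file .datalad/config --unset datalad.dataset.id
        git add .datalad/config
    fi

    git commit -q -m "Start new dataset from a subset of the original"
}
```

In the toy repository this would be called as `orphan_prune dir-b`, after which the remaining steps of the email (`datalad create --force` onwards) would apply as before.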

kyleam avatar Jul 23 '19 15:07 kyleam

Thanks a lot for detailing this!

I think it is worth having the annex metadata pruning as a feature, i.e.

  • after filter-branch and before annex forget obtain a list of keys referenced in any branch
  • wipe out the corresponding metadata files in the annex branch
  • run annex forget
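
The first two steps could be approached along the lines of the sketch below, which only lists candidate files and does not remove anything from the git-annex branch. The helper name `stale_annex_logs` is hypothetical, the list of per-key log suffixes is an assumption based on what shows up in the example above (`.log.met`, `.log.web`) plus a few others that git-annex is known to use, and keys containing characters that git-annex escapes in file names are not handled.

```sh
# Hypothetical helper: given the keys that are still referenced, print the
# per-key log files on the git-annex branch that belong to other keys.
#   $1     file with one referenced key per line
#   stdin  paths on the git-annex branch, one per line
stale_annex_logs() {
    awk -v keys="$1" '
        BEGIN { while ((getline k < keys) > 0) keep[k] = 1 }
        # Per-key logs live under two hash directories, e.g. b05/e5b/KEY.log.met;
        # top-level files such as uuid.log are not per-key and are skipped.
        /^[0-9a-f][0-9a-f][0-9a-f]\/[0-9a-f][0-9a-f][0-9a-f]\// {
            key = $0
            sub(/^.*\//, "", key)
            # Assumed set of per-key suffixes; extend it if the branch has others.
            if (!sub(/\.log\.(met|web|rmt|rmet|cid|cnk)$/, "", key) &&
                !sub(/\.log$/, "", key))
                next
            if (!(key in keep)) print
        }
    '
}

# Possible use in a real repository (assumes a git-annex that supports
# "find --branch"; repeat the first command for every branch of interest):
#   git annex find --branch=master --include='*' --format='${key}\n' | sort -u > keys.txt
#   git ls-tree -r --name-only git-annex | stale_annex_logs keys.txt
```

How the listed files are then dropped from the git-annex branch is left open here; that step is the part that would need care in an actual implementation.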

Moreover, I think it is worth mentioning that this kind of "horizontal" splitting (rebuild history for some part of the directory tree) is one of three ways of pruning a dataset:

  1. "horizontal prune" (as described above)
    • partial history is maintained structurally, but the new dataset cannot fulfill dependencies declared by historic superdatasets
    • useful for taking out information that should have never made it into a dataset
    • reduces number of git objects, brings down inode footprint on disk
  2. "vertical prune" (i.e. forget history)
    • largely the same disadvantages as (1), and similar application, but with the motivation for pruning being located in time (range of commits), rather than in space
    • reduces number of git objects
  3. "incremental prune" (git rm)
    • a dataset has grown to have too large an inode footprint
    • history is fully maintained, but a commit with a git mv/rm that prunes/relocates content is added, such that two (now diverged) clones of the same original dataset can be combined in a new superdataset that takes the place of the original repository. Alternatively, a third altered clone can be pruned to only contain the new subdatasets. This way complete backward compatibility can be maintained.

All three methods could be flexibly combined. Let's say a subdataset needs to be split because some parts of it will grow disproportionately in the future. A set of subsubdatasets can be generated using any of the three methods above, whichever makes the most sense, and the original subdataset can be pruned via (3) and have the new subsubdatasets added to it at the end. This should provide a smooth upgrade experience for existing consumers, while adding flexibility for any new and existing consumer.

mih avatar Jul 24 '19 08:07 mih

I guess it remains for me to try this out on a couple of examples! Thanks a lot!

jbpoline avatar Jul 24 '19 12:07 jbpoline

@dorianps https://neurostars.org/t/datalad-containers-data-organization-multiple-questions/5509/16?u=yarikoptic wishlist:

Add merge and split functions for subdatasets. I.e., if I have a huge dataset and want to split it into chunks of subdatasets, datalad should take care of it. It is quite complicated for the user to try to do that manually with all the various .git folders, internal references, etc. The same logic goes for merging subdatasets into a larger dataset.

yarikoptic avatar Dec 11 '19 00:12 yarikoptic

Just bumping this issue with https://neurostars.org/t/dividing-existing-datalad-dataset-into-subdatasets/7067 because I encountered this problem too: finding out later that a directory should be its own subdataset.

kimsin98 avatar Jul 08 '21 08:07 kimsin98

FWIW, and now that there is https://git-annex.branchable.com/git-annex-filter-branch/ creating such a "proper" helper could be even easier ;)

yarikoptic avatar Jul 08 '21 20:07 yarikoptic

https://git-scm.com/docs/git-filter-branch has an intimidating list of warnings. How many of them may be applicable to datalad?

Also, for annex repos, should git annex filter-branch precede git filter-branch?

kimsin98 avatar Oct 01 '21 05:10 kimsin98

I would say so.

Might not matter but I would try annex before git

yarikoptic avatar Oct 01 '21 10:10 yarikoptic

Cross-referencing https://github.com/datalad/datalad/issues/600, which could, once resolved, provide a simple means for splitting a dataset when there is no interest in preserving prior history.

In https://github.com/datalad/datalad/issues/600#issuecomment-1059206527 and following, I demo how this could be done.

mih avatar Mar 04 '22 13:03 mih

What should the interface of such a command look like?

I feel that it could be

datalad split [-d|--dataset PATH] [--regex-subdatasets REGEX] [-c|--cfg-proc PROC] [-o|--output-path PATH] [--skip-rewrite all,parent,subdataset] [--dry-run] [PATHS] where

  • --regex-subdatasets should allow a path regex to be given to establish the possible boundary of a subdataset, e.g. ^sub-.+?/. It is a "more flexible alternative" to [PATHS], but maybe we just want to rely on shell globbing etc.?
    • recursive splitting?! Should we allow the creation of sub-subdatasets with such regex functionality? I can see how it could be useful, but it would be trickier (if not impossible, really -- I am not sure how to create a git history for an intermediate subdataset). Also, if subdatasets are specified by [PATHS], a check should be done that none of them is a subpath of another.
  • --cfg-proc to invoke for each of the new subdatasets. Note that a different proc might be desired for different paths/subdatasets. As a workaround, a custom proc could do path analysis and adjust its behavior; this way we can avoid a more complex definition that pairs REGEXes with PROCs. Not sure if we should (also) just rely on this or provide some --inherit all,.gitattributes,.gitignore,... flag. One tricky part would be adjusting those files for possible subpaths which could be present in them. Another tricky part would be a parent dataset that used git annex config (e.g. instead of .gitattributes) (ref: https://github.com/datalad/datalad/issues/5383)
  • --output-path - to avoid doing such an evil operation in place and rather work under that path
  • maybe we would also like to be able to provide custom options for filter-branch (to be used in the top dataset, which becomes the superdataset), git subtree split (for the subdataset), and git annex filter-branch?
  • --skip-rewrite - to avoid rewriting history altogether -- the rewrite might not be worth it for the superdataset, the subdataset, or both. That might preclude e.g. announcing the original annex dead for the superdataset etc.
  • ??? should we allow an explicit option to control whether the git-annex history should be rewritten?
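
Two of the path-related points in the list above, deriving subdataset boundaries from a regex and checking that no given path sits inside another, can be illustrated with a small sketch. Both function names are hypothetical and nothing here is existing datalad behavior. Since `grep -E` has no lazy quantifiers, the example boundary `^sub-.+?/` is written as the equivalent `^sub-[^/]+/` when calling it.

```sh
# Hypothetical helpers for illustration; neither is part of datalad.

# Print the distinct subdataset boundaries that an extended regex selects
# from a list of file paths read on stdin (e.g. the output of `git ls-files`).
subdataset_boundaries() {
    grep -oE -- "$1" | sort -u
}

# Succeed only if none of the given paths equals or is located inside another.
paths_are_disjoint() {
    for a in "$@"; do
        hits=0
        for b in "$@"; do
            # Compare with a trailing slash so that sub-01 does not match sub-010.
            case "${b%/}/" in
                "${a%/}/"*) hits=$((hits + 1)) ;;
            esac
        done
        # Every path matches itself once; any further hit is a nested or duplicate path.
        [ "$hits" -eq 1 ] || return 1
    done
    return 0
}
```

For instance, `git ls-files | subdataset_boundaries '^sub-[^/]+/'` would print one line per `sub-*/` directory, and `paths_are_disjoint sub-01 sub-01/anat` would fail because the second path lies inside the first.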

@AKSoo had in mind a scenario to "convert" a sub-folder/ into a subdataset. Then it could be as simple as datalad split -c text2git sub-folder.

notes:

  • the superdataset's filter-branch should start operating only from the first commit that touched any of the PATHs to be split away
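
Finding that starting commit could be as simple as the sketch below. The helper name `first_commit_touching` is made up for illustration, and "first" here means the first entry in git's reversed listing of HEAD's history, which is unambiguous for a linear history but only one possible choice when the paths were introduced on several merged branches.

```sh
# Hypothetical helper: print the oldest commit reachable from HEAD that
# touched any of the given paths.
first_commit_touching() {
    # `--reverse` lists oldest first; the limit is applied with `head`
    # because git applies -n before reversing the order.
    git rev-list --reverse HEAD -- "$@" | head -n 1
}
```

Called as `first_commit_touching sub-01 sub-02`, the result could then be used to bound the revision range handed to whichever history-rewriting tool ends up being used.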

yarikoptic avatar Aug 02 '22 13:08 yarikoptic