
Moving toward better atlas versioning

Open · vigji opened this issue on Jan 13 '21 · 1 comment

This issue comes from a long discussion I had with @nickdelgrosso on these points. I will lay it out in full here, so that we can discuss it better at the meeting.

How the atlas generation workflow works now

Currently, there are three steps at work behind the scenes when a user asks for a bg-atlasapi atlas.

  1. Download the data from a remote source (e.g., the Allen website, or a BrainGlobe GIN repo), and structure it according to BrainGlobe specifications;
  2. Upload the compressed result to the brainglobe GIN repo.

(These two steps are taken care of by BrainGlobe developers, using the code in bg-atlasgen. There, consistency of the atlas structure across atlases is ensured by always invoking the same wrap_atlas() function, and some utilities make it easy to re-run the generation of all atlases if something changes in the format.)

  3. Download the atlas to the user's local directory.

(This third step is what happens when a user instantiates an atlas locally, if the atlas is not yet available on their machine.)
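For context, this is roughly what the third step looks like from the user's side (using a real atlas name as an example):

```python
from bg_atlasapi import BrainGlobeAtlas

# First instantiation downloads the packaged atlas from GIN;
# subsequent instantiations reuse the locally cached copy.
atlas = BrainGlobeAtlas("allen_mouse_25um")
```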

The current atlas versioning scheme consists of a major and a minor number; the minor number is bumped for fixes in individual atlases, and the major number when something changes in the BrainGlobe atlas format (e.g., a new metadata field, or a different format for the stacks). Every time there is a change in the source data or in the BrainGlobe format, BrainGlobe developers need to re-run steps 1 and 2 for one or multiple atlases. Some code in bg-atlasapi checks the user's local version against the available remote versions, and prompts the user to download again if required.
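A minimal sketch of that check, assuming a plain "major.minor" string; the function names are illustrative, not actual bg-atlasapi internals:

```python
# Illustrative sketch of the major.minor comparison described above;
# the function names are hypothetical, not bg-atlasapi internals.

def parse_version(version_str):
    """Split a 'major.minor' string into a tuple of ints: '1.2' -> (1, 2)."""
    major, minor = version_str.split(".")
    return int(major), int(minor)

def needs_redownload(local_version, remote_version):
    """True if the cached atlas is older than the remote one; both a major
    bump (format change) and a minor bump (atlas fix) should trigger it."""
    return parse_version(local_version) < parse_version(remote_version)

assert needs_redownload("0.3", "1.0")      # format changed upstream
assert not needs_redownload("1.1", "1.1")  # local copy is up to date
```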

Problems with the current formulation

  • at the moment there is no easy way to redo an analysis using an old version of bg-atlasapi and get an atlas that works with that version, which would be very important for reproducibility;
  • we are not using the version control functionality that GIN offers at all;
  • we need more solid validation code that, given a folder with an atlas, validates it and checks that nothing would break when using it (important, e.g., while developing an atlas). At the moment this is a bit scattered over both bg-atlasgen and bg-atlasapi, and to understand what an atlas should look like a person actually needs to check both places. Validation code should live entirely in bg-atlasapi, although a separate repository for generating the BrainGlobe-maintained atlases is still helpful.

What we propose

  • Have a GIN repository for each atlas, and release it with a DOI every time a stable version is ready; in this way it is citable and archived, and GIN ensures at least 10-year permanence of the data;
  • Have validation code in bg-atlasapi that can ensure that a folder is a valid atlas source, to facilitate atlas development by third parties, who would then not have to look into the bg-atlasgen repo (which should nonetheless remain for our own use);
  • Have code in the BrainGlobeAtlas class that can instantiate it with an arbitrary stable version (or, for developers, even with a specific commit of the atlas repo); validation code would ensure that that specific version of the atlas is compatible with the user's bg-atlasapi version (see the sketch below this list).
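A sketch of how the proposed interface could look; the `version` argument and the `validate_atlas_folder` helper are hypothetical and do not exist in bg-atlasapi today:

```python
from bg_atlasapi import BrainGlobeAtlas

# Hypothetical `version` argument: pin an analysis to a stable atlas
# release so it can be reproduced later with a matching bg-atlasapi.
atlas = BrainGlobeAtlas("allen_mouse_25um", version="1.2")

# Hypothetical helper: a third party validates their candidate atlas
# folder without ever looking into bg-atlasgen.
# validate_atlas_folder("/path/to/my_new_atlas")  # would raise if invalid
```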

An additional problem is that different atlas resolutions produce a lot of duplication (e.g., meshes are duplicated in each atlas package at a specific resolution). Ideally, we should not need a different GIN repo for each resolution of an atlas; on the other hand, we also don't want to download a zipped file with all the resolutions if we care only about one. It would be great to hear smart ideas on this point!

Here's a diagram of the new flow from @nickdelgrosso: [screenshot: diagram of the proposed workflow, Jan 13 2021]

vigji · Jan 13 '21 17:01

Update after the meeting discussion (cc @FedeClaudi @adamltyson @nickdelgrosso)

It was decided that allowing for more flexibility, and versioning the reference stacks, annotations, and meshes separately, would have multiple advantages:

  1. It would remove the redundancy produced by duplicating, e.g., the meshes across atlases that differ only in their resolution;
  2. It would facilitate the creation of atlases that are based upon segmentations of the same reference (e.g., different segmentations of the Allen template brain);
  3. It would give more flexibility in terms of how much needs to be there for an atlas; e.g., at that point an atlas could exist even with no meshes, until someone generates them.

This will require:

  1. a new definition of the atlas: atlases become collections of sources from which one retrieves the reference, annotation, meshes, etc. as needed. This would give us for free the flexibility of not necessarily downloading a full atlas package if we just use part of it, e.g. the reference (see the sketch after this list);
  2. moving the code that now runs the creation of the standardised atlas, and turning it into validation functions that run when one downloads the data for the first time, to make sure that the source data satisfy the requirements of the API (e.g., proper format for region annotations, etc.).
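A minimal sketch of what an "atlas as a collection of sources" could look like; every field name, URL, and version number below is an illustrative assumption, not a settled specification:

```python
# Illustrative sketch only: field names, URLs, and version numbers are
# all assumptions, not a settled specification.
example_atlas_spec = {
    "name": "example_mouse_25um",
    "sources": {
        "reference": {
            "url": "https://gin.g-node.org/.../reference_25um.tiff",
            "version": "1.0",
        },
        "annotation": {
            "url": "https://gin.g-node.org/.../annotation_25um.tiff",
            "version": "1.1",
        },
        # Meshes do not depend on stack resolution, so one mesh source
        # could be shared by the 10um/25um/100um packages:
        "meshes": {
            "url": "https://gin.g-node.org/.../meshes.tar.gz",
            "version": "1.0",
        },
    },
}

def fetch_component(atlas_spec, component):
    """Fetch (and validate) just the requested component, instead of
    downloading a full atlas package."""
    source = atlas_spec["sources"][component]
    print(f"would fetch {source['url']} at version {source['version']}")

fetch_component(example_atlas_spec, "reference")  # only the reference stack
```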

Remaining points that need to be understood before/during implementation:

  1. This flexibility is good, but should we also guarantee proper archiving on our GIN of all the versions we offer, to ensure reproducibility of future analyses? This can live in an entirely parallel way from the custom distributions of the people who generated the atlas.
  2. Decide whether we also loosen restrictions on the atlases, e.g. allow for different orientations or stack formats, as long as they fit proper usage cases (the orientation is described à la bg-space, and the stack can be loaded by a library such as @adamltyson's); see the sketch below.
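For the orientation point, a sketch of the kind of check this could imply, borrowing bg-space's three-letter origin strings (e.g. "asl" = anterior, superior, left); this is a hypothetical helper, not bg-space's actual API:

```python
# Hypothetical orientation check inspired by bg-space's three-letter
# origin strings ("asl" = anterior, superior, left); this is NOT
# bg-space's actual API, just an illustration of the convention.
AXIS_PAIRS = {"a": "p", "p": "a", "s": "i", "i": "s", "l": "r", "r": "l"}

def is_valid_orientation(origin: str) -> bool:
    """Valid if three known letters, each naming a different anatomical axis."""
    if len(origin) != 3 or any(c not in AXIS_PAIRS for c in origin):
        return False
    axes = {frozenset((c, AXIS_PAIRS[c])) for c in origin}
    return len(axes) == 3

assert is_valid_orientation("asl")       # anterior-superior-left
assert not is_valid_orientation("aps")   # "a" and "p" are the same axis
```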

vigji · Feb 10 '21 12:02