ga4gh-server icon indicating copy to clipboard operation
ga4gh-server copied to clipboard

BAM and BAM index files in source tree are not easy to maintain

Open diekhans opened this issue 9 years ago • 11 comments

Having only the binary SAM files in the tree makes maintenance difficult, as once can't diff mods etc. While it might be desirable to not require samtools to run, one can check in acsii SAM files as the masters, along with the derived BAM/BAI files.

diekhans avatar May 23 '15 06:05 diekhans

I have no objection to this. We should be able to use pysam to convert simply from SAM to BAM and generate the indexes, so in theory we should be able to just keep the SAM files, and generate the BAMs and indexes on the fly. I don't know how practical this would be though...

jeromekelleher avatar May 26 '15 16:05 jeromekelleher

I'm generally -1 on this. As long as there's a README specifying how to recreate the BAM, that's fine with me.

Unless we want to get into the business of generating data on server startup, which we seem to be shying away from (e.g. Md5 generation for references) since that would make server startup way slow.

dcolligan avatar May 26 '15 22:05 dcolligan

-1 for the way the ticket was written, because that is not what I intended.

The main goal is to have the master data files in a format that is maintainable with a VCS, that is ASCII text. So things like diff, blame, etc will work. We will want to change these.

Having to decode files to modify and then re-encode loses all of this functionality. They need to be actually versioned control at the line level if at all possible.

This is independent of if you also keep the binary files in tree. It appears to be of value that the binary copies of the files are checked in as blobs. Having code (script, make rule, what ever) to recreate the binary from the ASCII if they change will make it easy.

I don't have all the information about this, but generally this is approached by having a build rule to generate the binary version of the data files rather than checking them in or creating them every time the server runs. Now it might not make sense for the BAMs because this now requires bamtools to be installed.

Danny Colligan [email protected] writes:

I'm generally -1 on this. As long as there's a README specifying how to recreate the BAM, that's fine with me.

Unless we want to get into the business of generating data on server startup, which we seem to be shying away from (e.g. Md5 generation for references) since that would make server startup way slow.

— Reply to this email directly or view it on GitHub.*

diekhans avatar May 27 '15 04:05 diekhans

I'm inclined to agree re desirability of SAM master copies in ASCII. However you can get git diff and blame working for BAM files via git's textconv settings and e.g. samtools view; see samtools/.gitattributes, which probably should be suggesting

git config --global diff.bam.textconv "samtools view -h"

This turns out to be gloriously handy.

Having BAI files in git can lead to unexpected pain as it's occasionally impossible to affect their file timestamps: cf samtools/samtools@5e6d7bdbfbbc12d32328614fba99560895b161a4. But this is probably not an issue for non-trivial data files.

jmarshall avatar May 27 '15 08:05 jmarshall

It's not obvious to me that we can't build the BAM files and indexes online when we need them. We're only using these as test input and the files are tiny, so in theory we could do the following as a module-level test fixture:

  1. Create a temporary directory to hold the indexed BAM files, with whatever structure makes most sense for the test we're running;
  2. Convert the input SAM files (which are held in tests/data somewhere) into BAM and index them using pysam. Change the paths we're using in the tests to point to the temporary directory.

Then, we only keep the original ASCII text versions in git, which are much easier to manage. I would say we should do the same for the gzipped VCFs: only store the original text versions, and convert them into the required format on the fly, on a module-by-module basis.

jeromekelleher avatar May 27 '15 10:05 jeromekelleher

As far as implementation goes, it would probably just be easier to keep the binary files in the same directory as the text files and add the binary file extensions to .gitignore (if they're not there already).

The general thrust would definitely lengthen the time it would take to run tests. I don't know by how much or if that would be significant. Plus there would also be caching issues to deal with.

dcolligan avatar May 27 '15 15:05 dcolligan

Sounds good. If it makes sense, then keeping the blobs is fine. But lets test how long building them on the fly is going to take.

Danny Colligan [email protected] writes:

As far as implementation goes, it would probably just be easier to keep the binary files in the same directory as the text files and add the binary file extensions to .gitignore (if they're not there already).

The general thrust would definitely lengthen the time it would take to run tests. I don't know by how much or if that would be significant.

— Reply to this email directly or view it on GitHub.*

diekhans avatar May 27 '15 15:05 diekhans

new compliance data is all in human-readable format. Are we intending to standardize on this new dataset as the "example", in which case, we could close this issue?

macieksmuga avatar Sep 01 '15 21:09 macieksmuga

This is a bug on the serve example data, which is different that the compliance data. The compliance data is intended to have weird cases and might not be the best example.

Maciek Smuga-Otto [email protected] writes:

new compliance data is all in human-readable format. Are we intending to standardize on this new dataset as the "example", in which case, we could close this issue?

— Reply to this email directly or view it on GitHub.*

diekhans avatar Sep 02 '15 02:09 diekhans

Yes, the example data is intentionally different. We have a nice script to generate the example data now, which we'll use to generate the next version of the example data.

jeromekelleher avatar Sep 02 '15 12:09 jeromekelleher

Close this issue by converting the BAM to SAM in https://github.com/ga4gh/ga4gh-server/tree/master/tests/data/datasets/dataset1/reads

Then change scripts/build_test_data.py to first convert these SAM to BAM before loading into the test registry.

david4096 avatar Mar 09 '17 03:03 david4096