ga4gh-server
ga4gh-server copied to clipboard
BAM and BAM index files in source tree are not easy to maintain
Having only the binary SAM files in the tree makes maintenance difficult, as once can't diff mods etc. While it might be desirable to not require samtools to run, one can check in acsii SAM files as the masters, along with the derived BAM/BAI files.
I have no objection to this. We should be able to use pysam to convert simply from SAM to BAM and generate the indexes, so in theory we should be able to just keep the SAM files, and generate the BAMs and indexes on the fly. I don't know how practical this would be though...
I'm generally -1 on this. As long as there's a README specifying how to recreate the BAM, that's fine with me.
Unless we want to get into the business of generating data on server startup, which we seem to be shying away from (e.g. Md5 generation for references) since that would make server startup way slow.
-1 for the way the ticket was written, because that is not what I intended.
The main goal is to have the master data files in a format that is maintainable with a VCS, that is ASCII text. So things like diff, blame, etc will work. We will want to change these.
Having to decode files to modify and then re-encode loses all of this functionality. They need to be actually versioned control at the line level if at all possible.
This is independent of if you also keep the binary files in tree. It appears to be of value that the binary copies of the files are checked in as blobs. Having code (script, make rule, what ever) to recreate the binary from the ASCII if they change will make it easy.
I don't have all the information about this, but generally this is approached by having a build rule to generate the binary version of the data files rather than checking them in or creating them every time the server runs. Now it might not make sense for the BAMs because this now requires bamtools to be installed.
Danny Colligan [email protected] writes:
I'm generally -1 on this. As long as there's a README specifying how to recreate the BAM, that's fine with me.
Unless we want to get into the business of generating data on server startup, which we seem to be shying away from (e.g. Md5 generation for references) since that would make server startup way slow.
— Reply to this email directly or view it on GitHub.*
I'm inclined to agree re desirability of SAM master copies in ASCII. However you can get git diff and blame working for BAM files via git's textconv
settings and e.g. samtools view; see samtools/.gitattributes, which probably should be suggesting
git config --global diff.bam.textconv "samtools view -h"
This turns out to be gloriously handy.
Having BAI files in git can lead to unexpected pain as it's occasionally impossible to affect their file timestamps: cf samtools/samtools@5e6d7bdbfbbc12d32328614fba99560895b161a4. But this is probably not an issue for non-trivial data files.
It's not obvious to me that we can't build the BAM files and indexes online when we need them. We're only using these as test input and the files are tiny, so in theory we could do the following as a module-level test fixture:
- Create a temporary directory to hold the indexed BAM files, with whatever structure makes most sense for the test we're running;
- Convert the input SAM files (which are held in tests/data somewhere) into BAM and index them using pysam. Change the paths we're using in the tests to point to the temporary directory.
Then, we only keep the original ASCII text versions in git, which are much easier to manage. I would say we should do the same for the gzipped VCFs: only store the original text versions, and convert them into the required format on the fly, on a module-by-module basis.
As far as implementation goes, it would probably just be easier to keep the binary files in the same directory as the text files and add the binary file extensions to .gitignore (if they're not there already).
The general thrust would definitely lengthen the time it would take to run tests. I don't know by how much or if that would be significant. Plus there would also be caching issues to deal with.
Sounds good. If it makes sense, then keeping the blobs is fine. But lets test how long building them on the fly is going to take.
Danny Colligan [email protected] writes:
As far as implementation goes, it would probably just be easier to keep the binary files in the same directory as the text files and add the binary file extensions to .gitignore (if they're not there already).
The general thrust would definitely lengthen the time it would take to run tests. I don't know by how much or if that would be significant.
— Reply to this email directly or view it on GitHub.*
new compliance data is all in human-readable format. Are we intending to standardize on this new dataset as the "example", in which case, we could close this issue?
This is a bug on the serve example data, which is different that the compliance data. The compliance data is intended to have weird cases and might not be the best example.
Maciek Smuga-Otto [email protected] writes:
new compliance data is all in human-readable format. Are we intending to standardize on this new dataset as the "example", in which case, we could close this issue?
— Reply to this email directly or view it on GitHub.*
Yes, the example data is intentionally different. We have a nice script to generate the example data now, which we'll use to generate the next version of the example data.
Close this issue by converting the BAM to SAM in https://github.com/ga4gh/ga4gh-server/tree/master/tests/data/datasets/dataset1/reads
Then change scripts/build_test_data.py
to first convert these SAM to BAM before loading into the test registry.