SISRS
Add unit testing/continuous integration
It would be good to implement at least some basic testing and CI. The first step would be to come up with a standard small dataset to run against.
Some considerations with respect to the sample dataset (an idea I wholly agree with):
- As you stated, small genomes would be great. I'm currently in hour 60 of a primate analysis, so we should be looking at things that are orders of magnitude smaller.
- Ideally, at least one taxon from the dataset should have a well-annotated genome to facilitate downstream analyses.
- Data should be a mix of SE and PE reads in order to properly test different configurations. If SE reads are hard to come by, this could of course be accomplished artificially by partially splitting up a PE dataset.
@BobLiterman do you have (or would possibly be able to create) any small datasets that would be appropriate for unit testing (< 5 seconds to run) and/or integration testing (< 5 minutes)?
Hi @anderspitman, I wonder if the best way to go with that would be to use a preassembled bacterial genome dataset and a small number of real reads? That way, reads are mapping to something with the biological complexity of a real dataset, but in a size that is fast to run? Using the premade flag, one can skip the redundant step of repetitive genome assembly each time a test is run, cutting down on time. Even with genome assembly happening each time, I would anticipate that bacterial datasets would run relatively quickly...
What about using 5-6 strains of E. coli for the composite genome, and then for reads you could just put a small number in each taxon folder? 10,000? 1,000,000? If the goal is just to benchmark and unit test, I don't see the harm in going on the low end, as you just want the reads to map and sites to be called.
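Something along these lines is what I'm picturing, just as a sketch: the strain/file names and read counts are placeholders, and seqtk is only one option for subsampling.

```bash
# Rough sketch only: strain and file names are placeholders, and seqtk is
# just one way to subsample reads reproducibly.
cat K12.fa O157_H7.fa DH10B.fa BL21.fa CFT073.fa > composite_genome.fa

# Drop ~10,000 read pairs into each taxon folder. Using the same seed (-s)
# for R1 and R2 keeps the pairs in sync.
for taxon in TaxonA TaxonB TaxonC TaxonD; do
    mkdir -p "$taxon"
    seqtk sample -s 42 "${taxon}_full_R1.fastq" 10000 > "$taxon/${taxon}_R1.fastq"
    seqtk sample -s 42 "${taxon}_full_R2.fastq" 10000 > "$taxon/${taxon}_R2.fastq"
done
```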
Thoughts all?
Unfortunately my input may not be very useful here. I know how assembly and alignment work, but barely understand what makes SISRS unique.
That said, using bacterial data seems like a good idea. Also using premade would be a great way to start. Since my primary goal for the testing stuff is to provide a baseline for a python port, I'm planning to start with the minimum amount of functionality. This will probably look like doing as much of the pipeline as possible with the old script, and only the final stages with the new code. As I verify each part of the pipeline I can work my way backwards adding more test data and earlier pipeline stages as I go.
Does that make sense?
It makes sense to me. Keep in mind I have added options in SISRS to run specific subroutines alone. In that way you could make benchmark datasets for each stage in the pipeline and save them such that you can run any step you wanted in a standalone fashion.
You'd basically have a raw data folder, and then you would do a SISRS run calling each function, starting from the top and moving down. Each output can be set to go to a new directory, which can act as the starting point for the next function ie)
Raw Data --> Run Genome Assembly into Dir1 (-c 0 option set so SISRS doesn't continue past Step 1)
Running from Dir1 --> Run Step 2 into Dir2
Running from Dir2 --> Run Step 3 into Dir3
etc.
Running from Dir(n-1) --> Run final step into DirN
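In command form (using the premade-genome shortcut so assembly is skipped), that would look roughly like the following; the flags and directory names are the ones used later in this thread, and any subcommand name other than alignContigs is a placeholder:

```bash
# Stage 1: align reads against the premade composite genome, stop there
# (-c 0), and write everything into its own directory.
sisrs alignContigs -p 4 -a premade -c 0 -f 0_RawData_PremadeGenome -z 1_alignContigs

# Each later stage would point -f at the previous stage's directory and -z at
# a fresh one, e.g. (subcommand name is a placeholder):
#   sisrs <nextSubcommand> -p 4 -a premade -c 0 -f 1_alignContigs -z 2_<nextStage>
```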
Then, as you port bash to Python from the bottom up, you can use the directory containing the previous step's output as your starting data, i.e.:
- Convert last function to Python
- Call last function of SISRS using Dir(n-1) as starting data
- If it works, convert next function up in pipeline to Python
- Call SISRS's second-to-last function (-c 1 so it runs through to the end) using the appropriate data, etc.
That's a lot and it would be really easy for me to describe either over Skype or some such thing if it's not clear.
If you can give me a few days to get this grant done, I could probably create the data directories relatively quickly, and then you could just clone them all to have a clean version.
@BobLiterman, I'm pretty sure I understand, and I really like this idea! From a testing point of view, there are several advantages. For one, when making changes to a specific part of the pipeline, the developer only has to worry about running unit tests for that stage, which will be a lot faster than running the entire pipeline. Entire pipeline tests would only need to be run when submitting a PR.
Also, when writing tests, you typically have your expected data and your observed data, and you check to verify they are the same. With your proposal the expected data would always be available as the next stage of the pipeline, with the only exception being the final stage.
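As a rough sketch of what a per-stage check might look like (the directory layout and file paths here are assumptions on my part, and the byte-level comparison is just a placeholder until we settle on a proper BAM comparison):

```bash
#!/usr/bin/env bash
# Hypothetical per-stage test: run one stage from the previous stage's
# checked-in data, then compare the observed output against the next
# checked-in stage directory.
set -euo pipefail

EXPECTED=test_data/1_alignContigs
OBSERVED=observed_1_alignContigs

sisrs alignContigs -p 1 -a premade -c 0 \
    -f test_data/0_RawData_PremadeGenome -z "$OBSERVED"

# Compare each expected BAM with the observed one at the same relative path.
for exp_bam in "$EXPECTED"/*/*.bam; do
    obs_bam="$OBSERVED/${exp_bam#"$EXPECTED"/}"
    # Placeholder byte-level comparison; probably too strict for BAMs and
    # likely to be swapped for a semantic comparison later.
    cmp "$exp_bam" "$obs_bam"
done
```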
That would be awesome if you can get the test data put together, separated by directories, with a different directory for each pipeline stage. Just let me know when you're able to do it. I should be able to get quite a bit of the architecture in place to support this.
Also let me know if I seem to be misunderstanding what you're suggesting.
I think we're on the same page. I'll look into this more next week.
@BobLiterman one other thing we need to keep in mind is that we're going to want to have the example data use as many features as possible, in order to test most of the code paths. That may necessitate multiple versions of the same pipeline stage for runs with different options.
I gave this a first crack, with some caveats.
https://github.com/BobLiterman/SISRS_Small_Data
Basics:
- I took 49 contigs from a primate SISRS run. (Damn header row)
- Extracted associated reads for 8 species to recreate raw data
- Created some special case data (2 individuals in a folder, all SE reads, SE + PE reads)
- Ran SISRS step-by-step, outputting to a new folder each time.
Notes:
- I have not used the reference genome option in my personal work, and don't have a great sense of that side of the program. I didn't run it here because I don't know what I would run it against. Rachel mentioned that it's possible we may ditch it, but we can talk about that if/when need be.
- I didn't do anything with the loci half of SISRS yet. I figured we can tackle sites, and then move on from there.
Ran fast for me with 20 CPUs. Let me know if it suits your needs.
This looks great @BobLiterman! I'll start playing with it and let you know if I have any feedback. Sounds like reference and loci aren't used as much so ignoring them for now is just fine by me. The python port I'm working on will be a great opportunity to drop features we don't need. I'm starting to guess we might actually just have the Python version become SISRS 2.0. That will make it easier for people to continue using any legacy features of 1.x while allowing development to move forward.
Hey @BobLiterman, a couple questions.
Is it necessary to run against all 8 species, or can we get meaningful results from just one or two? I see the SISRS documentation says a site only requires 2 species to be included in the final alignment, but I'm not sure whether the species you used mostly cover the same sites.
I'm just trying to minimize the amount of processing needed for each test run. Keep in mind Travis only provides 1 CPU as far as I know, and I believe the run time is limited to 5 minutes or so.
Well, in order to identify biallelic sites you need at least 4 taxa. So in that sense you could reduce it that far and still get all the output. That would be as simple as just removing four taxa and re-running the sub-steps, which I could do in a few minutes.
Sweet. I don't think it's necessary to do just yet. I should be able to do all the plumbing with what we have now. But we'll leave that as an optimization if we run into trouble meeting Travis's constraints.
What are some good tools for comparing BAM files to determine if they're semantically the same? ie I need to make sure the data I'm generating matches the output in your repo for each stage of the pipeline.
Another question: For the commands listed in the Commands file, what directory did you run each of them from? I'm guessing from within 0_, 1_, etc?
Make raw data (d0), copy the raw folder to dir1, run command 1 in dir1, copy the new dir1 to dir2, run command 2 in dir2, etc.
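In other words, something roughly like this (the stage names and command scripts are placeholders; the actual invocations are in the Commands file):

```bash
cp -r 0_RawData 1_alignContigs               # copy the raw data forward
(cd 1_alignContigs && bash ../command1.sh)   # run command 1 inside dir1 (placeholder)
cp -r 1_alignContigs 2_nextStage             # copy the result forward
(cd 2_nextStage && bash ../command2.sh)      # run command 2 inside dir2 (placeholder)
# ...and so on down to the final stage
```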
For comparing BAMs, see the links below. But keep in mind that mapping won't be identical every time, so the files might not be the same, and that might be OK.
http://seqanswers.com/forums/showthread.php?t=17644 http://deeptools.readthedocs.io/en/latest/content/tools/bamCompare.html
Thanks @rachelss. We'll definitely need to have deterministic behavior at each pipeline stage in order to properly do CI. That might not be too difficult. bowtie2 for example appears to support this (see this page).
I'll look into this more. Off the top of your heads what sources of non-determinism should I be looking out for? Basically anywhere a random number is used we'll need a way to set the seed the same when running tests.
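For the bowtie2 stage specifically, these look like the relevant knobs (both flags are in the bowtie2 manual; whether SISRS currently exposes them is a separate question):

```bash
# --seed fixes the pseudo-random seed bowtie2 uses when breaking ties between
# equally good alignments; --reorder forces multi-threaded output into input
# order, so -p shouldn't change the order of SAM records.
bowtie2 -p 4 --seed 42 --reorder -N 1 --local -x ./contigs \
    -U GorGor_A_R1.fastq,GorGor_A_R2.fastq,GorGor_B_R1.fastq,GorGor_B_R2.fastq \
    > GorGor_deterministic.sam
```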
FYI the bamUtil diff command seems to do exactly what we need for comparing BAM files. I've tested it a bit and even if the files don't match on the binary level (ie using diff or cmp), bamUtil does the job.
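For reference, the invocation I've been testing looks like this (paths are from my local runs, so treat them as illustrative):

```bash
# 'bam' is the bamUtil executable; its diff subcommand reports record-level
# differences rather than requiring byte-identical files.
bam diff --in1 output1/GorGor/GorGor.bam --in2 output2/GorGor/GorGor.bam
```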
@rachelss @BobLiterman I'm noticing some strange behavior. When running the command
sisrs alignContigs -p X -a premade -c 0 -f 0_RawData_PremadeGenome -z output
With different values of X for number of processors, I'm getting different output. Specifically, I noticed that the number of reported reads for GorGor (first line after "==== Aligning GorGor as Single-Ended ====") is different for lower numbers of CPUs. Here's a table of my results:
Num Processors | Reported # Reads |
---|---|
1 | 14690 |
2 | 29380 |
3 | 46012 |
>= 4 | 62644 |
It appears to saturate at 4+ CPUs. I don't really know what, if anything, this indicates. Any ideas?
Also noticed 29380 is 2*14690. The other values increase on the same order but are not exact multiples.
@BobLiterman additionally, when running alignContigs I haven't been able to get GorGor.bam to match what's in 0_RawData_PremadeGenome, even when I used -p 20 as you did. The output from SISRS is also different, though it does report 62644 reads. Are you sure the commands you ran match what's in the Commands file?
I am able to get identical data between 2 runs on my machine.
Thanks!
**Correction: I haven't been able to get it to match what's in 1_alignContigs.
Definitely those are the commands. I don't run anything species-specific. Everything was run in batch with the commands as given. I ran the following command using the same data:
sisrs alignContigs -p 5 -a premade -c 0 -f DIR/0_RawData_PremadeGenome/ -z DIR/1_alignContigs_5CPU &> 1_alignContigs_5CPU_Log
I cannot recreate your processor-dependent mapping issue, as my Bowtie logs are identical (see attached file)
Identical with 1CPU as well.
Running the entire SISRS sites program with 5CPU starting from the raw data yielded identical counts of variable, biallelic, and PI sites as well, on my end.
Strange. Yeah when I run the following command:
~/code/sisrs/SISRS/bin/sisrs alignContigs -p8 -a premade -c 0 -f ~/code/sisrs/SISRS/test_data/0_RawData_PremadeGenome/ -z output1 &> alignContigs_Log1
Snippets from the results are below. You can see that the counts are the same (ie 62644 GorGor unpaired reads in both cases), but the percentage breakdowns are quite different. I don't think this is a randomness problem because the results always match between 2 runs I do on my machine (other than the weird CPU <=4 thing). I'm not sure what the problem is. Any ideas?

For now I think I'll just regenerate the data for the various stages from your source data, since that seems to be working as long as it's all on the same machine. Might be a problem in the long run though when we try to deploy to the continuous integration server.
Here's mine:
==== Aligning GorGor as Single-Ended ====
62644 reads; of these:
62644 (100.00%) were unpaired; of these:
9316 (14.87%) aligned 0 times
53258 (85.02%) aligned exactly 1 time
70 (0.11%) aligned >1 times
85.13% overall alignment rate
==== Aligning HomSap as Single-Ended ====
28712 reads; of these:
28712 (100.00%) were unpaired; of these:
2568 (8.94%) aligned 0 times
26073 (90.81%) aligned exactly 1 time
71 (0.25%) aligned >1 times
91.06% overall alignment rate
==== Aligning HylMol as Single-Ended ====
24872 reads; of these:
24872 (100.00%) were unpaired; of these:
6883 (27.67%) aligned 0 times
17935 (72.11%) aligned exactly 1 time
54 (0.22%) aligned >1 times
72.33% overall alignment rate
==== Aligning MacFas as Single-Ended ====
14760 reads; of these:
14760 (100.00%) were unpaired; of these:
3690 (25.00%) aligned 0 times
11028 (74.72%) aligned exactly 1 time
42 (0.28%) aligned >1 times
75.00% overall alignment rate
==== Aligning MacMul as Single-Ended ====
20392 reads; of these:
20392 (100.00%) were unpaired; of these:
6249 (30.64%) aligned 0 times
14109 (69.19%) aligned exactly 1 time
34 (0.17%) aligned >1 times
69.36% overall alignment rate
==== Aligning PanPan as Single-Ended ====
86790 reads; of these:
86790 (100.00%) were unpaired; of these:
13165 (15.17%) aligned 0 times
73573 (84.77%) aligned exactly 1 time
52 (0.06%) aligned >1 times
84.83% overall alignment rate
==== Aligning PanTro as Single-Ended ====
86272 reads; of these:
86272 (100.00%) were unpaired; of these:
13462 (15.60%) aligned 0 times
72735 (84.31%) aligned exactly 1 time
75 (0.09%) aligned >1 times
84.40% overall alignment rate
==== Aligning PonPyg as Single-Ended ====
29114 reads; of these:
29114 (100.00%) were unpaired; of these:
9069 (31.15%) aligned 0 times
20019 (68.76%) aligned exactly 1 time
26 (0.09%) aligned >1 times
68.85% overall alignment rate
and yours:
==== Aligning GorGor as Single-Ended ====
62644 reads; of these:
62644 (100.00%) were unpaired; of these:
7611 (12.15%) aligned 0 times
54898 (87.63%) aligned exactly 1 time
135 (0.22%) aligned >1 times
87.85% overall alignment rate
==== Aligning HomSap as Single-Ended ====
28712 reads; of these:
28712 (100.00%) were unpaired; of these:
2062 (7.18%) aligned 0 times
26589 (92.61%) aligned exactly 1 time
61 (0.21%) aligned >1 times
92.82% overall alignment rate
==== Aligning HylMol as Single-Ended ====
24872 reads; of these:
24872 (100.00%) were unpaired; of these:
5207 (20.94%) aligned 0 times
19588 (78.76%) aligned exactly 1 time
77 (0.31%) aligned >1 times
79.06% overall alignment rate
==== Aligning MacFas as Single-Ended ====
14760 reads; of these:
14760 (100.00%) were unpaired; of these:
3155 (21.38%) aligned 0 times
11547 (78.23%) aligned exactly 1 time
58 (0.39%) aligned >1 times
78.62% overall alignment rate
==== Aligning MacMul as Single-Ended ====
20392 reads; of these:
20392 (100.00%) were unpaired; of these:
4972 (24.38%) aligned 0 times
15373 (75.39%) aligned exactly 1 time
47 (0.23%) aligned >1 times
75.62% overall alignment rate
==== Aligning PanPan as Single-Ended ====
86790 reads; of these:
86790 (100.00%) were unpaired; of these:
10831 (12.48%) aligned 0 times
75868 (87.42%) aligned exactly 1 time
91 (0.10%) aligned >1 times
87.52% overall alignment rate
==== Aligning PanTro as Single-Ended ====
86272 reads; of these:
86272 (100.00%) were unpaired; of these:
11040 (12.80%) aligned 0 times
75110 (87.06%) aligned exactly 1 time
122 (0.14%) aligned >1 times
87.20% overall alignment rate
==== Aligning PonPyg as Single-Ended ====
29114 reads; of these:
29114 (100.00%) were unpaired; of these:
7024 (24.13%) aligned 0 times
22059 (75.77%) aligned exactly 1 time
31 (0.11%) aligned >1 times
75.87% overall alignment rate
@BobLiterman I've been trying to get consistent behavior without too much success. I think I've narrowed a good chunk of it down to a specific problem. Can you run the following commands from within the GorGor directory and share your output?
bowtie2-build ../premadeoutput/contigs.fa ./contigs
bowtie2 -p 1 -N 1 --local -x ./contigs -U GorGor_A_R1.fastq,GorGor_A_R2.fastq,GorGor_B_R1.fastq,GorGor_B_R2.fastq > out.sam
When varying -p X I'm getting different answers for anything less than 4. I tried on 2 different machines. Could this be a bug in bowtie2, or am I just completely misunderstanding what the output means?
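For reference, here's roughly how I've been comparing the runs; it just repeats the alignment at different thread counts, prints the first line of each bowtie2 summary, and diffs the sorted records so that ordering differences alone don't count as mismatches:

```bash
# Re-run the same alignment at several thread counts and compare summaries.
for p in 1 2 4 8; do
    bowtie2 -p "$p" -N 1 --local -x ./contigs \
        -U GorGor_A_R1.fastq,GorGor_A_R2.fastq,GorGor_B_R1.fastq,GorGor_B_R2.fastq \
        > "out_p${p}.sam" 2> "bt2_p${p}.log"
    echo "-p ${p}: $(head -n 1 "bt2_p${p}.log")"
done

# Compare alignment records (minus headers) between two runs, ignoring order,
# since thread count can legitimately change record order.
diff <(grep -v '^@' out_p1.sam | sort) <(grep -v '^@' out_p4.sam | sort) | head
```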