Using Telomere-to-Telomere as reference
For my analysis I wish to use the Telomere-to-Telomere human reference genome, of which there are two assemblies: the RefSeq assembly (GCF_009914755.1) and the GenBank assembly (GCA_009914755.4). I've aligned my data to both of these references separately. When I use the GenBank assembly Straglr runs just fine, but with the RefSeq assembly Straglr finishes in 1 second and writes empty output files containing only headers. It does not throw an error or anything.
The only major differences between the two assemblies, as far as I can see, are that RefSeq's does not have the mitochondrial genome (which shouldn't make any difference in this matter) and that they use different naming conventions for the chromosomes. For example:
| hg38 | GCF_009914755.1 (RefSeq) | GCA_009914755.4 (GenBank) |
|---|---|---|
| chr1 | NC_060925.1 | CP068277.2 |
| chr2 | NC_060926.1 | CP068276.2 |
| chr3 | NC_060927.1 | CP068275.2 |
| ... | ... | ... |
My only wild guess (which I don't really believe to be the reason) is that Straglr does not like the underscores in RefSeq's naming convention. Other than that I'm at a loss as to what the problem could be.
I just did a test with a small region of the genome where I know there is a repeat present. I removed the underscores from the RefSeq chromosome names (NC_060925.1 --> NC060925.1) in both the reference and the BAM file, and it worked!! Could the code possibly be updated to handle underscores in chromosome names? :)
Please try running it with the option --include_alt_chroms; it should include chromosomes with underscores in their names.
I used --include_alt_chroms and it worked like a charm :) Thanks for your help, and sorry I didn't look thoroughly at the parameters before submitting an issue.
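For anyone hitting the same symptom, here is a minimal Python sketch of the filter as it is described in this thread: contig names containing an underscore are skipped unless alternate chromosomes are explicitly included. This is not Straglr's actual code. The function names `contig_names_from_fai` and `partition_contigs` are hypothetical, and the underscore rule is an assumption based only on the comments above. It can be used to check whether a reference's contig names would be affected before running anything.

```python
def contig_names_from_fai(fai_text):
    """Return contig names from the text of a FASTA .fai index.

    The contig name is the first tab-separated column of each line.
    """
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def partition_contigs(names, include_alt_chroms=False):
    """Split contig names into (kept, skipped).

    Assumption taken from this thread, not from Straglr's source:
    names containing '_' are skipped unless include_alt_chroms is True.
    """
    kept, skipped = [], []
    for name in names:
        if "_" in name and not include_alt_chroms:
            skipped.append(name)
        else:
            kept.append(name)
    return kept, skipped


# Contig names taken from the table in this issue.
refseq = ["NC_060925.1", "NC_060926.1", "NC_060927.1"]
genbank = ["CP068277.2", "CP068276.2", "CP068275.2"]

for label, names in (("RefSeq", refseq), ("GenBank", genbank)):
    kept, skipped = partition_contigs(names)
    print(f"{label}: kept={kept} skipped={skipped}")

# With the flag set, the RefSeq names are all kept.
print(partition_contigs(refseq, include_alt_chroms=True))
```

Under that assumption, every RefSeq name lands in the skipped list by default, which would be consistent with the header-only output files and the absence of an error.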