graphtyper
graphtyper copied to clipboard
Error in genotype reading temporary input fasta
I've started a run of graphtyper with a command for each chromosome of my reference: graphtyper genotype --threads 16 --verbose --sams=bamlist --region=H1_8_48-4_1 REF.fna graphtyper genotype --threads 16 --verbose --sams=bamlist --region=H1_8_48-4_2 REF.fna etc.
It seems to run partially but only produces empty vcf files (just the header) for the first chunk of each contig, ie results/H1_8_48-4_1/000000001-000050000.vcf.gz
There is also an error within the run of each command:
[2021-02-20 16:07:05.800913]
When I look at this genome.fa file, it appears to exist and contain all of the contigs in fasta format.
It's hard to say what the problem is, the genome.fa
is opened but no data is read in the region H1_8_48-4_1:50001-100000 while the region H1_8_48-4_1:1-50000 appears to work (no output variants but some reference sequence is read). I have never seen this error before. Can you show me some of the content of those two regions? If you can compress the fasta file and share via Dropbox/Drive that's even better.
Best, Hannes
Hi,
Thank you for looking into it. Here is a link to download the compressed reference genome: https://drive.google.com/file/d/1xmcQkP4yaxidwv_hja0P5UjOkPEwpINS/view?usp=sharing
Best, Alex On Feb 22, 2021, 2:41 AM -0800, Hannes Pétur Eggertsson [email protected], wrote:
It's hard to say what the problem is, the genome.fa is opened but no data is read in the region H1_8_48-4_1:50001-100000 while the region H1_8_48-4_1:1-50000 appears to work (no output variants but some reference sequence is read). I have never seen this error before. Can you show me some of the content of those two regions? If you can compress the fasta file and share via Dropbox/Drive that's even better. Best, Hannes — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.
Thank you for uploading the FASTA for me. I found the problem which is that my parser for the region name ("H1_8_48-4_1:1-50000") fails because it didn't correctly handle the dash ("-") in the contig name. This is an edge case I didn't test or think about when writing the parser. The problem can be fixed by replacing region.find('-') with region.find('-', colon + 1) in src/graph/genomic_region.cpp:108 and will of course also be fixed in the next graphtyper version.
Best, Hannes
Thank you! I will try that. I also had issues in the past with reference contigs with pipes “|” in the name, such as those downloaded from NCBI. In those cases I just renamed the contigs. The issue there was mainly in the old pipeline script thinking it was a bash pipe command.
Best, Alex On Mar 2, 2021, 7:37 AM -0800, Hannes Pétur Eggertsson [email protected], wrote:
Thank you for uploading the FASTA for me. I found the problem which is that my parser for the region name ("H1_8_48-4_1:1-50000") fails because it didn't correctly handle the dash ("-") in the contig name. This is an edge case I didn't test or think about when writing the parser. The problem can be fixed by replacing region.find('-') with region.find('-', colon + 1) in src/graph/genomic_region.cpp:108 and will of course also be fixed in the next graphtyper version. Best, Hannes — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.