graphtyper icon indicating copy to clipboard operation
graphtyper copied to clipboard

Error in genotype reading temporary input fasta

Open alexweisberg opened this issue 3 years ago • 4 comments

I've started a run of graphtyper with a command for each chromosome of my reference: graphtyper genotype --threads 16 --verbose --sams=bamlist --region=H1_8_48-4_1 REF.fna graphtyper genotype --threads 16 --verbose --sams=bamlist --region=H1_8_48-4_2 REF.fna etc.

It seems to run partially but only produces empty vcf files (just the header) for the first chunk of each contig, ie results/H1_8_48-4_1/000000001-000050000.vcf.gz

There is also an error within the run of each command: [2021-02-20 16:07:05.800913] Running the 'genotype' subcommand. [2021-02-20 16:07:07.307710] Genotyping region H1_8_48-4_1:1-50000 [2021-02-20 16:07:07.307761] Path to genome is 'REF.fna' [2021-02-20 16:07:07.307775] Running with up to 16 threads. [2021-02-20 16:07:07.307787] Copying data from 254 input SAM/BAM/CRAMs to local disk. [2021-02-20 16:07:07.308848] Temporary folder is /home/myaccountname/tmp/graphtyper_210220_160707_H1_8_48-4_1_000000001.fan0YQ [2021-02-20 16:07:07.600229] Copying reference genome FASTA and its index to temporary folder. [2021-02-20 16:07:08.054201] Running bamShrink. [2021-02-20 16:08:21.152765] Finished copying data. Thread work: 16/16/18/16/17/16/16/16/16/15/14/16/16/15/16/15 [2021-02-20 16:08:21.217432] Skipping merging step. Max files open are 512 [2021-02-20 16:08:21.217470] Number of bamShrinked files are 254 running accross 16 threads. [2021-02-20 16:08:21.217493] Initial variant discovery step starting. [2021-02-20 16:08:21.987306] Finished initial variant discovery step. Thread work info: 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1 [2021-02-20 16:08:22.008219] vcf.cpp:1040 no variants to write to VCF. [2021-02-20 16:08:22.020029] Further variant discovery step starting. [2021-02-20 16:08:25.554158] Finished calling. Thread work: 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1 [2021-02-20 16:08:25.608084] vcf.cpp:1040 no variants to write to VCF. [2021-02-20 16:08:25.630614] Call step 1 starting. [2021-02-20 16:08:28.902938] Finished calling. Thread work: 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1 [2021-02-20 16:08:28.952581] vcf.cpp:1040 no variants to write to VCF. [2021-02-20 16:08:28.964394] Call step 2 starting. [2021-02-20 16:08:32.241503] Finished calling. Thread work: 1/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1 [2021-02-20 16:08:32.241661] Merging output VCFs. [2021-02-20 16:08:32.271422] vcf.cpp:1040 no variants to write to VCF. [2021-02-20 16:08:32.288074] Copying results to output directory. [2021-02-20 16:08:33.332411] Cleaning up temporary files. [2021-02-20 16:08:33.913897] Finished! Output written at: results/H1_8_48-4_1/000000001-000050000.vcf.gz [2021-02-20 16:08:33.914035] Genotyping region H1_8_48-4_1:50001-100000 [2021-02-20 16:08:33.914052] Path to genome is 'REF.fna' [2021-02-20 16:08:33.914064] Running with up to 16 threads. [2021-02-20 16:08:33.914075] Copying data from 254 input SAM/BAM/CRAMs to local disk. [2021-02-20 16:08:33.915869] Temporary folder is /home/myaccountname/tmp/graphtyper_210220_160833_H1_8_48-4_1_000050001.2tm4zO [2021-02-20 16:08:33.935231] Copying reference genome FASTA and its index to temporary folder. [2021-02-20 16:08:34.765225] Running bamShrink. [2021-02-20 16:09:10.934454] Finished copying data. Thread work: 16/17/15/16/18/15/16/16/17/15/16/17/14/16/15/15 [2021-02-20 16:09:11.903143] Skipping merging step. Max files open are 512 [2021-02-20 16:09:11.903203] Number of bamShrinked files are 254 running accross 16 threads. [2021-02-20 16:09:11.903221] Initial variant discovery step starting. [2021-02-20 16:09:11.905239] constructor.cpp:1733 Failed reading input FASTA file /home/myaccountname/tmp/graphtyper_210220_160833_H1_8_48-4_1_000050001.2tm4zO/genome.fa

When I look at this genome.fa file, it appears to exist and contain all of the contigs in fasta format.

alexweisberg avatar Feb 21 '21 01:02 alexweisberg

It's hard to say what the problem is, the genome.fa is opened but no data is read in the region H1_8_48-4_1:50001-100000 while the region H1_8_48-4_1:1-50000 appears to work (no output variants but some reference sequence is read). I have never seen this error before. Can you show me some of the content of those two regions? If you can compress the fasta file and share via Dropbox/Drive that's even better.

Best, Hannes

hannespetur avatar Feb 22 '21 10:02 hannespetur

Hi,

Thank you for looking into it. Here is a link to download the compressed reference genome: https://drive.google.com/file/d/1xmcQkP4yaxidwv_hja0P5UjOkPEwpINS/view?usp=sharing

Best, Alex On Feb 22, 2021, 2:41 AM -0800, Hannes Pétur Eggertsson [email protected], wrote:

It's hard to say what the problem is, the genome.fa is opened but no data is read in the region H1_8_48-4_1:50001-100000 while the region H1_8_48-4_1:1-50000 appears to work (no output variants but some reference sequence is read). I have never seen this error before. Can you show me some of the content of those two regions? If you can compress the fasta file and share via Dropbox/Drive that's even better. Best, Hannes — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

alexweisberg avatar Feb 24 '21 18:02 alexweisberg

Thank you for uploading the FASTA for me. I found the problem which is that my parser for the region name ("H1_8_48-4_1:1-50000") fails because it didn't correctly handle the dash ("-") in the contig name. This is an edge case I didn't test or think about when writing the parser. The problem can be fixed by replacing region.find('-') with region.find('-', colon + 1) in src/graph/genomic_region.cpp:108 and will of course also be fixed in the next graphtyper version.

Best, Hannes

hannespetur avatar Mar 02 '21 15:03 hannespetur

Thank you! I will try that. I also had issues in the past with reference contigs with pipes “|” in the name, such as those downloaded from NCBI. In those cases I just renamed the contigs. The issue there was mainly in the old pipeline script thinking it was a bash pipe command.

Best, Alex On Mar 2, 2021, 7:37 AM -0800, Hannes Pétur Eggertsson [email protected], wrote:

Thank you for uploading the FASTA for me. I found the problem which is that my parser for the region name ("H1_8_48-4_1:1-50000") fails because it didn't correctly handle the dash ("-") in the contig name. This is an edge case I didn't test or think about when writing the parser. The problem can be fixed by replacing region.find('-') with region.find('-', colon + 1) in src/graph/genomic_region.cpp:108 and will of course also be fixed in the next graphtyper version. Best, Hannes — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

alexweisberg avatar Mar 02 '21 18:03 alexweisberg