Monopogen
Monopogen copied to clipboard
Issues in the test file chr20.maester_scRNA.bam for somatic mutation calling
Hi,
Thanks so much for developing this tool! I've been trying to get the test dataset running to call somatic mutations, but have been having some trouble. I downloaded the file chr20.maester_scRNA.bam
from https://drive.google.com/file/d/1nS2rjrab-QSiq-FhpTWOtJesCE9iS_0k/view?usp=share_link, but it appears to be broken.
I ran the code in this directory:
$ ls
bam.lst
CB_7K.maester_scRNA.csv
CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chr20_2Mb.hg38.fa
chr20_2Mb.hg38.fa.fai
chr20.maester_scRNA.bam
chr20.maester_scRNA.bam.bai
region.lst
The downloaded chr20.maester_scRNA.bam
file looks like this:
$ samtools view chr20.maester_scRNA.bam | head -2
SRR15598776.294805248_TCTTGCGCAGGCACAC_CTCTCCTCGCCA 2064 20 60218 0 10H20M61H * 0 0 GCAATCCTTTCCTCTCCATT ,,:,:,,,,,,,,,,,,::F NM:i:0 MD:Z:20 AS:i:20 XS:i:19 SA:Z:5,113545723,-,22S37M32S,0,0;5,156763440,+,22S19M50S,0,0; XA:Z:11,+88315939,60S19M12S,0;
SRR15598776.80450257_AATGAAGTCGCGGACT_AAGTATACGCTC 16 20 61569 0 43S19M29S * 0 0 TATACATATGTCCCCTCCCTCTTGAATCTCACCTTCTACCCCCGACTCCATTCCACTCCTCTAGGTTGTCACAGAGCACCGCATTTGGCTC FF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:19 AS:i:19 XS:i:19 SA:Z:5,64380457,-,19S19M53S,0,0;15,76070764,-,5S19M67S,0,0;X,10097594,-,67S19M5S,0,0; XA:Z:Y,-10959620,44S19M28S,0;KI270736.1,+62338,28S19M44S,0;
$ samtools view chr20.maester_scRNA.bam | tail -2
SRR15598776.418134198_TTTCATGAGCAACTTC_TCACGTTTCCTA 2064 20 39999757 0 25H24M42H * 0 0 ATATTTAAAATTTTATTTTTTAAT ,,:FF,:,FF,,::FF,,,FF,:, NM:i:0 MD:Z:24 AS:i:24 XS:i:23 SA:Z:X,46457624,-,3M1I25M62S,0,1;6,124139067,-,53S23M15S,0,0;7,129403981,-,66S22M3S,0,0;8,102417610,+,30S20M41S,0,0;
SRR15598776.180926021_GACTCAAAGTTCACTG_ACCATATAAAAT 2048 20 39999869 0 70H21M * 0 0 ATTTACTATAAAAATAGTTAC F,,:,,,,,:::,,,,,,,,, NM:i:1 MD:Z:20T0 AS:i:20 XS:i:20 SA:Z:1,82576961,-,21S39M31S,0,0;4,180292349,+,20S19M52S,0,0; XA:Z:5,-14292491,7S25M59S,1;11,+95173927,67S19M5S,0;14,-21926828,8S19M64S,0;
$ samtools view -H chr20.maester_scRNA.bam
@HD VN:1.3 SO:coordinate
@SQ SN:1 LN:248956422
@SQ SN:10 LN:133797422
@SQ SN:11 LN:135086622
@SQ SN:12 LN:133275309
@SQ SN:13 LN:114364328
@SQ SN:14 LN:107043718
@SQ SN:15 LN:101991189
@SQ SN:16 LN:90338345
@SQ SN:17 LN:83257441
@SQ SN:18 LN:80373285
@SQ SN:19 LN:58617616
@SQ SN:2 LN:242193529
@SQ SN:20 LN:64444167
@SQ SN:21 LN:46709983
@SQ SN:22 LN:50818468
@SQ SN:3 LN:198295559
@SQ SN:4 LN:190214555
@SQ SN:5 LN:181538259
@SQ SN:6 LN:170805979
@SQ SN:7 LN:159345973
@SQ SN:8 LN:145138636
@SQ SN:9 LN:138394717
@SQ SN:MT LN:16569
@SQ SN:X LN:156040895
@SQ SN:Y LN:57227415
@SQ SN:KI270728.1 LN:1872759
@SQ SN:KI270727.1 LN:448248
@SQ SN:KI270442.1 LN:392061
@SQ SN:KI270729.1 LN:280839
@SQ SN:GL000225.1 LN:211173
@SQ SN:KI270743.1 LN:210658
@SQ SN:GL000008.2 LN:209709
@SQ SN:GL000009.2 LN:201709
@SQ SN:KI270747.1 LN:198735
@SQ SN:KI270722.1 LN:194050
@SQ SN:GL000194.1 LN:191469
@SQ SN:KI270742.1 LN:186739
@SQ SN:GL000205.2 LN:185591
@SQ SN:GL000195.1 LN:182896
@SQ SN:KI270736.1 LN:181920
@SQ SN:KI270733.1 LN:179772
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000219.1 LN:179198
@SQ SN:KI270719.1 LN:176845
@SQ SN:GL000216.2 LN:176608
@SQ SN:KI270712.1 LN:176043
@SQ SN:KI270706.1 LN:175055
@SQ SN:KI270725.1 LN:172810
@SQ SN:KI270744.1 LN:168472
@SQ SN:KI270734.1 LN:165050
@SQ SN:GL000213.1 LN:164239
@SQ SN:GL000220.1 LN:161802
@SQ SN:KI270715.1 LN:161471
@SQ SN:GL000218.1 LN:161147
@SQ SN:KI270749.1 LN:158759
@SQ SN:KI270741.1 LN:157432
@SQ SN:GL000221.1 LN:155397
@SQ SN:KI270716.1 LN:153799
@SQ SN:KI270731.1 LN:150754
@SQ SN:KI270751.1 LN:150742
@SQ SN:KI270750.1 LN:148850
@SQ SN:KI270519.1 LN:138126
@SQ SN:GL000214.1 LN:137718
@SQ SN:KI270708.1 LN:127682
@SQ SN:KI270730.1 LN:112551
@SQ SN:KI270438.1 LN:112505
@SQ SN:KI270737.1 LN:103838
@SQ SN:KI270721.1 LN:100316
@SQ SN:KI270738.1 LN:99375
@SQ SN:KI270748.1 LN:93321
@SQ SN:KI270435.1 LN:92983
@SQ SN:GL000208.1 LN:92689
@SQ SN:KI270538.1 LN:91309
@SQ SN:KI270756.1 LN:79590
@SQ SN:KI270739.1 LN:73985
@SQ SN:KI270757.1 LN:71251
@SQ SN:KI270709.1 LN:66860
@SQ SN:KI270746.1 LN:66486
@SQ SN:KI270753.1 LN:62944
@SQ SN:KI270589.1 LN:44474
@SQ SN:KI270726.1 LN:43739
@SQ SN:KI270735.1 LN:42811
@SQ SN:KI270711.1 LN:42210
@SQ SN:KI270745.1 LN:41891
@SQ SN:KI270714.1 LN:41717
@SQ SN:KI270732.1 LN:41543
@SQ SN:KI270713.1 LN:40745
@SQ SN:KI270754.1 LN:40191
@SQ SN:KI270710.1 LN:40176
@SQ SN:KI270717.1 LN:40062
@SQ SN:KI270724.1 LN:39555
@SQ SN:KI270720.1 LN:39050
@SQ SN:KI270723.1 LN:38115
@SQ SN:KI270718.1 LN:38054
@SQ SN:KI270317.1 LN:37690
@SQ SN:KI270740.1 LN:37240
@SQ SN:KI270755.1 LN:36723
@SQ SN:KI270707.1 LN:32032
@SQ SN:KI270579.1 LN:31033
@SQ SN:KI270752.1 LN:27745
@SQ SN:KI270512.1 LN:22689
@SQ SN:KI270322.1 LN:21476
@SQ SN:GL000226.1 LN:15008
@SQ SN:KI270311.1 LN:12399
@SQ SN:KI270366.1 LN:8320
@SQ SN:KI270511.1 LN:8127
@SQ SN:KI270448.1 LN:7992
@SQ SN:KI270521.1 LN:7642
@SQ SN:KI270581.1 LN:7046
@SQ SN:KI270582.1 LN:6504
@SQ SN:KI270515.1 LN:6361
@SQ SN:KI270588.1 LN:6158
@SQ SN:KI270591.1 LN:5796
@SQ SN:KI270522.1 LN:5674
@SQ SN:KI270507.1 LN:5353
@SQ SN:KI270590.1 LN:4685
@SQ SN:KI270584.1 LN:4513
@SQ SN:KI270320.1 LN:4416
@SQ SN:KI270382.1 LN:4215
@SQ SN:KI270468.1 LN:4055
@SQ SN:KI270467.1 LN:3920
@SQ SN:KI270362.1 LN:3530
@SQ SN:KI270517.1 LN:3253
@SQ SN:KI270593.1 LN:3041
@SQ SN:KI270528.1 LN:2983
@SQ SN:KI270587.1 LN:2969
@SQ SN:KI270364.1 LN:2855
@SQ SN:KI270371.1 LN:2805
@SQ SN:KI270333.1 LN:2699
@SQ SN:KI270374.1 LN:2656
@SQ SN:KI270411.1 LN:2646
@SQ SN:KI270414.1 LN:2489
@SQ SN:KI270510.1 LN:2415
@SQ SN:KI270390.1 LN:2387
@SQ SN:KI270375.1 LN:2378
@SQ SN:KI270420.1 LN:2321
@SQ SN:KI270509.1 LN:2318
@SQ SN:KI270315.1 LN:2276
@SQ SN:KI270302.1 LN:2274
@SQ SN:KI270518.1 LN:2186
@SQ SN:KI270530.1 LN:2168
@SQ SN:KI270304.1 LN:2165
@SQ SN:KI270418.1 LN:2145
@SQ SN:KI270424.1 LN:2140
@SQ SN:KI270417.1 LN:2043
@SQ SN:KI270508.1 LN:1951
@SQ SN:KI270303.1 LN:1942
@SQ SN:KI270381.1 LN:1930
@SQ SN:KI270529.1 LN:1899
@SQ SN:KI270425.1 LN:1884
@SQ SN:KI270396.1 LN:1880
@SQ SN:KI270363.1 LN:1803
@SQ SN:KI270386.1 LN:1788
@SQ SN:KI270465.1 LN:1774
@SQ SN:KI270383.1 LN:1750
@SQ SN:KI270384.1 LN:1658
@SQ SN:KI270330.1 LN:1652
@SQ SN:KI270372.1 LN:1650
@SQ SN:KI270548.1 LN:1599
@SQ SN:KI270580.1 LN:1553
@SQ SN:KI270387.1 LN:1537
@SQ SN:KI270391.1 LN:1484
@SQ SN:KI270305.1 LN:1472
@SQ SN:KI270373.1 LN:1451
@SQ SN:KI270422.1 LN:1445
@SQ SN:KI270316.1 LN:1444
@SQ SN:KI270340.1 LN:1428
@SQ SN:KI270338.1 LN:1428
@SQ SN:KI270583.1 LN:1400
@SQ SN:KI270334.1 LN:1368
@SQ SN:KI270429.1 LN:1361
@SQ SN:KI270393.1 LN:1308
@SQ SN:KI270516.1 LN:1300
@SQ SN:KI270389.1 LN:1298
@SQ SN:KI270466.1 LN:1233
@SQ SN:KI270388.1 LN:1216
@SQ SN:KI270544.1 LN:1202
@SQ SN:KI270310.1 LN:1201
@SQ SN:KI270412.1 LN:1179
@SQ SN:KI270395.1 LN:1143
@SQ SN:KI270376.1 LN:1136
@SQ SN:KI270337.1 LN:1121
@SQ SN:KI270335.1 LN:1048
@SQ SN:KI270378.1 LN:1048
@SQ SN:KI270379.1 LN:1045
@SQ SN:KI270329.1 LN:1040
@SQ SN:KI270419.1 LN:1029
@SQ SN:KI270336.1 LN:1026
@SQ SN:KI270312.1 LN:998
@SQ SN:KI270539.1 LN:993
@SQ SN:KI270385.1 LN:990
@SQ SN:KI270423.1 LN:981
@SQ SN:KI270392.1 LN:971
@SQ SN:KI270394.1 LN:970
@RG ID:out_sorted SM:out_sorted LB:0.1 PU:out_sorted PL:ILLUMINA
@PG PN:bwa ID:bwa VN:0.7.17-r1198-dirty CL:/risapps/rhel7/bwa/0.7.17/bwa mem -t24 -T0 /rsrch3/scratch/bcb/jdou1/PM1645_CRISPR/hg38/genome test.CB.fastq
@PG ID:samtools PN:samtools PP:bwa VN:1.18 CL:samtools view -H chr20.maester_scRNA.bam
The preprocessing step appears to run fine:
$ python ${path}/src/Monopogen.py preProcess \
> -b bam.lst \
> -o bm \
> -a ${path}/apps \
> -t 8
[2023-12-12 12:22:01,044] INFO Monopogen.py Performing data preprocess before variant calling...
[2023-12-12 12:22:01,044] INFO germline.py Parameters in effect:
[2023-12-12 12:22:01,044] INFO germline.py --subcommand = [preProcess]
[2023-12-12 12:22:01,044] INFO germline.py --bamFile = [bam.lst]
[2023-12-12 12:22:01,044] INFO germline.py --out = [bm]
[2023-12-12 12:22:01,044] INFO germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:22:01,044] INFO germline.py --max_mismatch = [3]
[2023-12-12 12:22:01,044] INFO germline.py --nthreads = [8]
[2023-12-12 12:22:01,055] DEBUG Monopogen.py PreProcessing sample bm
[2023-12-12 12:22:01,089] INFO germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,090] INFO germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,090] INFO germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,091] INFO germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,092] INFO germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,094] INFO germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,094] INFO germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,096] INFO germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,173] INFO germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,176] INFO germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,178] INFO germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,179] INFO germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,180] INFO germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,183] INFO germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,188] INFO germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,191] INFO germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,243] INFO germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,249] INFO germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,250] INFO germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,254] INFO germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,257] INFO germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:22:01,264] INFO germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it
[2023-12-12 12:24:17,990] INFO Monopogen.py Success! See instructions above.
The germline step emits lots of errors / warnings but still reports that it ran successfully:
$ python ${path}/src/Monopogen.py germline \
> -a ${path}/apps \
> -r region.lst \
> -p ./ \
> -g chr20_2Mb.hg38.fa \
> -m 3 \
> -t 8 \
> -s all \
> -o bm
[2023-12-12 12:31:53,984] INFO Monopogen.py Performing germline variant calling...
[2023-12-12 12:31:53,985] INFO germline.py Parameters in effect:
[2023-12-12 12:31:53,985] INFO germline.py --subcommand = [germline]
[2023-12-12 12:31:53,985] INFO germline.py --region = [region.lst]
[2023-12-12 12:31:53,985] INFO germline.py --step = [all]
[2023-12-12 12:31:53,985] INFO germline.py --out = [bm]
[2023-12-12 12:31:53,985] INFO germline.py --reference = [chr20_2Mb.hg38.fa]
[2023-12-12 12:31:53,985] INFO germline.py --imputation_panel = [./]
[2023-12-12 12:31:53,985] INFO germline.py --max_softClipped = [3]
[2023-12-12 12:31:53,985] INFO germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:31:53,985] INFO germline.py --nthreads = [8]
[2023-12-12 12:31:53,985] INFO germline.py --norun = [FALSE]
[2023-12-12 12:31:53,985] INFO Monopogen.py Checking existence of essenstial resource files...
[2023-12-12 12:31:54,050] INFO Monopogen.py Checking dependencies...
['bash bm/Script/runGermline_chr20.sh']
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Reference allele mismatch at chr20:3000001 .. REF_SEQ:'A' vs VCF:'N'
[W::vcf_parse] Contig 'i' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:3810050 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'A' is not defined in the header
[W::vcf_parse] INFO '0' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0'
[W::vcf_parse] Contig '|' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '50,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=50,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '50,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5046788 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '213,96,0:32' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=213,96,0:32,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '213,96,0:32'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5847394 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '48,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=48,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '48,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:8188931 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0'
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:17944845 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'T' is not defined in the header
[W::vcf_parse] Contig '?=' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:25478115 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?=:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:33389937 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '187,27,0:9' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=187,27,0:9,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '187,27,0:9'
[E::bcf_write] Broken VCF record, the number of columns at chr20:34805043 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '?' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[E::bcf_write] Broken VCF record, the number of columns at chr20:36572282 does not match the number of samples (0 vs 1)
Error: VCF parse error
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023
Command line: java -Xmx20480m -jar beagle.jar
gl=bm/germline/chr20.gl.vcf.gz
ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chrom=chr20
out=bm/germline/chr20.gp
impute=false
modelscale=2
nthreads=24
gprobs=true
niterations=0
No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:919)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at vcf.VcfIt.fillEmissionBuffer(VcfIt.java:307)
at vcf.VcfIt.next(VcfIt.java:363)
at vcf.VcfIt.next(VcfIt.java:52)
at vcf.IntervalVcfIt.readNextRecord(IntervalVcfIt.java:110)
at vcf.IntervalVcfIt.next(IntervalVcfIt.java:92)
at vcf.IntervalVcfIt.next(IntervalVcfIt.java:36)
at main.Main.restrictToVcfMarkers(Main.java:343)
at main.Main.allData(Main.java:313)
at main.Main.main(Main.java:111)
Caused by: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
at vcf.VcfRecord.fieldCountError(VcfRecord.java:221)
at vcf.VcfRecord.delimiters(VcfRecord.java:203)
at vcf.VcfRecord.<init>(VcfRecord.java:87)
at vcf.VcfRecord.fromGTGL(VcfRecord.java:193)
at vcf.VcfIt.lambda$static$5(VcfIt.java:76)
at vcf.VcfIt.lambda$new$8(VcfIt.java:192)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
gzip: bm/germline/chr20.gp.vcf.gz: No such file or directory
bm/germline/chr20.gp.vcf.gz: No such file or directory
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023
Command line: java -Xmx20480m -jar beagle.jar
gt=bm/germline/chr20.germline.vcf
ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chrom=chr20
out=bm/germline/chr20.phased
impute=false
modelscale=2
nthreads=24
gprobs=true
niterations=0
No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: bm/germline/chr20.germline.vcf
null
at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
at vcf.VcfHeader.<init>(VcfHeader.java:119)
at vcf.VcfIt.<init>(VcfIt.java:190)
at vcf.VcfIt.create(VcfIt.java:175)
at vcf.VcfIt.create(VcfIt.java:150)
at main.Main.allData(Main.java:297)
at main.Main.main(Main.java:111)
[2023-12-12 12:36:05,744] INFO Monopogen.py Success! See instructions above.
Then when I run the somatic featureInfo step, I get an error:
$ python ${path}/src/Monopogen.py somatic \
> -a ${path}/apps \
> -r region.lst \
> -t 50 \
> -i bm \
> -l CB_7K.maester_scRNA.csv \
> -s featureInfo \
> -g chr20_2Mb.hg38.fa
[2023-12-12 12:38:06,217] INFO Monopogen.py Get feature information from sequencing data...
[E::hts_open_format] Failed to open file "bm/germline/chr20.phased.vcf.gz" : No such file or directory
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/nfs/team205/at31/software/Monopogen/src/somatic.py", line 88, in featureInfo
vcf_in = VariantFile(out + "/germline/" + region + ".phased.vcf.gz")
File "pysam/libcbcf.pyx", line 4117, in pysam.libcbcf.VariantFile.__init__
File "pysam/libcbcf.pyx", line 4342, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 436, in <module>
main()
File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 429, in main
args.func(args)
File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 150, in somatic
result = pool.map(featureInfo, joblst)
File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory
Are you familiar with this issue? I re-downloaded chr20.maester_scRNA.bam
multiple times to make sure it wasn't somehow truncated during the download, but the issue persists. Do you have any idea what I'm doing wrong?
Thanks so much in advance!