Monopogen icon indicating copy to clipboard operation
Monopogen copied to clipboard

Issues in the test file chr20.maester_scRNA.bam for somatic mutation calling

Open alextidd opened this issue 6 months ago • 6 comments

Hi, Thanks so much for developing this tool! I've been trying to get the test dataset running to call somatic mutations, but have been having some trouble. I downloaded the file chr20.maester_scRNA.bam from https://drive.google.com/file/d/1nS2rjrab-QSiq-FhpTWOtJesCE9iS_0k/view?usp=share_link, but it appears to be broken.

I ran the code in this directory:

$ ls 
bam.lst
CB_7K.maester_scRNA.csv
CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
chr20_2Mb.hg38.fa
chr20_2Mb.hg38.fa.fai
chr20.maester_scRNA.bam
chr20.maester_scRNA.bam.bai
region.lst

The downloaded chr20.maester_scRNA.bam file looks like this:

$ samtools view chr20.maester_scRNA.bam | head -2

SRR15598776.294805248_TCTTGCGCAGGCACAC_CTCTCCTCGCCA	2064	20	60218	0	10H20M61H	*	0	0	GCAATCCTTTCCTCTCCATT	,,:,:,,,,,,,,,,,,::F	NM:i:0	MD:Z:20	AS:i:20	XS:i:19	SA:Z:5,113545723,-,22S37M32S,0,0;5,156763440,+,22S19M50S,0,0;	XA:Z:11,+88315939,60S19M12S,0;
SRR15598776.80450257_AATGAAGTCGCGGACT_AAGTATACGCTC	16	20	61569	0	43S19M29S	*	0	0	TATACATATGTCCCCTCCCTCTTGAATCTCACCTTCTACCCCCGACTCCATTCCACTCCTCTAGGTTGTCACAGAGCACCGCATTTGGCTC	FF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:19	AS:i:19	XS:i:19	SA:Z:5,64380457,-,19S19M53S,0,0;15,76070764,-,5S19M67S,0,0;X,10097594,-,67S19M5S,0,0;	XA:Z:Y,-10959620,44S19M28S,0;KI270736.1,+62338,28S19M44S,0;

$ samtools view chr20.maester_scRNA.bam | tail -2

SRR15598776.418134198_TTTCATGAGCAACTTC_TCACGTTTCCTA	2064	20	39999757	0	25H24M42H	*	0	0	ATATTTAAAATTTTATTTTTTAAT	,,:FF,:,FF,,::FF,,,FF,:,	NM:i:0	MD:Z:24	AS:i:24	XS:i:23	SA:Z:X,46457624,-,3M1I25M62S,0,1;6,124139067,-,53S23M15S,0,0;7,129403981,-,66S22M3S,0,0;8,102417610,+,30S20M41S,0,0;
SRR15598776.180926021_GACTCAAAGTTCACTG_ACCATATAAAAT	2048	20	39999869	0	70H21M	*	0	0	ATTTACTATAAAAATAGTTAC	F,,:,,,,,:::,,,,,,,,,	NM:i:1	MD:Z:20T0	AS:i:20	XS:i:20	SA:Z:1,82576961,-,21S39M31S,0,0;4,180292349,+,20S19M52S,0,0;	XA:Z:5,-14292491,7S25M59S,1;11,+95173927,67S19M5S,0;14,-21926828,8S19M64S,0;

$ samtools view -H chr20.maester_scRNA.bam

@HD	VN:1.3	SO:coordinate
@SQ	SN:1	LN:248956422
@SQ	SN:10	LN:133797422
@SQ	SN:11	LN:135086622
@SQ	SN:12	LN:133275309
@SQ	SN:13	LN:114364328
@SQ	SN:14	LN:107043718
@SQ	SN:15	LN:101991189
@SQ	SN:16	LN:90338345
@SQ	SN:17	LN:83257441
@SQ	SN:18	LN:80373285
@SQ	SN:19	LN:58617616
@SQ	SN:2	LN:242193529
@SQ	SN:20	LN:64444167
@SQ	SN:21	LN:46709983
@SQ	SN:22	LN:50818468
@SQ	SN:3	LN:198295559
@SQ	SN:4	LN:190214555
@SQ	SN:5	LN:181538259
@SQ	SN:6	LN:170805979
@SQ	SN:7	LN:159345973
@SQ	SN:8	LN:145138636
@SQ	SN:9	LN:138394717
@SQ	SN:MT	LN:16569
@SQ	SN:X	LN:156040895
@SQ	SN:Y	LN:57227415
@SQ	SN:KI270728.1	LN:1872759
@SQ	SN:KI270727.1	LN:448248
@SQ	SN:KI270442.1	LN:392061
@SQ	SN:KI270729.1	LN:280839
@SQ	SN:GL000225.1	LN:211173
@SQ	SN:KI270743.1	LN:210658
@SQ	SN:GL000008.2	LN:209709
@SQ	SN:GL000009.2	LN:201709
@SQ	SN:KI270747.1	LN:198735
@SQ	SN:KI270722.1	LN:194050
@SQ	SN:GL000194.1	LN:191469
@SQ	SN:KI270742.1	LN:186739
@SQ	SN:GL000205.2	LN:185591
@SQ	SN:GL000195.1	LN:182896
@SQ	SN:KI270736.1	LN:181920
@SQ	SN:KI270733.1	LN:179772
@SQ	SN:GL000224.1	LN:179693
@SQ	SN:GL000219.1	LN:179198
@SQ	SN:KI270719.1	LN:176845
@SQ	SN:GL000216.2	LN:176608
@SQ	SN:KI270712.1	LN:176043
@SQ	SN:KI270706.1	LN:175055
@SQ	SN:KI270725.1	LN:172810
@SQ	SN:KI270744.1	LN:168472
@SQ	SN:KI270734.1	LN:165050
@SQ	SN:GL000213.1	LN:164239
@SQ	SN:GL000220.1	LN:161802
@SQ	SN:KI270715.1	LN:161471
@SQ	SN:GL000218.1	LN:161147
@SQ	SN:KI270749.1	LN:158759
@SQ	SN:KI270741.1	LN:157432
@SQ	SN:GL000221.1	LN:155397
@SQ	SN:KI270716.1	LN:153799
@SQ	SN:KI270731.1	LN:150754
@SQ	SN:KI270751.1	LN:150742
@SQ	SN:KI270750.1	LN:148850
@SQ	SN:KI270519.1	LN:138126
@SQ	SN:GL000214.1	LN:137718
@SQ	SN:KI270708.1	LN:127682
@SQ	SN:KI270730.1	LN:112551
@SQ	SN:KI270438.1	LN:112505
@SQ	SN:KI270737.1	LN:103838
@SQ	SN:KI270721.1	LN:100316
@SQ	SN:KI270738.1	LN:99375
@SQ	SN:KI270748.1	LN:93321
@SQ	SN:KI270435.1	LN:92983
@SQ	SN:GL000208.1	LN:92689
@SQ	SN:KI270538.1	LN:91309
@SQ	SN:KI270756.1	LN:79590
@SQ	SN:KI270739.1	LN:73985
@SQ	SN:KI270757.1	LN:71251
@SQ	SN:KI270709.1	LN:66860
@SQ	SN:KI270746.1	LN:66486
@SQ	SN:KI270753.1	LN:62944
@SQ	SN:KI270589.1	LN:44474
@SQ	SN:KI270726.1	LN:43739
@SQ	SN:KI270735.1	LN:42811
@SQ	SN:KI270711.1	LN:42210
@SQ	SN:KI270745.1	LN:41891
@SQ	SN:KI270714.1	LN:41717
@SQ	SN:KI270732.1	LN:41543
@SQ	SN:KI270713.1	LN:40745
@SQ	SN:KI270754.1	LN:40191
@SQ	SN:KI270710.1	LN:40176
@SQ	SN:KI270717.1	LN:40062
@SQ	SN:KI270724.1	LN:39555
@SQ	SN:KI270720.1	LN:39050
@SQ	SN:KI270723.1	LN:38115
@SQ	SN:KI270718.1	LN:38054
@SQ	SN:KI270317.1	LN:37690
@SQ	SN:KI270740.1	LN:37240
@SQ	SN:KI270755.1	LN:36723
@SQ	SN:KI270707.1	LN:32032
@SQ	SN:KI270579.1	LN:31033
@SQ	SN:KI270752.1	LN:27745
@SQ	SN:KI270512.1	LN:22689
@SQ	SN:KI270322.1	LN:21476
@SQ	SN:GL000226.1	LN:15008
@SQ	SN:KI270311.1	LN:12399
@SQ	SN:KI270366.1	LN:8320
@SQ	SN:KI270511.1	LN:8127
@SQ	SN:KI270448.1	LN:7992
@SQ	SN:KI270521.1	LN:7642
@SQ	SN:KI270581.1	LN:7046
@SQ	SN:KI270582.1	LN:6504
@SQ	SN:KI270515.1	LN:6361
@SQ	SN:KI270588.1	LN:6158
@SQ	SN:KI270591.1	LN:5796
@SQ	SN:KI270522.1	LN:5674
@SQ	SN:KI270507.1	LN:5353
@SQ	SN:KI270590.1	LN:4685
@SQ	SN:KI270584.1	LN:4513
@SQ	SN:KI270320.1	LN:4416
@SQ	SN:KI270382.1	LN:4215
@SQ	SN:KI270468.1	LN:4055
@SQ	SN:KI270467.1	LN:3920
@SQ	SN:KI270362.1	LN:3530
@SQ	SN:KI270517.1	LN:3253
@SQ	SN:KI270593.1	LN:3041
@SQ	SN:KI270528.1	LN:2983
@SQ	SN:KI270587.1	LN:2969
@SQ	SN:KI270364.1	LN:2855
@SQ	SN:KI270371.1	LN:2805
@SQ	SN:KI270333.1	LN:2699
@SQ	SN:KI270374.1	LN:2656
@SQ	SN:KI270411.1	LN:2646
@SQ	SN:KI270414.1	LN:2489
@SQ	SN:KI270510.1	LN:2415
@SQ	SN:KI270390.1	LN:2387
@SQ	SN:KI270375.1	LN:2378
@SQ	SN:KI270420.1	LN:2321
@SQ	SN:KI270509.1	LN:2318
@SQ	SN:KI270315.1	LN:2276
@SQ	SN:KI270302.1	LN:2274
@SQ	SN:KI270518.1	LN:2186
@SQ	SN:KI270530.1	LN:2168
@SQ	SN:KI270304.1	LN:2165
@SQ	SN:KI270418.1	LN:2145
@SQ	SN:KI270424.1	LN:2140
@SQ	SN:KI270417.1	LN:2043
@SQ	SN:KI270508.1	LN:1951
@SQ	SN:KI270303.1	LN:1942
@SQ	SN:KI270381.1	LN:1930
@SQ	SN:KI270529.1	LN:1899
@SQ	SN:KI270425.1	LN:1884
@SQ	SN:KI270396.1	LN:1880
@SQ	SN:KI270363.1	LN:1803
@SQ	SN:KI270386.1	LN:1788
@SQ	SN:KI270465.1	LN:1774
@SQ	SN:KI270383.1	LN:1750
@SQ	SN:KI270384.1	LN:1658
@SQ	SN:KI270330.1	LN:1652
@SQ	SN:KI270372.1	LN:1650
@SQ	SN:KI270548.1	LN:1599
@SQ	SN:KI270580.1	LN:1553
@SQ	SN:KI270387.1	LN:1537
@SQ	SN:KI270391.1	LN:1484
@SQ	SN:KI270305.1	LN:1472
@SQ	SN:KI270373.1	LN:1451
@SQ	SN:KI270422.1	LN:1445
@SQ	SN:KI270316.1	LN:1444
@SQ	SN:KI270340.1	LN:1428
@SQ	SN:KI270338.1	LN:1428
@SQ	SN:KI270583.1	LN:1400
@SQ	SN:KI270334.1	LN:1368
@SQ	SN:KI270429.1	LN:1361
@SQ	SN:KI270393.1	LN:1308
@SQ	SN:KI270516.1	LN:1300
@SQ	SN:KI270389.1	LN:1298
@SQ	SN:KI270466.1	LN:1233
@SQ	SN:KI270388.1	LN:1216
@SQ	SN:KI270544.1	LN:1202
@SQ	SN:KI270310.1	LN:1201
@SQ	SN:KI270412.1	LN:1179
@SQ	SN:KI270395.1	LN:1143
@SQ	SN:KI270376.1	LN:1136
@SQ	SN:KI270337.1	LN:1121
@SQ	SN:KI270335.1	LN:1048
@SQ	SN:KI270378.1	LN:1048
@SQ	SN:KI270379.1	LN:1045
@SQ	SN:KI270329.1	LN:1040
@SQ	SN:KI270419.1	LN:1029
@SQ	SN:KI270336.1	LN:1026
@SQ	SN:KI270312.1	LN:998
@SQ	SN:KI270539.1	LN:993
@SQ	SN:KI270385.1	LN:990
@SQ	SN:KI270423.1	LN:981
@SQ	SN:KI270392.1	LN:971
@SQ	SN:KI270394.1	LN:970
@RG	ID:out_sorted	SM:out_sorted	LB:0.1	PU:out_sorted	PL:ILLUMINA
@PG	PN:bwa	ID:bwa	VN:0.7.17-r1198-dirty	CL:/risapps/rhel7/bwa/0.7.17/bwa mem -t24 -T0 /rsrch3/scratch/bcb/jdou1/PM1645_CRISPR/hg38/genome test.CB.fastq
@PG	ID:samtools	PN:samtools	PP:bwa	VN:1.18	CL:samtools view -H chr20.maester_scRNA.bam

The preprocessing step appears to run fine:

$ python ${path}/src/Monopogen.py  preProcess \
>    -b bam.lst \
>    -o bm \
>    -a ${path}/apps \
>    -t 8

[2023-12-12 12:22:01,044] INFO     Monopogen.py Performing data preprocess before variant calling...
[2023-12-12 12:22:01,044] INFO     germline.py Parameters in effect:
[2023-12-12 12:22:01,044] INFO     germline.py --subcommand = [preProcess]
[2023-12-12 12:22:01,044] INFO     germline.py --bamFile = [bam.lst]
[2023-12-12 12:22:01,044] INFO     germline.py --out = [bm]
[2023-12-12 12:22:01,044] INFO     germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:22:01,044] INFO     germline.py --max_mismatch = [3]
[2023-12-12 12:22:01,044] INFO     germline.py --nthreads = [8]
[2023-12-12 12:22:01,055] DEBUG    Monopogen.py PreProcessing sample bm
[2023-12-12 12:22:01,089] INFO     germline.py The contig chr4 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,090] INFO     germline.py The contig chr5 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,090] INFO     germline.py The contig chr6 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,091] INFO     germline.py The contig chr1 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,092] INFO     germline.py The contig chr2 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,094] INFO     germline.py The contig chr7 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,094] INFO     germline.py The contig chr3 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,096] INFO     germline.py The contig chr8 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,173] INFO     germline.py The contig chr9 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,176] INFO     germline.py The contig chr10 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,178] INFO     germline.py The contig chr11 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,179] INFO     germline.py The contig chr12 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,180] INFO     germline.py The contig chr14 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,183] INFO     germline.py The contig chr13 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,188] INFO     germline.py The contig chr15 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,191] INFO     germline.py The contig chr16 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,243] INFO     germline.py The contig chr18 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,249] INFO     germline.py The contig chr17 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,250] INFO     germline.py The contig chr19 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,254] INFO     germline.py The contig chr20 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,257] INFO     germline.py The contig chr21 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:22:01,264] INFO     germline.py The contig chr22 does not contain the prefix 'chr' and we will add 'chr' on it 
[2023-12-12 12:24:17,990] INFO     Monopogen.py Success! See instructions above.

The germline step emits lots of errors / warnings but still reports that it ran successfully:

$ python  ${path}/src/Monopogen.py  germline  \
>    -a   ${path}/apps \
>    -r  region.lst \
>    -p  ./ \
>    -g  chr20_2Mb.hg38.fa \
>    -m 3 \
>    -t 8 \
>    -s all \
>    -o bm

[2023-12-12 12:31:53,984] INFO     Monopogen.py Performing germline variant calling...
[2023-12-12 12:31:53,985] INFO     germline.py Parameters in effect:
[2023-12-12 12:31:53,985] INFO     germline.py --subcommand = [germline]
[2023-12-12 12:31:53,985] INFO     germline.py --region = [region.lst]
[2023-12-12 12:31:53,985] INFO     germline.py --step = [all]
[2023-12-12 12:31:53,985] INFO     germline.py --out = [bm]
[2023-12-12 12:31:53,985] INFO     germline.py --reference = [chr20_2Mb.hg38.fa]
[2023-12-12 12:31:53,985] INFO     germline.py --imputation_panel = [./]
[2023-12-12 12:31:53,985] INFO     germline.py --max_softClipped = [3]
[2023-12-12 12:31:53,985] INFO     germline.py --app_path = [/nfs/team205/at31/software/Monopogen/apps]
[2023-12-12 12:31:53,985] INFO     germline.py --nthreads = [8]
[2023-12-12 12:31:53,985] INFO     germline.py --norun = [FALSE]
[2023-12-12 12:31:53,985] INFO     Monopogen.py Checking existence of essenstial resource files...
[2023-12-12 12:31:54,050] INFO     Monopogen.py Checking dependencies...
['bash bm/Script/runGermline_chr20.sh']
[mpileup] 1 samples in 1 input files
(mpileup) Max depth is above 1M. Potential memory hog!
Reference allele mismatch at chr20:3000001 .. REF_SEQ:'A' vs VCF:'N'
[W::vcf_parse] Contig 'i' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:3810050 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at i:0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'A' is not defined in the header
[W::vcf_parse] INFO '0' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=2;IMF=0.0210526;DP=95;I16=0,0,0,1,0,0,22,484,0,0,32,1024,0,0,24,576;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=18;IMF=0.382979;DP=47;I16=0,0,0,2,0,0,66,2178,0,0,110,6050,0,0,18,162;QS=0,1;VDB=0.02;SGB=-0.453602;MQ0F=0'
[W::vcf_parse] Contig '|' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '50,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=50,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '50,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5046788 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '213,96,0:32' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=213,96,0:32,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '213,96,0:32'
[E::bcf_write] Broken VCF record, the number of columns at chr20:5847394 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '48,6,0:2' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=48,6,0:2,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '48,6,0:2'
[E::bcf_write] Broken VCF record, the number of columns at chr20:8188931 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.00952381;DP=105;I16=0,0,77,3,0,0,2815,104141,0,0,4800,288000,0,0,1743,41265;QS=0,1;VDB=0.000484566;SGB=-0.693147;MQSB=1;MQ0F=0'
[W::vcf_parse_format] FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=3;IMF=0.0625;DP=48;I16=0,0,0,14,0,0,355,9005,0,0,840,50400,0,0,28,98;QS=0,1;VDB=2.00699e-07;SGB=-0.686358;MQ0F=0'
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:17944845 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER 'T' is not defined in the header
[W::vcf_parse] Contig '?=' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:25478115 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?=:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[E::bcf_write] Broken VCF record, the number of columns at chr20:33389937 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at ?:0 does not match the number of samples (0 vs 1)
[E::bcf_write] Broken VCF record, the number of columns at :0 does not match the number of samples (0 vs 1)
[W::vcf_parse] Contig '?' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] FILTER '187,27,0:9' is not defined in the header
[E::bcf_hdr_parse_line] Could not parse the header line: "##FILTER=<ID=187,27,0:9,Description="Dummy">"
[E::vcf_parse] Could not add dummy header for FILTER '187,27,0:9'
[E::bcf_write] Broken VCF record, the number of columns at chr20:34805043 does not match the number of samples (0 vs 1)
[W::vcf_parse] FILTER '?' is not defined in the header
[W::vcf_parse_format] FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0' is not defined in the header, assuming Type=String
[E::bcf_hdr_parse_line] Could not parse the header line: "##FORMAT=<ID=INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0,Number=1,Type=String,Description="Dummy">"
[E::vcf_parse_format] Could not add dummy header for FORMAT 'INDEL;IDV=1;IMF=0.2;DP=5;I16=0,0,0,1,0,0,25,625,0,0,60,3600,0,0,11,121;QS=0,1;SGB=-0.379885;MQ0F=0'
[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
[E::bcf_write] Broken VCF record, the number of columns at chr20:36572282 does not match the number of samples (0 vs 1)
Error: VCF parse error
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023

Command line: java -Xmx20480m -jar beagle.jar
  gl=bm/germline/chr20.gl.vcf.gz
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.gp
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:600)
	at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:678)
	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:737)
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:919)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
	at vcf.VcfIt.fillEmissionBuffer(VcfIt.java:307)
	at vcf.VcfIt.next(VcfIt.java:363)
	at vcf.VcfIt.next(VcfIt.java:52)
	at vcf.IntervalVcfIt.readNextRecord(IntervalVcfIt.java:110)
	at vcf.IntervalVcfIt.next(IntervalVcfIt.java:92)
	at vcf.IntervalVcfIt.next(IntervalVcfIt.java:36)
	at main.Main.restrictToVcfMarkers(Main.java:343)
	at main.Main.allData(Main.java:313)
	at main.Main.main(Main.java:111)
Caused by: java.lang.IllegalArgumentException: VCF header line has 10 fields, but data line has 4 fields
File source:File source: bm/germline/chr20.gl.vcf.gz
[chr20, 2998989, ., A]
	at vcf.VcfRecord.fieldCountError(VcfRecord.java:221)
	at vcf.VcfRecord.delimiters(VcfRecord.java:203)
	at vcf.VcfRecord.<init>(VcfRecord.java:87)
	at vcf.VcfRecord.fromGTGL(VcfRecord.java:193)
	at vcf.VcfIt.lambda$static$5(VcfIt.java:76)
	at vcf.VcfIt.lambda$new$8(VcfIt.java:192)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:952)
	at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:926)
	at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
gzip: bm/germline/chr20.gp.vcf.gz: No such file or directory
bm/germline/chr20.gp.vcf.gz: No such file or directory
beagle.27Jul16.86a.jar (version 4.1)
Copyright (C) 2014-2015 Brian L. Browning
Enter "java -jar beagle.27Jul16.86a.jar" for a summary of command line arguments.
Start time: 12:36 PM GMT on 12 Dec 2023

Command line: java -Xmx20480m -jar beagle.jar
  gt=bm/germline/chr20.germline.vcf
  ref=./CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.filtered.shapeit2-duohmm-phased.vcf.gz
  chrom=chr20
  out=bm/germline/chr20.phased
  impute=false
  modelscale=2
  nthreads=24
  gprobs=true
  niterations=0

No genetic map is specified: using 1 cM = 1 Mb
Exception in thread "main" java.lang.IllegalArgumentException: Missing line (#CHROM ...) after meta-information lines
File source: bm/germline/chr20.germline.vcf
null
	at vcf.VcfHeader.checkHeaderLine(VcfHeader.java:135)
	at vcf.VcfHeader.<init>(VcfHeader.java:119)
	at vcf.VcfIt.<init>(VcfIt.java:190)
	at vcf.VcfIt.create(VcfIt.java:175)
	at vcf.VcfIt.create(VcfIt.java:150)
	at main.Main.allData(Main.java:297)
	at main.Main.main(Main.java:111)
[2023-12-12 12:36:05,744] INFO     Monopogen.py Success! See instructions above.

Then when I run the somatic featureInfo step, I get an error:

$ python ${path}/src/Monopogen.py  somatic  \
>     -a ${path}/apps \
>     -r region.lst \
>     -t 50 \
>     -i bm \
>     -l CB_7K.maester_scRNA.csv \
>     -s featureInfo \
>     -g chr20_2Mb.hg38.fa 

[2023-12-12 12:38:06,217] INFO     Monopogen.py Get feature information from sequencing data...
[E::hts_open_format] Failed to open file "bm/germline/chr20.phased.vcf.gz" : No such file or directory
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/nfs/team205/at31/software/Monopogen/src/somatic.py", line 88, in featureInfo
    vcf_in = VariantFile(out + "/germline/" +  region + ".phased.vcf.gz")
  File "pysam/libcbcf.pyx", line 4117, in pysam.libcbcf.VariantFile.__init__
  File "pysam/libcbcf.pyx", line 4342, in pysam.libcbcf.VariantFile.open
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 436, in <module>
    main()
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 429, in main
    args.func(args)
  File "/nfs/team205/at31/software/Monopogen/src/Monopogen.py", line 150, in somatic
    result = pool.map(featureInfo, joblst)
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/nfs/users/nfs_a/at31/miniforge3/envs/monopogen/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] could not open variant file `b'bm/germline/chr20.phased.vcf.gz'`: No such file or directory

Are you familiar with this issue? I re-downloaded chr20.maester_scRNA.bam multiple times to make sure it wasn't somehow truncated during the download, but the issue persists. Do you have any idea what I'm doing wrong?

Thanks so much in advance!

alextidd avatar Dec 12 '23 12:12 alextidd