hifiasm
hifiasm copied to clipboard
Questions about L, A, S -Lines in GFA format.
Hello, I'm look into the meaning of the columns of the lines in GFA files. Although I've checked the links you recommended, such as https://hifiasm.readthedocs.io/en/latest/interpreting-output.html and https://gfa-spec.github.io/GFA-spec/GFA1.html, But I didn't find what I want.. Ok, we start with the S-lines, for example:
S utg007305l * LN:i:15672 rd:i:0
S utg007306l * LN:i:15650 rd:i:0
S utg007307l * LN:i:19230 rd:i:5
S utg007308l * LN:i:17406 rd:i:0
S utg007309l * LN:i:17315 rd:i:1
Here, for Line 1, what's the meaning of rd:i:0 ?
A h2tg000001l 0 + m84128_231027_101108_s1/174523732/ccs 0 16932 id:i:2582924 HG:A:m
A h2tg000001l 3029 + m84128_231027_101108_s1/245106603/ccs 0 16949 id:i:1319739 HG:A:m
A h2tg000001l 5642 + m84128_231027_101108_s1/86315159/ccs 0 16365 id:i:942834 HG:A:m
A h2tg000001l 10180 - m84128_231027_101108_s1/14877412/ccs 0 14738 id:i:1975099 HG:A:m
A h2tg000001l 12669 + m84128_231027_101108_s1/60233120/ccs 0 15529 id:i:1191506 HG:A:m
A h2tg000001l 12846 - m84128_231027_101108_s1/190974776/ccs 0 15898 id:i:2621153 HG:A:m
A h2tg000001l 12882 + m84128_231027_101108_s1/262801143/ccs 0 17446 id:i:1396225 HG:A:m
For A lines, Line 2 means the read is 13921 bp long (16949-3029+1), so on and so forth? If this is true, the remaining lines indicate the read length is relatively short? For example, Line 5, the reads is only 2861 bp ? (15529-12669+1 ), well, N50 of my reads should be ~ 15Kb.
For L lines:
L h2tg000247c - h2tg000247c - 16912M L1:i:0 L2:i:0
L h2tg000251l + h2tg000048l - 15760M L1:i:20881 L2:i:0
L h2tg000264l + h2tg000084l - 16163M L1:i:15957 L2:i:0
L h2tg000264l + h2tg000119l - 15044M L1:i:17076 L2:i:0
L h2tg000270l + h2tg000083l - 14219M L1:i:20778 L2:i:0
For my data the final column is always L2:i:0, what is that mean? Also Line 2 means h2tg000251l and h2tg000048l have 15760 bp overlap? is that 100% exact match? Then why not connect and extend the contig? and what does L1:i:20881mean, the start position of the alignment in h2tg000251l ?
Sorry for these naive questions and thanks very much and looking forward to your reply!
Best, Tao
Similar to this issue, I can't figure out how the coverage (rd:i:n) is calculated in the hap*.p.ctg.gfa file.(or any .gfa output file). Is it the average number of hifi reads (A line) per base across the contig sequence (L line)? For example below, how can the contig h2tg000334l have coverage of 1 (n=0 +1) with 5 somewhat overlapping reads while h2tg000335l has a coverage of 9 (n=8+1) but only 8 reads? Am I missing something? Thanks for your help and patience!
S h2tg000334l * LN:i:11407 rd:i:0 A h2tg000334l 0 - m64175e_230414_121819/59835679/ccs 0 4723 id:i:85459 HG:A:m A h2tg000334l 4063 - m64175e_230414_121819/100666093/ccs 0 3814 id:i:142013 HG:A:m A h2tg000334l 5099 + m64175e_230414_121819/14877872/ccs 0 3590 id:i:21584 HG:A:m A h2tg000334l 6681 + m64175e_230414_121819/75500103/ccs 0 3734 id:i:107579 HG:A:m A h2tg000334l 7463 - m64175e_230414_121819/148113359/ccs 0 3944 id:i:205105 HG:A:m S h2tg000335l * LN:i:11276 rd:i:8 A h2tg000335l 0 + m64175e_230414_121819/163512842/ccs 0 5386 id:i:226118 HG:A:m A h2tg000335l 146 + m64175e_230414_121819/117310622/ccs 0 5366 id:i:164256 HG:A:m A h2tg000335l 1282 - m64175e_230414_121819/96929260/ccs 0 5127 id:i:136940 HG:A:m A h2tg000335l 2640 - m64175e_230414_121819/116656107/ccs 0 4364 id:i:163405 HG:A:m A h2tg000335l 3932 - m64175e_230414_121819/81986565/ccs 0 3119 id:i:116413 HG:A:m A h2tg000335l 4128 + m64175e_230414_121819/14878721/ccs 0 3418 id:i:21607 HG:A:m A h2tg000335l 5376 - m64175e_230414_121819/65339592/ccs 0 4843 id:i:93361 HG:A:m A h2tg000335l 5876 - m64175e_230414_121819/118162350/ccs 0 5400 id:i:165389 HG:A:m
@jwinternitz I think the .gfa file is reporting only the reads used to built the assembly. duplicated or completely overlapping reads are not present. I tried to draw a plot with the read coverage and it was basically flat, even in the contigs that are marked with high coverage values
Ok, that makes sense. I was just hoping there may have been an output file with all the reads assigned to each contig, to check for potential chimeric contigs based on coverage regions across the sequence. I know that one can map the reads back to contigs using, for example, minimap2, but I'm not confident the same reads used to construct the contig would be present. Thanks for your reply!
Me too! I was looking for that, we probably need to align the reads again on the assembly to get the detailed coverage base by base
And then the question is, should we filter the mapped reads to keep only primary alignments, excluding secondary and supplementary alignments? I think so, but it would be nice to confirm with the alignments used by hifiasm to construct the contigs.
I agree with you, detailed coverage information could be interesting for the analysis of tandem repeat regions and rDNA