cDNA_Cupcake
[feature] add per-exon identity filter to collapse_isoforms_by_sam.py
Currently, collapse_isoforms_by_sam.py accepts a -i identity threshold, but that is the average identity over the whole alignment.
I want to add a per-exon identity threshold to filter out cases where low-quality or error-prone subregions of a read cause bad alignments. These alignments should be thrown out.
chr1 hg38 exon 247416157 247416193 100 + . ID=c14289.mrna1.exon1;Name=c14289;Parent=c14289.mrna1;Target=c14289 1 37 +
chr1 hg38 exon 247418053 247419077 99 + . ID=c14289.mrna1.exon2;Name=c14289;Parent=c14289.mrna1;Target=c14289 38 1062 +
chr1 hg38 exon 247423230 247423349 100 + . ID=c14289.mrna1.exon3;Name=c14289;Parent=c14289.mrna1;Target=c14289 1063 1182 +
chr1 hg38 exon 247423847 247425599 99 + . ID=c14289.mrna1.exon4;Name=c14289;Parent=c14289.mrna1;Target=c14289 1183 2935 +
chr1 hg38 exon 247434103 247434273 100 + . ID=c14289.mrna1.exon5;Name=c14289;Parent=c14289.mrna1;Target=c14289 2936 3106 +
chr1 hg38 exon 247442199 247442204 100 + . ID=c14289.mrna1.exon6;Name=c14289;Parent=c14289.mrna1;Target=c14289 3108 3113 +
chr1 hg38 exon 247442230 247442237 100 + . ID=c14289.mrna1.exon7;Name=c14289;Parent=c14289.mrna1;Target=c14289 3134 3141 +
chr1 hg38 exon 247444016 247444092 74 + . ID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1;Target=c14289 3150 3210 +
chr1 hg38 exon 247444749 247444821 87 + . ID=c14289.mrna1.exon9;Name=c14289;Parent=c14289.mrna1;Target=c14289 3211 3275 +
chr1 hg38 exon 247448405 247448815 100 + . ID=c14289.mrna1.exon10;Name=c14289;Parent=c14289.mrna1;Target=c14289 3276 3686 +
In the example above, most exons have high identity (99-100%), but two exons (exon8 at 74% and exon9 at 87%) have very low identity. This is a case where the collapse script should discard the alignment.
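A minimal sketch of what such a filter could look like, not the actual implementation in collapse_isoforms_by_sam.py. It assumes the sixth column of each exon line holds the per-exon percent identity, as in the listing above; the function names and the 95% cutoff are hypothetical and only for illustration.

```python
from collections import defaultdict


def parse_exon_identities(gff_lines):
    """Map each mRNA ID (the Parent attribute) to its list of per-exon identities."""
    identities = defaultdict(list)
    for line in gff_lines:
        # Split on any whitespace; the 9th field keeps the full attribute string.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        # Assumption: column 6 (index 5) is the per-exon percent identity.
        identities[attrs["Parent"]].append(float(fields[5]))
    return identities


def passes_per_exon_identity(exon_identities, min_exon_identity=95.0):
    """Return True only if every exon meets the (hypothetical) cutoff."""
    return all(identity >= min_exon_identity for identity in exon_identities)


# Two exon lines taken from the example alignment above.
example = [
    "chr1 hg38 exon 247442230 247442237 100 + . "
    "ID=c14289.mrna1.exon7;Name=c14289;Parent=c14289.mrna1;Target=c14289 3134 3141 +",
    "chr1 hg38 exon 247444016 247444092 74 + . "
    "ID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1;Target=c14289 3150 3210 +",
]
per_mrna = parse_exon_identities(example)
keep = {mrna: passes_per_exon_identity(ids) for mrna, ids in per_mrna.items()}
```

With this cutoff, c14289.mrna1 would be discarded because exon8 falls below it, even though the average identity across the whole alignment could still pass a -i threshold.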
--Liz
Hello, if I don't add the --dun-merge-5-shorter parameter when running collapse_isoforms_by_sam.py, will the script collapse shorter 5' transcripts? If so, will I get the same results even though I don't run filter_away_subset.py? Thanks!
Hi @qiuyixmm ,
Very good questions and you are correct!
case 1: collapse_isoforms_by_sam.py
is equivalent to
case 2: collapse_isoforms_by_sam.py --dun-merge-5-shorter, followed by filter_away_subset.py
You may ask, why bother with --dun-merge-5-shorter then? It's because if you then run get_abundance_post_collapse.py after the collapse step, the counts will be different. In case 1, the counts will include reads from both the longer and the shorter merged transcripts; in case 2, the counts will correctly reflect the longer and shorter transcripts as separate.
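To make the difference in counts concrete, here is a toy sketch with made-up read counts. It does not reproduce what get_abundance_post_collapse.py does internally; the transcript names, the numbers, and the merged_counts helper are all hypothetical.

```python
def merged_counts(read_counts, shorter_to_longer):
    """Case 1: reads of a 5'-shorter transcript are added to the longer one it merges into."""
    totals = {}
    for transcript, n_reads in read_counts.items():
        target = shorter_to_longer.get(transcript, transcript)
        totals[target] = totals.get(target, 0) + n_reads
    return totals


# Hypothetical supporting-read counts for a longer transcript and its 5'-shorter subset.
read_counts = {"long": 10, "short": 4}

# Case 1: collapse merges the shorter into the longer, so their reads are pooled.
case1 = merged_counts(read_counts, {"short": "long"})

# Case 2: with --dun-merge-5-shorter the two stay separate, so the counts stay separate.
case2 = dict(read_counts)
```

In this toy setup the merged transcript in case 1 carries the reads of both, while in case 2 the longer transcript keeps only its own reads.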
@Magdoll Okay, it is very kind of you to give me such a detailed answer. Thanks!