[feature] add per-exon identity filter to collapse_isoforms_by_sam.py

Open Magdoll opened this issue 7 years ago • 3 comments

Currently, collapse_isoforms_by_sam.py accepts a -i identity threshold, but that is the average identity over the whole alignment.

Importantly, I want to add a per-exon identity threshold to filter out cases where low-quality or error-prone subregions cause bad alignments. These alignments should be thrown out.

chr1    hg38    exon    247416157   247416193   100 +   .   ID=c14289.mrna1.exon1;Name=c14289;Parent=c14289.mrna1;Target=c14289 1 37 +
chr1    hg38    exon    247418053   247419077   99  +   .   ID=c14289.mrna1.exon2;Name=c14289;Parent=c14289.mrna1;Target=c14289 38 1062 +
chr1    hg38    exon    247423230   247423349   100 +   .   ID=c14289.mrna1.exon3;Name=c14289;Parent=c14289.mrna1;Target=c14289 1063 1182 +
chr1    hg38    exon    247423847   247425599   99  +   .   ID=c14289.mrna1.exon4;Name=c14289;Parent=c14289.mrna1;Target=c14289 1183 2935 +
chr1    hg38    exon    247434103   247434273   100 +   .   ID=c14289.mrna1.exon5;Name=c14289;Parent=c14289.mrna1;Target=c14289 2936 3106 +
chr1    hg38    exon    247442199   247442204   100 +   .   ID=c14289.mrna1.exon6;Name=c14289;Parent=c14289.mrna1;Target=c14289 3108 3113 +
chr1    hg38    exon    247442230   247442237   100 +   .   ID=c14289.mrna1.exon7;Name=c14289;Parent=c14289.mrna1;Target=c14289 3134 3141 +
chr1    hg38    exon    247444016   247444092   74  +   .   ID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1;Target=c14289 3150 3210 +
chr1    hg38    exon    247444749   247444821   87  +   .   ID=c14289.mrna1.exon9;Name=c14289;Parent=c14289.mrna1;Target=c14289 3211 3275 +
chr1    hg38    exon    247448405   247448815   100 +   .   ID=c14289.mrna1.exon10;Name=c14289;Parent=c14289.mrna1;Target=c14289 3276 3686 +

In the example above, most exons have high identity (99-100%), but two exons (exon8 at 74% and exon9 at 87%) have very low identity. This is a case where the collapse script should discard the alignment.
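A minimal sketch of what such a per-exon check could look like, written here only to illustrate the request. It assumes the sixth column of each exon record is the percent identity, as the example above suggests. The function names and the 95% cutoff are hypothetical and are not part of collapse_isoforms_by_sam.py.

```python
def exon_identities(gff_lines):
    """Return (exon_id, identity) pairs from GFF3-style exon records.

    Assumes column 6 holds the per-exon percent identity.
    """
    result = []
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split only the first 8 columns; the attribute column contains spaces.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        identity = float(fields[5])
        exon_id = fields[8].split(";")[0].replace("ID=", "", 1)
        result.append((exon_id, identity))
    return result


def passes_per_exon_identity(gff_lines, min_identity=95.0):
    """True if every exon meets the (hypothetical) per-exon cutoff.

    Note: returns True when no exon records are found.
    """
    return all(ident >= min_identity for _, ident in exon_identities(gff_lines))


# Two records taken from the example above.
example = [
    "chr1    hg38    exon    247416157   247416193   100 +   .   "
    "ID=c14289.mrna1.exon1;Name=c14289;Parent=c14289.mrna1;Target=c14289 1 37 +",
    "chr1    hg38    exon    247444016   247444092   74  +   .   "
    "ID=c14289.mrna1.exon8;Name=c14289;Parent=c14289.mrna1;Target=c14289 3150 3210 +",
]

if __name__ == "__main__":
    for exon_id, identity in exon_identities(example):
        print(exon_id, identity)
    print("keep alignment:", passes_per_exon_identity(example))
```

With a check of this kind, an alignment whose average identity is high would still be rejected if any single exon falls below the cutoff.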

--Liz

Magdoll avatar Sep 27 '17 16:09 Magdoll

Hello, if I don't add the --dun-merge-5-shorter parameter when running collapse_isoforms_by_sam.py, will the script collapse shorter 5' transcripts? If so, do I get the same results even if I don't run filter_away_subset.py? Thanks!

qiuyixmm avatar Sep 21 '18 16:09 qiuyixmm

Hi @qiuyixmm ,

Very good questions and you are correct!

case 1:

collapse_isoforms_by_sam.py 

is equivalent to

case 2:

collapse_isoforms_by_sam.py --dun-merge-5-shorter
filter_away_subset.py

You may ask, why bother with --dun-merge-5-shorter then? It's because if you then run get_abundance_post_collapse.py after the collapse step, the counts will be different. In case 1, the count for a collapsed transcript will include reads from both the longer and the shorter merged transcripts; in case 2, the counts will correctly reflect the longer and shorter transcripts as separate.
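A toy illustration of the counting difference, not the actual logic of get_abundance_post_collapse.py. The transcript names and read counts are made up for the example.

```python
# Hypothetical read support for two transcripts that differ only at the 5' end.
reads = {"long_isoform": 10, "short_5prime_isoform": 4}


def counts_case1(reads):
    """Case 1: the 5'-shorter transcript is merged into the longer one,
    so the surviving transcript carries the reads of both."""
    return {"long_isoform": sum(reads.values())}


def counts_case2(reads):
    """Case 2 (--dun-merge-5-shorter): each transcript keeps its own count."""
    return dict(reads)


if __name__ == "__main__":
    print("case 1:", counts_case1(reads))
    print("case 2:", counts_case2(reads))
```

In this toy setup the longer transcript is reported with 14 reads in case 1, but with 10 reads in case 2, where the shorter transcript is counted separately with its own 4 reads.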

Magdoll avatar Sep 21 '18 20:09 Magdoll

@Magdoll okay, it is very kind of you to give me such a detailed answer. Thanks!

qiuyixmm avatar Sep 22 '18 11:09 qiuyixmm