cDNA_Cupcake
cDNA_Cupcake copied to clipboard
Saturation analysis bug fixes; compatibility with newer isoseq cluster [make_file_for_subsampling_from_collapsed.py]
I've encountered and fixed a couple of bugs with make_file_for_subsampling_from_collapsed.py, and figured I could document them here since similar GitHub issues have helped me.
Bug 1: Overestimating gene counts for saturation with SQANTI classifications
I found that saturation curves I built using SQANTI classifications (--by refgene
or refisoform
) were largely overestimating gene counts compared to curves without them (--by pbgene
or pbid
). It turns out that for novel isoforms in novel genes, SQANTI is assigning unique gene labels regardless of whether they belong to the same pbgene
. I'm not certain if it's an issue with SQANTI, my data, or maybe an undocumented change in one of the PacBio output files, but Cupcake uses the associated_gene
column in the SQANTI classification file and ends up treating each novel isoform as its own gene during subsampling.
Solution
To fix it, I made a small change to the SQANTI classification parsing (for loop starting at line 63) to only use the associated_gene
column for non-novel genes. For novel genes, it sets refgene to a format that matches the refisoform category more closely using the gene ID assigned by PacBio collapse; e.g. for refisoform "novel_PB.2.5" we'll have refgene "novel_PB.2":
for r in DictReader(open(sqanti_class), delimiter='\t'):
if r['associated_transcript'] == 'novel':
refisoform = 'novel_'+r['isoform']
else:
refisoform = r['associated_transcript']
# for novel genes, sets refgene to a sensible gene ID
if r['associated_gene'].startswith("novelGene"):
# trim isoform number off isoform, use that to make novel refgene
refgene = "novel_" + r['isoform'].rsplit(".",1)[0]
else: # if it's not novel, use the 'associated_gene' column to set refgene
refgene = r['associated_gene']
match_dict[r['isoform']] = {'refgene': refgene, # use refgene var instead of always using 'associated_gene'
'refisoform': refisoform,
'category': r['structural_category']}
Bug 2: Updates to isoseq collapse
caused make_file_for_subsampling_from_collapse.py
to report very low counts
Sometime after isoseq collapse v3.5.0
the format of the *.abundance.txt
file was changed. The count_fl
column is still present, but appears to have a different meaning than it used to. Instead, there's a new column named fl_assoc
that looks like it contains counts that count_fl
used to have. I couldn't find documentation on this change, but I did find that the *.abundance.txt
headers have some ambiguous descriptions:
- old
count_fl
: "Number of associated FL reads" - new
count_fl
: "Number of input FL reads associated with isoform" -
fl_assoc
: "Total number of FL reads associated with isosform"
Regardless of the new meaning, the new count_fl
has much lower values including a lot of 1's, so the default --min_fl_count 2
when subsampling filters out most of the isoforms.
Solution
My solution is to check if *.abundance.txt
has the fl_assoc
column; if it does, it uses fl_assoc
. Otherwise, it uses count_fl
. This allows for backwards compatibility with files collapsed using older versions of isoseq. This change is in addition to the tweak from #111 (swapping the order of the "or" statement to fix an error when using the --include_single_exons flag). Here's the changes:
Original (lines 85-87):
for r in DictReader(f, delimiter='\t'):
if r['pbid'] in good_ids or include_single_exons:
to_write['all'][r['pbid']] = r['count_fl']
Fixed:
count_col_name = "" # initialize count_col_name
for r in DictReader(f, delimiter='\t'):
if count_col_name == "": # if it hasn't been set yet, set it
count_col_name = 'fl_assoc' if 'fl_assoc' in r.keys() else 'count_fl'
if include_single_exons or r['pbid'] in good_ids:
to_write['all'][r['pbid']] = r[count_col_name]
I can't guarantee these will fix these problems indefinitely or in all cases, but this is what solved them for me. Hope this helps someone!