cDNA_Cupcake Saturation analysis bug fixes; compatibility with newer isoseq cluster [make_file_for_subsampling_from

Saturation analysis bug fixes; compatibility with newer isoseq cluster [make_file_for_subsampling_from_collapsed.py]

Open mrendleman opened this issue 1 year ago • 0 comments

I've encountered and fixed a couple of bugs with make_file_for_subsampling_from_collapsed.py, and figured I could document them here since similar GitHub issues have helped me.

Bug 1: Overestimating gene counts for saturation with SQANTI classifications

I found that saturation curves I built using SQANTI classifications (--by refgene or refisoform) were largely overestimating gene counts compared to curves without them (--by pbgene or pbid). It turns out that for novel isoforms in novel genes, SQANTI is assigning unique gene labels regardless of whether they belong to the same pbgene. I'm not certain if it's an issue with SQANTI, my data, or maybe an undocumented change in one of the PacBio output files, but Cupcake uses the associated_gene column in the SQANTI classification file and ends up treating each novel isoform as its own gene during subsampling.

Solution

To fix it, I made a small change to the SQANTI classification parsing (for loop starting at line 63) to only use the associated_gene column for non-novel genes. For novel genes, it sets refgene to a format that matches the refisoform category more closely using the gene ID assigned by PacBio collapse; e.g. for refisoform "novel_PB.2.5" we'll have refgene "novel_PB.2":

for r in DictReader(open(sqanti_class), delimiter='\t'):
    if r['associated_transcript'] == 'novel':
        refisoform = 'novel_'+r['isoform']
    else:
        refisoform = r['associated_transcript']
            
    # for novel genes, sets refgene to a sensible gene ID
    if r['associated_gene'].startswith("novelGene"):
        # trim isoform number off isoform, use that to make novel refgene
        refgene = "novel_" + r['isoform'].rsplit(".",1)[0]
    else: # if it's not novel, use the 'associated_gene' column to set refgene
        refgene = r['associated_gene']
    match_dict[r['isoform']] = {'refgene': refgene, # use refgene var instead of always using 'associated_gene'
                                'refisoform': refisoform,
                                'category': r['structural_category']}

Bug 2: Updates to `isoseq collapse` caused `make_file_for_subsampling_from_collapse.py` to report very low counts

Sometime after isoseq collapse v3.5.0 the format of the *.abundance.txt file was changed. The count_fl column is still present, but appears to have a different meaning than it used to. Instead, there's a new column named fl_assoc that looks like it contains counts that count_fl used to have. I couldn't find documentation on this change, but I did find that the *.abundance.txt headers have some ambiguous descriptions:

old count_fl: "Number of associated FL reads"
new count_fl: "Number of input FL reads associated with isoform"
fl_assoc: "Total number of FL reads associated with isosform"

Regardless of the new meaning, the new count_fl has much lower values including a lot of 1's, so the default --min_fl_count 2 when subsampling filters out most of the isoforms.

Solution

My solution is to check if *.abundance.txt has the fl_assoc column; if it does, it uses fl_assoc. Otherwise, it uses count_fl. This allows for backwards compatibility with files collapsed using older versions of isoseq. This change is in addition to the tweak from #111 (swapping the order of the "or" statement to fix an error when using the --include_single_exons flag). Here's the changes:

Original (lines 85-87):

for r in DictReader(f, delimiter='\t'):
    if r['pbid'] in good_ids or include_single_exons:
      to_write['all'][r['pbid']] = r['count_fl']

Fixed:

count_col_name = "" # initialize count_col_name
for r in DictReader(f, delimiter='\t'):
    if count_col_name == "": # if it hasn't been set yet, set it
        count_col_name = 'fl_assoc' if 'fl_assoc' in r.keys() else 'count_fl'
    if include_single_exons or r['pbid'] in good_ids:
        to_write['all'][r['pbid']] = r[count_col_name]

I can't guarantee these will fix these problems indefinitely or in all cases, but this is what solved them for me. Hope this helps someone!

Oct 12 '23 15:10 mrendleman

cDNA_Cupcake cDNA_Cupcake copied to clipboard

Saturation analysis bug fixes; compatibility with newer isoseq cluster [make_file_for_subsampling_from_collapsed.py]

Bug 1: Overestimating gene counts for saturation with SQANTI classifications

Solution

Bug 2: Updates to isoseq collapse caused make_file_for_subsampling_from_collapse.py to report very low counts

Solution

cDNA_Cupcake
cDNA_Cupcake copied to clipboard

Bug 2: Updates to `isoseq collapse` caused `make_file_for_subsampling_from_collapse.py` to report very low counts