dropSeqPipe icon indicating copy to clipboard operation
dropSeqPipe copied to clipboard

Arabidopsis annotation missing transcipt_name

Open vuzun opened this issue 4 years ago • 2 comments

Hi,

I encountered an error after the ReduceGTF step in the pipeline - Invalid GTF line and missing transcript_name.

I am running it on Arabidopsis data, so I changed the rule that fetches the annotation to point to ftp://ftp.ensemblgenomes.org/pub/plants/release-47/gtf/arabidopsis_thaliana, but it looks like the Arabidopsis annotation doesn't have the transcript_name field like the human does.

Do you know if there a way to use the id instead of the name somehow? Or anything else that would work?

Thanks for the tool and all the help around here!

vuzun avatar Jul 09 '20 12:07 vuzun

Hi vuzu,

I came across the same issue which seems to come down to a flawed ensemble annotation and dropseq-tools requiring transcript_name to generate the refFlat. Apparently they didn't have non human and mouse samples in mind.

I cobbled together a workaround, which is I still have to test The idea is to copy the the gene_id as transcript_name when it is absent. I'm happy to share the full workflow later (have to run now), but find at least the convert snippet below as a start:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $10;
        else print $0
        }' > tmp.gtf
# this is still run by the pipeline?
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

seb-mueller avatar Jul 09 '20 13:07 seb-mueller

As an update to the above. I've found a lot of genes being excluded because they have the same name. Drop-Seq-tools excludes them by default. It turns out, transcript_name needs to be as unique as possible. In the snippet above, field $10 was used which corresponds to gene_id, however this removes all the isoforms since they share the same gene_id. Using transcript_id in field $12 turns out to prevent that from happening. So ultimately, the following should be used instead:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $12;
        else print $0
        }' > tmp.gtf
# replace names containing semicolons with underscores since they dropseqtools doesn't handle them well:
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

seb-mueller avatar Aug 10 '20 13:08 seb-mueller