Cuffmerge GFF Error: duplicate/invalid 'transcript'

Open aswarren opened this issue 7 years ago • 13 comments

The problem I am finding seems to be from Cufflinks v2.2.1 adding the same transcript to transcripts.gtf twice.

I am running a test with data from the following HISAT paper. http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html ftp://ftp.ccb.jhu.edu/pub/RNAseq_protocol

Using the following samples ERR188044,ERR188383,ERR204916

Using human genome sequence and annotation ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.33_GRCh38.p7

I am running HISAT2 alignment with the --dta-cufflinks option. I then run cufflinks on each of the resulting alignments like this:

cufflinks -g ./GCF_000001405.33_GRCh38.p7.gff -b ./GCF_000001405.33_GRCh38.p7_genomic.fna -I 50 -p 8 ./ERR204916_chrX_1.fastq_ERR204916_chrX_2.fastq.bam

Finally, I run cuffmerge on all three transcripts.gtf files:

cuffmerge -g ./GCF_000001405.33_GRCh38.p7.gff -p 8 -o ./merged_annotation ./gtf_manifest.txt

It's giving me this error:

[Wed Oct 5 01:55:28 2016] Converting GTF files to SAM
[01:55:28] Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=id232805

This makes sense, since the following line appears twice in the cufflinks output file ./ERR188044/replicate1/transcripts.gtf:

NC_000002.12 Cufflinks transcript 91814764 91818316 1 + . gene_id "CUFF.272"; transcript_id "id232805"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_read_support "no";

Let me know if you need any other information. Thanks.

aswarren avatar Oct 05 '16 18:10 aswarren
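
One way to confirm the duplication described above is to scan a transcripts.gtf file for transcript_id values that appear on more than one "transcript" feature line. The following is a minimal sketch, not part of Cufflinks: the function name find_duplicate_transcripts is hypothetical, and it assumes a tab-separated GTF with the feature type in column 3 and the attributes in column 9.

```shell
# Hypothetical helper (not a Cufflinks tool): print each transcript_id that
# occurs on more than one "transcript" feature line of a GTF file.
# Assumes tab-separated columns: feature type in $3, attributes in $9.
find_duplicate_transcripts() {
    awk -F'\t' '$3 == "transcript" {
        # Locate the transcript_id "..." attribute inside column 9.
        if (match($9, /transcript_id "[^"]+"/)) {
            # Skip the 15-character prefix and drop the closing quote.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            # Report an ID once, the second time it is seen.
            if (++seen[id] == 2) print id
        }
    }' "$1"
}
```

It could be called as find_duplicate_transcripts ./ERR188044/replicate1/transcripts.gtf; any ID it prints is one that the reference-annotation loader would be expected to reject as a duplicate.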

FWIW this seems to be present on the following threads: https://biostar.usegalaxy.org/p/17359/ https://www.biostars.org/p/119915/ https://www.biostars.org/p/187275/ https://www.biostars.org/p/155160/

aswarren avatar Oct 05 '16 19:10 aswarren

I encounter the same problem. I've created a repository with a fully functional minimal example here: https://github.com/paulklemm/cuffmerge_bug.

The problem occurs when using the GFF3 file from Ensembl. The GTF file from the same release works fine. I've found that the content of the files also differs, which may cause the problem. See the repo for further explanation.

I can also add a couple of Threads describing the problem:

I'm not sure whether this is a problem with Cufflinks or with Ensembl. Can one of the authors please comment?

paulklemm avatar Oct 26 '16 13:10 paulklemm

It looks like a problem in the (old) GFF parsing code used there, which stumbles on some of the recent transcript definitions. I am going to submit a fix to the develop branch soon. Thank you for posting the minimal example; it's very useful for tracking down the problem. A silly workaround for now would be to run the latest version of gffread (not the one included with Cufflinks, but the updated one from http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread) to "convert" the Ensembl file to a more "normalized" GFF or GTF -- but of course this is pointless, as one can simply use the Ensembl GTF instead. I'll also update the related gffread code bundled with Cufflinks -- again, all my patches will be submitted to the develop branch here soon.

gpertea avatar Oct 26 '16 13:10 gpertea

Just pushed the update in the develop branch. However that is NOT a stable branch for production use so I also prepared a source update tarball for the old official Cufflinks 2.2.1 source distribution to bring the GFF parser code, gffread and cuffcompare up to date. cufflinks-2.2.1.gff_patch.tar.gz

Alternate download in case the attachment doesn't work: http://ccb.jhu.edu/dl/cufflinks-2.2.1.gff_patch.tar.gz

This tarball should simply be unpacked inside the cufflinks-2.2.1 directory, something like this:

#unpack the 2.2.1 source tarball first:
tar xvfz ~/Downloads/cufflinks-2.2.1.tar.gz
cd cufflinks-2.2.1
#unpack the patch files
tar xvfz ~/Downloads/cufflinks-2.2.1.gff_patch.tar.gz
#configure and build it
./configure ....
make install

I know it can be difficult to build Cufflinks on newer Linux systems. If there is demand, I can rebuild v2.2.1 with this patch applied on an older RHEL 5 machine and provide the Linux binary tarball.

Please note: this patch hasn't been seriously tested, but it should not affect anything in the Cufflinks' internals -- it is only meant to fix the auxiliary code related to GFF/GTF (reference annotation) parsing while also updating gffread and cuffcompare utilities.

gpertea avatar Oct 27 '16 13:10 gpertea

Wow, that was fast, thanks a lot. Maybe you should forward that info also to the threads mentioned above.

It would be great if you can provide the patched Linux binary.

paulklemm avatar Oct 27 '16 14:10 paulklemm

OK, statically linked Linux build here: http://ccb.jhu.edu/dl/cufflinks-2.2.1-gffpatch.Linux_x86_64.tar.gz

I haven't gotten around to testing it on your full GRCm38.86 example, since getting those files and building the hisat2 index takes a while -- but if you already have all of those and can quickly check whether the patched Linux binaries I just provided work fine (or not), that would be great!

gpertea avatar Oct 27 '16 14:10 gpertea

Thanks for the binaries. I will do that tomorrow.

paulklemm avatar Oct 27 '16 14:10 paulklemm

I've just finished running your example_workflow_gff3.sh and it was successful with the patched binaries. I'll run the GTF version next (just from the cufflinks commands of course), to make sure that it still works and that the results are identical.

gpertea avatar Oct 27 '16 16:10 gpertea

I've got a similar problem working with Mus_musculus.GRCm38.86.gtf.gz, downloaded from Ensembl, using cufflinks OR cuffmerge. Either one errors out trying to load the GTF:

Loading reference annotation.
GFF Error: duplicate/invalid 'transcript' feature ID=ENSMUST00000117299
[FAILED]
Error: could not execute gtf_to_sam.

The problem showed up running cufflinks v2.2.0. I downloaded the patched binary and am running it, but I still get the identical error. I know I am running the new binary because, when I call cufflinks, I get:

Cuff/cufflinks-2.2.1-gffpatch.Linux_x86_64/cufflinks: /lib64/libz.so.1: no version information available (required by Cuff/cufflinks-2.2.1-gffpatch.Linux_x86_64/cufflinks)
cufflinks v2.2.1 linked against Boost version 105900

The full command was: cufflinks-2.2.1-gffpatch.Linux_x86_64/cuffmerge -o merged.gtf -g Mus_musculus.GRCm38.86.gtf -p 12 -s genomes/mm10_total.fa CuffgtfFiles.txt

hooperj avatar Dec 03 '16 22:12 hooperj

It looks like it's the new "Selenocysteine" feature lines added by Ensembl that are now confusing the parser. I would suggest using something like grep -vP '\tSelenocysteine\t' to remove them; they lie within the exon spans, so eliminating these lines loses no information for the analysis here. Yuck, I have to fix the parser to ignore these lines.

gpertea avatar Dec 03 '16 22:12 gpertea
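
To see whether an annotation file contains the feature type mentioned above, one option is to tally the values in column 3. This is a rough sketch under stated assumptions: count_feature_types is a hypothetical helper name, the input is assumed to be a tab-separated GTF/GFF with nine columns, and a .gz suffix is taken to mean gzip compression.

```shell
# Hypothetical helper: count how often each feature type (column 3) occurs
# in a GTF/GFF file. Comment lines starting with '#' are skipped.
# Files ending in .gz are decompressed on the fly.
count_feature_types() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac |
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) print n[t] "\t" t }' |
    sort -k2
}
```

Running count_feature_types Mus_musculus.GRCm38.86.gtf.gz would print one line per feature type with its count; a Selenocysteine row in that output indicates the file contains the lines discussed here.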

Yes, I saw that Selenocysteine was in the attribute field of several of the offending genes.

I'm an idiot on the command line. Would the full command be:

grep -vP '\tSelenocysteine\t' myNastygtf > myGoodgtf

hooperj avatar Dec 03 '16 22:12 hooperj

yes, that would work. You can use the compressed file directly (that's what I did):

zcat Mus_musculus.GRCm38.86.gtf.gz | grep -vP '\tSelenocysteine\t' > GRCm38.86.bye_seleno.gtf

gpertea avatar Dec 03 '16 22:12 gpertea
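
The grep -P option used above is not available in every grep (the BSD grep shipped with macOS lacks it, for instance). As a hedged alternative, the same filtering can be sketched with awk. The function name strip_selenocysteine is hypothetical, and unlike the grep pattern it tests only column 3 (the feature type), which is assumed to be where Ensembl puts this label.

```shell
# Hypothetical helper: drop every line whose feature type (column 3) is
# exactly "Selenocysteine". All other lines, including '#' comment lines,
# pass through unchanged. Reads the named files, or stdin if none are given.
strip_selenocysteine() {
    awk -F'\t' '$3 != "Selenocysteine"' "$@"
}
```

For the compressed Ensembl file this could be used as gzip -dc Mus_musculus.GRCm38.86.gtf.gz | strip_selenocysteine > GRCm38.86.bye_seleno.gtf, after which the filtered file is passed to cufflinks or cuffmerge with -g in place of the original.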

Hi, I was wondering whether this issue got fixed. I am currently using the workaround you discussed here (really appreciate the help!). Thanks!

dgavino avatar Dec 30 '16 20:12 dgavino