
agat_sp_extract_sequences.pl does not incorporate CDS feature ID in headers

Open jdcla opened this issue 2 months ago • 5 comments

Describe the bug: According to the documentation, the headers created by the script have the format:

ID gene=gene_ID name=NAME seq_id=Chromosome_ID type=cds 5'extra=VALUE

However, when the script is used to extract the sequences of CDS features, the header IDs contain the ID of the parent mRNA feature rather than the ID of the selected CDS feature.

For example, the output contains

>transcript:ENST00000399012 gene=gene:ENSG00000182378 seq_id=X type=cds

instead of

>CDS:ENSP00000431562 gene=gene:ENSG00000182378 seq_id=X type=cds

General (please complete the following information): v1.4, Singularity, Ubuntu Linux

To Reproduce: Run the script on any GFF3 file whose CDS features carry an ID attribute.

E.g., using a file from https://ftp.ensembl.org/pub/release-111/gff3/homo_sapiens/:

agat_sp_extract_sequences.pl -g Homo_sapiens.GRCh38.110.gff3 -f Homo_sapiens.GRCh38.dna.primary_assembly.fa -o cdss.fa -t cds

Expected behavior: The header should use the CDS ID rather than the transcript/mRNA ID.
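
Until the script behaves this way, the headers can be rewritten after extraction. The following is a minimal sketch of such a workaround, not part of AGAT: it assumes each CDS line in the GFF3 file has both an ID and a Parent attribute, and that the first token of each FASTA header is the parent transcript ID. The function names (parse_attributes, transcript_to_cds_id, relabel_headers) are hypothetical. If a transcript has several CDS IDs, only the first one seen is used.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcript_to_cds_id(gff_lines):
    """Map each parent (transcript) ID to the first CDS ID found under it."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            # Parent may list several IDs separated by commas.
            for parent in attrs["Parent"].split(","):
                mapping.setdefault(parent, attrs["ID"])
    return mapping


def relabel_headers(fasta_lines, mapping):
    """Yield FASTA lines with the leading header ID replaced by the CDS ID."""
    for line in fasta_lines:
        if line.startswith(">"):
            head, sep, rest = line[1:].rstrip("\n").partition(" ")
            # Headers without a known CDS ID are left unchanged.
            line = ">" + mapping.get(head, head) + sep + rest + "\n"
        yield line
```

In practice, the two input iterables would be open file handles for the GFF3 file and for the cdss.fa file produced by the command above, with the relabelled lines written to a new FASTA file.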

Additional context: Somewhat off-topic, but I was trying to apply this tool to GFF3 files that contain multiple CDS IDs per mRNA (multicistronic). This does not currently seem to be supported.
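
To check whether a given file is affected by the multicistronic case, the distinct CDS IDs under each parent can be counted. This is a rough sketch under the same assumptions (CDS lines with ID and Parent attributes); the function names are hypothetical and not part of AGAT. CDS segments that belong to one coding sequence share an ID in GFF3, so a parent with more than one distinct CDS ID is treated here as multicistronic.

```python
from collections import defaultdict


def cds_ids_per_parent(gff_lines):
    """Collect the set of distinct CDS IDs under each parent (mRNA) ID."""
    ids = defaultdict(set)
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Parse the attribute column into key/value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                ids[parent].add(attrs["ID"])
    return ids


def multicistronic_parents(gff_lines):
    """Return parents that have more than one distinct CDS ID."""
    return {
        parent: sorted(cds_ids)
        for parent, cds_ids in cds_ids_per_parent(gff_lines).items()
        if len(cds_ids) > 1
    }
```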

jdcla, Apr 08 '24 19:04