how_are_we_stranded_here

Can't find any of the first 10 BED transcript_ids in fasta file

Open eggrandio opened this issue 5 months ago • 0 comments

Hi,

I am getting the

Can't find any of the first 10 BED transcript_ids in fasta file

error while the transcript_ids of the bed file are clearly in the transcript fasta file.

I parsed the .gtf file with 'agat_convert_sp_gff2gtf.pl' and wrote a custom bash script that keeps only the transcript_id as the FASTA header, so the headers contain nothing else. Here is a sample of the GTF, BED and FASTA files:

GTF:

#!genome-build DOE Joint Genome Institute Ccitriodora_v2_1
#!genome-version Ccitriodora_v2_1
#!genome-date 2020-10
#!genome-build-accession GCA_014858505.1
#!genebuild-last-updated 2021-09
1	Ccitriodora_v2_1	region	1	31549113	.	.	.	gene_id "region:1"; Alias "CM026410.1"; ID "region:1";
1	JGI	gene	17519	18454	.	+	.	gene_id "gene-BT93_A0001"; ID "gene:gene-BT93_A0001"; biotype "protein_coding"; logic_name "genemodel_jgi";
1	JGI	mRNA	17519	18454	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; Parent "gene:gene-BT93_A0001"; biotype "protein_coding"; tag "Ensembl_canonical";
1	JGI	exon	17519	17859	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "2"; ensembl_phase "0"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; rank "1";
1	JGI	exon	17879	17949	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "1"; ensembl_phase "2"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; rank "2";
1	JGI	exon	18051	18454	.	+	.	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "0"; ensembl_phase "1"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; rank "3";
1	JGI	CDS	17519	17859	.	+	0	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1	JGI	CDS	17879	17949	.	+	1	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1	JGI	CDS	18051	18454	.	+	2	gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";

BED:

1	17518	18454	rna-gnl|WGS:JABURB|Cocit.A0001.1	0	+	17518	18454	0	3	341,71,404,	0,360,532,
1	31589	32512	rna-gnl|WGS:JABURB|Cocit.A0002.1	0	+	31589	32512	0	2	415,407,	0,516,
1	46163	47204	rna-gnl|WGS:JABURB|Cocit.A0003.1	0	+	46163	47204	0	2	441,475,	0,566,
1	62155	63760	rna-gnl|WGS:JABURB|Cocit.A0004.1	0	+	62155	63760	0	2	421,407,	0,1198,
1	64048	68743	rna-gnl|WGS:JABURB|Cocit.A0005.1	0	-	64048	68743	0	5	352,116,158,74,1541,	0,1195,1836,2993,3154,
1	70379	70571	rna-gnl|WGS:JABURB|Cocit.A0006.1	0	-	70379	70571	0	1	192,	0,
1	76482	79181	rna-gnl|WGS:JABURB|Cocit.A0007.1	0	+	76482	79181	0	3	360,387,570,	0,1132,2129,
1	76482	79181	rna-gnl|WGS:JABURB|Cocit.A0007.2	0	+	76482	79181	0	3	360,309,570,	0,1132,2129,
1	79626	88276	rna-gnl|WGS:JABURB|Cocit.A0008.1	0	-	79626	88276	0	10	551,102,86,272,92,138,71,181,157,243,	0,1512,1761,2584,4213,4387,5882,6738,7027,8407,
1	90394	103967	rna-gnl|WGS:JABURB|Cocit.A0009.1	0	+	90394	103967	0	17	450,91,119,191,211,241,416,109,200,350,97,219,240,210,286,197,882,	0,603,924,2440,2718,4600,5837,7150,7535,7867,8376,8552,9331,10114,11699,12126,12691,
1	111647	117022	rna-gnl|WGS:JABURB|Cocit.A0010.1	0	+	111647	117022	0	8	264,129,855,246,24,141,48,471,	0,1487,1745,3106,3458,3742,4009,4904,

FASTA:

>rna-gnl|WGS:JABURB|Cocit.A0001.1
ATGACAGCCCTCAAGCTCAAGAAGCTCCTCCTGACCGCCATCGCGGTCGCTGGGATCGTTGTCTCTGCTCTGCCTGACACCGCCTCGGCCCAGAACTGCGGGTGTGCAGCCAACC

Maybe it is because of the symbols in the transcript names?
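
One way to narrow this down outside the tool is to compare the IDs directly. The following is a minimal Python sketch, not how_are_we_stranded_here's own lookup (its internals are not shown in this issue). The helper names `bed_ids`, `fasta_ids` and `special_chars` are made up for illustration, and the sample strings are copied from the files above. It checks that the first BED names are present as FASTA headers, lists the non-alphanumeric characters in an ID, and shows how an unescaped `|` changes the meaning of an ID if anything treats it as a regular expression.

```python
import re

def bed_ids(bed_text, limit=10):
    """Return the name column (4th field) of the first `limit` BED records."""
    ids = []
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) >= 4:
            ids.append(fields[3])
        if len(ids) == limit:
            break
    return ids

def fasta_ids(fasta_text):
    """Return the set of FASTA IDs (header text up to the first whitespace)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids

def special_chars(identifier):
    """List characters other than letters, digits, '_', '.' and '-'."""
    return sorted(set(re.findall(r"[^A-Za-z0-9_.\-]", identifier)))

# Sample data copied from the issue (first BED record and first FASTA header).
bed_sample = "1\t17518\t18454\trna-gnl|WGS:JABURB|Cocit.A0001.1\t0\t+\t17518\t18454\t0\t3\t341,71,404,\t0,360,532,\n"
fasta_sample = ">rna-gnl|WGS:JABURB|Cocit.A0001.1\nATGACAGCCCTCAAGCTCAAGAAG\n"

first_ids = bed_ids(bed_sample)
headers = fasta_ids(fasta_sample)
missing = [tid for tid in first_ids if tid not in headers]

tid = first_ids[0]
# As a plain string the ID matches itself; as an unescaped regex the '|'
# becomes alternation, so the pattern no longer matches the full ID.
matches_as_regex = re.fullmatch(tid, tid) is not None
matches_escaped = re.fullmatch(re.escape(tid), tid) is not None

print("missing from FASTA:", missing)
print("special characters:", special_chars(tid))
print("matches as raw regex:", matches_as_regex)
print("matches when escaped:", matches_escaped)
```

If the exact string comparison finds every ID but the tool still fails, that would be consistent with the `|` or `:` characters being interpreted somewhere along the way (for example by a shell or a pattern match), though this sketch alone cannot confirm it.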

eggrandio • Aug 30 '24 14:08