GFF3 cannot be recognized

Open sanyalab opened this issue 1 year ago • 4 comments

Hi,

The tool says that it can work with GFF3, but it only works with GTF. Can we get GFF3 support?

This is the error I get when I provide a GFF3-formatted file with the --genedb option:

2024-09-19 11:35:13,297 - ERROR - Input GTF seems to be corrupted (see warnings above).
2024-09-19 11:35:13,297 - ERROR - An attempt to correct this GTF was made, the result is written to dummy.corrected.gff3
2024-09-19 11:35:13,297 - ERROR - NB! some transcript / gene ids in the corrected annotation are modified.
2024-09-19 11:35:13,297 - ERROR - Provide a correct GTF by fixing the original input GTF or checking the corrected one.

Do you consume the gene annotations in GTF format or BED12 format? Is it OK to provide a BED12 file directly?

Thanks,
Abhijit

sanyalab, Sep 20 '24 02:09

Dear @sanyalab

IsoQuant does support both GTF and GFF, but not BED. Could you send me the entire isoquant.log file? Also, you can try running IsoQuant with --no_gtf_check.

Best,
Andrey

andrewprzh, Sep 20 '24 09:09

Hi Andrey,

I actually went ahead and converted the GFF3 to a geneDB format using gffutils, as a preprocessing step. It seems to be running fine now. The isoquant.log file is 152 MB in size, so I cannot upload it. But here are the first 10 lines and the last 10.

FIRST:

Command line: isoquant.py --reference genome.fa --genedb Annotation.gff3 --fastq Sample1.flnc.fastq Sample2.flnc.fastq Sample3.flnc.fastq Sample4.flnc.fastq --output FL_ALL --prefix OUT --data_type pacbio_ccs --fl_data --threads 24 --check_canonical --sqanti_output --matching_strategy precise --splice_correction_strategy default_pacbio --model_construction_strategy fl_pacbio
2024-09-19 11:34:28,180 - INFO - Running IsoQuant version 3.5.0
2024-09-19 11:34:28,222 - INFO -  === IsoQuant pipeline started ===
2024-09-19 11:34:28,222 - INFO - gffutils version: 0.13
2024-09-19 11:34:28,223 - INFO - pysam version: 0.22.1
2024-09-19 11:34:28,223 - INFO - pyfaidx version: 0.8.1.1
2024-09-19 11:34:28,228 - INFO - Checking input gene annotation
2024-09-19 11:34:29,316 - WARNING - Malformed GTF line 2 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,316 - WARNING - Chr00	GSAP	gene	151	2235	.	+	.	ID=dummy1;Name=dummy1
2024-09-19 11:34:29,316 - WARNING - Malformed GTF line 3 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00	GSAP	mRNA	151	2235	.	+	ID=dummy1.1;Parent=dummy1;Name=dummy1.1
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 4 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00	GSAP	exon	151	2235	.	+	.	ID=dummy1.1.exon1;Parent=dummy1.1
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 5 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00	GSAP	CDS	151	2235	.	+	0	ID=dummy1.1.cds1;Parent=dummy1.1
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 6 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00	GSAP	gene	2412	4316	.	+	.	ID=dummy2;Name=dummy2
2024-09-19 11:34:29,317 - WARNING - Malformed GTF line 7 (gene_id attribute value cannot be found)
2024-09-19 11:34:29,317 - WARNING - Chr00	GSAP	mRNA	2412	4316	.	+	.	ID=dummy2.1;Parent=dummy2;Name=dummy2.1

LAST:

2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638230 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26	GSAP	exon	1450283	1450513	.	+	.	ID=dummy6432.1.exon1;Parent=dummy6432.1
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638231 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26	GSAP	CDS	1450283	1450513	.	+	0	ID=dummy6432.1.cds1;Parent=dummy6432.1
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638232 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26	GSAP	gene	1465536	1465607	.	-	.	ID=dummy6433;Name=dummy6433
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638233 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26	GSAP	mRNA	1465536	1465607	.	-	.	ID=dummy6433.1;Parent=dummy6433;Name=dummy6433.1
2024-09-19 11:35:13,258 - WARNING - Malformed GTF line 638234 (gene_id attribute value cannot be found)
2024-09-19 11:35:13,258 - WARNING - Chr26	GSAP	exon	1465536	1465607	.	-	.	ID=dummy6433.1.exon1;Parent=dummy6433.1
2024-09-19 11:35:13,297 - ERROR - Input GTF seems to be corrupted (see warnings above).
2024-09-19 11:35:13,297 - ERROR - An attempt to correct this GTF was made, the result is written to /Path/FL_ALL/Annotation.corrected.gff3
2024-09-19 11:35:13,297 - ERROR - NB! some transcript / gene ids in the corrected annotation are modified.
2024-09-19 11:35:13,297 - ERROR - Provide a correct GTF by fixing the original input GTF or checking the corrected one.

It's not recognizing the GFF3 file.
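
The warnings above are consistent with a GTF-style check being applied to GFF3 lines: the ninth column of the file uses GFF3 `key=value` pairs (`ID=...;Parent=...`), so a lookup for a GTF-style `gene_id "..."` attribute finds nothing. The sketch below illustrates that difference on lines taken from the log. It is not IsoQuant's checker; `guess_dialect` and `find_gtf_gene_id` are hypothetical helper names, and the regular expressions are a simplification of both formats.

```python
import re

# Simplified patterns (assumptions, not the full GTF/GFF3 grammars).
GFF3_ATTR = re.compile(r'^[\w.-]+=')              # e.g. ID=dummy1;Name=dummy1
GTF_ATTR = re.compile(r'^\w+\s+"')                # e.g. gene_id "dummy1";
GTF_GENE_ID = re.compile(r'gene_id\s+"([^"]*)"')


def guess_dialect(line):
    """Return 'gff3', 'gtf', or 'unknown' for a single annotation line."""
    if line.startswith("#"):
        return "unknown"
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return "unknown"
    attrs = fields[8].strip()
    if GFF3_ATTR.match(attrs):
        return "gff3"
    if GTF_ATTR.match(attrs):
        return "gtf"
    return "unknown"


def find_gtf_gene_id(line):
    """Return the GTF-style gene_id value, or None if there is none."""
    match = GTF_GENE_ID.search(line)
    return match.group(1) if match else None


# A line from the log above (GFF3) and a hand-written GTF equivalent.
gff3_line = "Chr00\tGSAP\tgene\t151\t2235\t.\t+\t.\tID=dummy1;Name=dummy1"
gtf_line = 'Chr00\tGSAP\tgene\t151\t2235\t.\t+\t.\tgene_id "dummy1";'

print(guess_dialect(gff3_line), find_gtf_gene_id(gff3_line))
print(guess_dialect(gtf_line), find_gtf_gene_id(gtf_line))
```

For the GFF3 line, the GTF-style lookup comes back empty, which matches the "gene_id attribute value cannot be found" wording in the warnings.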

sanyalab, Sep 20 '24 13:09

@sanyalab

Thanks a lot! I will add GFF3 support to the internal checker. So if gffutils converted it, you can run IsoQuant with --no_gtf_check as well.
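
For readers on a version older than 3.6.1, the two workarounds mentioned in this thread can be sketched roughly as follows. This is a minimal sketch, not the exact commands used above: the gffutils call the reporter ran is not shown in the thread, so the `create_db` options here (in particular `merge_strategy="create_unique"`) are assumptions, and `build_gene_db`, `isoquant_args`, and the file name `Annotation.db` are hypothetical. The command-line flags are the ones that appear in the log and in the replies.

```python
def build_gene_db(gff3_path, db_path):
    """Convert a GFF3 annotation into a gffutils database file."""
    # Imported here so the command builder below works without gffutils installed.
    import gffutils

    return gffutils.create_db(
        gff3_path,
        dbfn=db_path,
        force=True,                      # overwrite an existing database file
        keep_order=True,                 # keep attribute order as in the input
        merge_strategy="create_unique",  # assumption: rename duplicate IDs instead of failing
    )


def isoquant_args(reference, genedb, fastq_files, output, no_gtf_check=False):
    """Build an IsoQuant argument list from flags that appear in this thread."""
    args = [
        "isoquant.py",
        "--reference", reference,
        "--genedb", genedb,
        "--fastq", *fastq_files,
        "--data_type", "pacbio_ccs",
        "--output", output,
    ]
    if no_gtf_check:
        args.append("--no_gtf_check")
    return args


# Workaround 1: preprocess with gffutils, then pass the database to --genedb.
#   build_gene_db("Annotation.gff3", "Annotation.db")
db_cmd = isoquant_args("genome.fa", "Annotation.db", ["Sample1.flnc.fastq"], "FL_ALL")

# Workaround 2: pass the GFF3 directly and skip the annotation check.
raw_cmd = isoquant_args("genome.fa", "Annotation.gff3", ["Sample1.flnc.fastq"],
                        "FL_ALL", no_gtf_check=True)

print(" ".join(db_cmd))
print(" ".join(raw_cmd))
```

Either argument list can be passed to `subprocess.run` once the paths point at real files.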

andrewprzh, Sep 20 '24 13:09

GFF3 should work in IsoQuant 3.6.1 without warnings.

andrewprzh, Sep 25 '24 13:09