bcbb
bcbb copied to clipboard
GFF3 parent / id remapping issue.
Just filing it since even though this library doesn't seem to see a lot of activity lately.
Script:
#!/usr/bin/env python
import sys
from BCBio import GFF
from Bio import SeqIO
seq_dict = SeqIO.to_dict(SeqIO.parse(sys.argv[1], "fasta"))
# Parse GFF3 records
for record in GFF.parse(sys.argv[2], base_dict=seq_dict):
print record.id
for feature in record.features:
print feature.id, [x.id for x in feature.sub_features]
Input files:
>ctgA
ACTG
>ctgB
ACTG
##gff-version 3
##sequence-region ctgA 1 17039
ctgA glimmer gene 39 209 . + . ID=orf00001
ctgA Glimmer3 CDS 39 209 . + 0 ID=cds_orf00001;Parent=orf00001
ctgA glimmer gene 215 637 . + . ID=orf00002
ctgA Glimmer3 CDS 215 637 . + 0 ID=cds_orf00002;Parent=orf00002
ctgA glimmer gene 637 3588 . + . ID=orf00003
ctgA Glimmer3 CDS 637 3588 . + 0 ID=cds_orf00003;Parent=orf00003
ctgA glimmer gene 3873 4085 . - . ID=orf00004
ctgA Glimmer3 CDS 3873 4085 . - 0 ID=cds_orf00004;Parent=orf00004
##gff-version 3
##sequence-region ctgB 1 17040
ctgB glimmer gene 21 209 . + . ID=orf00001
ctgB Glimmer3 CDS 21 209 . + 0 ID=cds_orf00001;Parent=orf00001
ctgB glimmer gene 215 637 . + . ID=orf00002
ctgB Glimmer3 CDS 215 637 . + 0 ID=cds_orf00002;Parent=orf00002
ctgB glimmer gene 637 3588 . + . ID=orf00003
ctgB Glimmer3 CDS 637 3588 . + 0 ID=cds_orf00003;Parent=orf00003
ctgB glimmer gene 3873 4085 . - . ID=orf00004
ctgB Glimmer3 CDS 3873 4085 . - 0 ID=cds_orf00004;Parent=orf00004
Actual Output:
ctgA
orf00001 ['cds_orf00001']
orf00002 ['cds_orf00002', 'cds_orf00002']
orf00003 ['cds_orf00003', 'cds_orf00003']
orf00004 ['cds_orf00004', 'cds_orf00004']
ctgB
orf00001_2 ['cds_orf00001']
orf00002 []
orf00003 []
orf00004 []
Expected Output:
ctgA
orf00001 ['cds_orf00001']
orf00002 ['cds_orf00002']
orf00003 ['cds_orf00003']
orf00004 ['cds_orf00004']
ctgB
orf00001 ['cds_orf00001']
orf00002 ['cds_orf00002']
orf00003 ['cds_orf00003']
orf00004 ['cds_orf00004']
Eric; Sorry about the issue. Have you tried looking at parsing in gffutils (https://github.com/daler/gffutils)? It's a better representation of the approach used here and is well maintained by Ryan. I've been generally pointing folks there to work on a single approach for parsing GFF in Python. Hope that works for you.
@chapmanb no problem. Yeah, I saw that comment from you on another issue and started to look into it, but 100% of our tooling is based on your lib, so it'll be a bit of a migration process. And yes, that'll work fine for us as a solution.