bioperl-live
bioperl-live copied to clipboard
parsing segmented Genbank records
Hi there,
I'm parsing a whole bunch of Genbank records to get CDS sequences, and found one weird record that messes up my pipeline. The Genbank accession is S81162. It turns out it's a segmented record (the CDS joins four regions from four different Genbank entries). Reading the wiki, it seems like Bioperl should be able to recognize this, but I think maybe the code no longer parses that part of the Genbank record? Details are below.
I'd like to just check for segmented records and skip them so they don't throw my code and I can still parse all the other records in the same file (I don't need every single CDS - I think segmented records will be rare).
From the help given here: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Getting_the_Annotations it looks like I should be able to use code like this:
my $anno_collection = $seq_obj->annotation;
and
$anno_collection->get_all_annotation_keys;
to recognize that this record has a SEGMENT annotation. However, it actually looks like the SEGMENT annotation is ignored when the Genbank record is parsed. Shouldn't it have come through into $anno_collection? I updated my bioperl from github this morning.
I guess I can parse each record outside of bioperl before I go into Bioperl, but it'd be great if I can just use Bioperl to get at those SEGMENT annotations. Does that seem easy to implement (or re-implement - from the HOWTO page it looks like it was something that used to be possible)?
Some code that hopefully shows what I mean is below.
thanks very much,
Janet Young
Dr. Janet Young Malik lab http://research.fhcrc.org/malik/en.html Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA. tel: (206) 667 4512
email: jayoung ...at... fhcrc.org
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
use Bio::DB::EUtilities;
##### get the troublesome sequence from Genbank:
my $file = "S81162.gb";
if (!-e $file) {
my $factory = Bio::DB::EUtilities->new(-eutil => 'efetch',
-db => 'protein',
-rettype => 'gb',
-email => '[email protected]',
-id => "S81162");
$factory->get_Response(-file => $file);
}
#### parse it:
my $seqIN = Bio::SeqIO->new(-file => "$file", '-format' => 'Genbank');
while (my $seq = $seqIN ->next_seq) {
##### look at the annotations - SEGMENT is not captured
my $anno_collection = $seq->annotation;
for my $key ( $anno_collection->get_all_annotation_keys ) {
my @annotations = $anno_collection->get_Annotations($key);
for my $value ( @annotations ) {
print "tagname : ", $value->tagname, "\n";
print " annotation value: ", $value->display_text, "\n\n";
}
}
}
Hi Janet,
this doesn't solve the possible bug, but there might be another Bioperl way. The DEFINITION line also includes the "segment" statement. Thus, you might skip those entries e.g. with:
while (my $seq = $seqIN ->next_seq) {
next if ($seq->desc() =~ /,\s+segment\s+\d+\s+of\s+\d+\]/); # not the prettiest regex
...
}
But frankly, I don't know if this is true for all gbk records.
But, if you're in Unix, I still think the easiest and fastest way is to grep on those files and move/remove them in a one-line shell loop.
HTH, Andreas
Thanks, Andreas. I am also not sure how consistent the DEFINITION line is.
The Unix grep isn't so great for me, because each of my files contains multiple Genbank entries, but I'm doing something similar in plain perl, splitting the file on "//\n" and keeping a list of any records that contain "\nSEGMENT\s".
Janet
@jayoung we'll have a look and try to get some support in for the next release.
Thanks, Chris - good luck!
Link to problematic record:
http://www.ncbi.nlm.nih.gov/nuccore/S81162