jbrowse
bin/biodb-to-json.pl creates duplicate features
When using biodb-to-json.pl to generate data from a Chado database, the script generates multiple features for a gene when that gene has several associated analysisfeatures. Is there a way to suppress the creation of these duplicate features? We do not wish to display these scores, or at the very least we would like to coalesce these entries into a single feature.
Not sure if I was able to reproduce this one! Had gotten an email awhile back, not sure if there are updates
I made a test schema with some data like this
chado=# select * from analysisfeature;
analysisfeature_id | feature_id | analysis_id | rawscore | normscore | significance | identity
--------------------+------------+-------------+----------+-----------+--------------+----------
8 | 8 | 5 | | | |
9 | 9 | 5 | | | |
10 | 10 | 5 | | | |
12 | 9 | 6 | | | |
11 | 8 | 6 | | | |
13 | 10 | 6 | | | |
(6 rows)
chado=# select feature_id,dbxref_id,organism_id,name,uniquename,seqlen,md5checksum,type_id,is_analysis,is_obsolete,timeaccessioned,timelastmodified from feature;
feature_id | dbxref_id | organism_id | name | uniquename | seqlen | md5checksum | type_id | is_analysis | is_obsolete | timeaccessioned | timelastmodified
------------+-----------+-------------+-------+------------+--------+-------------+---------+-------------+-------------+----------------------------+----------------------------
5 | | 5 | ctgB | ctgB | 6079 | | 638 | t | f | 2017-01-19 21:08:22.238024 | 2017-01-19 21:08:22.238024
6 | | 5 | ctgA | ctgA | 50001 | | 638 | t | f | 2017-01-19 21:08:22.238024 | 2017-01-19 21:08:22.238024
8 | | 5 | match | match | | | 1319 | t | f | 2017-01-19 21:08:27.395593 | 2017-01-19 21:08:27.395593
9 | | 5 | part1 | part1 | | | 2278 | t | f | 2017-01-19 21:08:27.395593 | 2017-01-19 21:08:27.395593
10 | | 5 | part2 | part2 | | | 2278 | t | f | 2017-01-19 21:08:27.395593 | 2017-01-19 21:08:27.395593
(5 rows)
chado=# select * from analysis;
analysis_id | name | description | program | programversion | algorithm | sourcename | sourceversion | sourceuri | timeexecuted
-------------+-----------+-------------+---------+----------------+-----------+------------+---------------+-----------+----------------------------
5 | analysis1 | | | null | | | | | 2017-01-19 21:08:22.238024
6 | analysis2 | | | null | | | | | 2017-01-19 21:08:22.238024
(2 rows)
Then I ran prepare-refseqs with volvox.fa, and biodb-to-json.pl with a config like this:
{
  "tracks" : [
    {
      "feature" : [
        "match"
      ],
      "track" : "alignments"
    }
  ],
  "TRACK DEFAULTS" : {
    "autocomplete" : "all",
    "class" : "feature"
  },
  "db_args" : {
    "-dsn" : "dbi:Pg:dbname=chado;host=localhost",
    "-user" : "yyyyyyyyyyyyyyyyyy",
    "-pass" : "xxxxxxxxxxxxxxxx"
  },
  "description" : "Volvox Example Database",
  "db_adaptor" : "Bio::DB::Das::Chado"
}
It didn't seem that this output had any duplicate features though
OK, after lots of digging, it appears that the features subroutine in Bio::DB::Das::Chado::Segment uses the following query to select distinct features:
my $select_part = "select distinct f.name,fl.fmin,fl.fmax,fl.strand,fl.phase,"
."fl.locgroup,fl.srcfeature_id,f.type_id,f.uniquename,"
."f.feature_id, af.significance as score, "
."fd.dbxref_id,f.is_obsolete ";
my $order_by = "order by f.type_id,fl.fmin ";
which leads me to believe that the important part here is the SELECT DISTINCT including af.significance: two analysisfeature rows with different significance values make otherwise identical result rows distinct. Could you try changing your test data to have distinct values for significance? I think that would reproduce the duplicate features problem.
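To illustrate the suspected mechanism, here is a minimal sketch using an in-memory SQLite database rather than Chado on PostgreSQL. The two tables are cut down to the columns that matter, and the LEFT JOIN on analysisfeature is an assumption about how the real query in Segment.pm brings in af.significance; only the column names come from the query quoted above.

```python
import sqlite3

# Cut-down stand-ins for the Chado feature and analysisfeature tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id INTEGER,
    analysis_id INTEGER,
    significance REAL
);
INSERT INTO feature VALUES (8, 'match');
""")

def count_rows(significances):
    """Attach one analysisfeature per significance value to feature 8,
    then count the rows returned by a SELECT DISTINCT that includes the score."""
    conn.execute("DELETE FROM analysisfeature")
    conn.executemany(
        "INSERT INTO analysisfeature (feature_id, analysis_id, significance) "
        "VALUES (8, ?, ?)",
        list(enumerate(significances, start=5)),  # (analysis_id, significance)
    )
    rows = conn.execute("""
        SELECT DISTINCT f.name, f.feature_id, af.significance AS score
        FROM feature f
        LEFT JOIN analysisfeature af ON af.feature_id = f.feature_id
    """).fetchall()
    return len(rows)

print(count_rows([None, None]))   # both scores NULL: rows collapse
print(count_rows([0.01, 0.05]))   # distinct scores: one row per analysis
```

With both significance values NULL, as in the test data above, DISTINCT collapses the joined rows into one; with two different values, the same feature comes back twice.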
For reference: http://cpansearch.perl.org/src/SCAIN/Bio-DB-Das-Chado-0.35a/lib/Bio/DB/Das/Chado/Segment.pm
The relevant call to Bio::DB::Das::Chado::Segment: https://github.com/GMOD/jbrowse/blob/dd5e7a9e65b9cff9920d1857c677ecf89b009150/src/perl5/Bio/JBrowse/Cmd/BioDBToJson.pm#L106-L107
A possible solution may be to filter out duplicate features present in $db_stream in Bio::JBrowse::Cmd::BioDBToJson. This should probably be controlled by a config option specific to each track.
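A rough sketch of that filtering idea, written in Python for illustration only (the real change would be Perl inside BioDBToJson.pm). The function name dedupe_features, the dict-shaped features, and the per-track "dedupe" flag are all hypothetical and not existing JBrowse options; the only assumption taken from the thread is that duplicated rows share a feature_id.

```python
def dedupe_features(stream, enabled=True, key="feature_id"):
    """Yield features from stream, skipping any whose key was already seen.

    enabled stands in for a hypothetical per-track config option; when it is
    False the stream passes through unchanged.
    """
    if not enabled:
        yield from stream
        return
    seen = set()
    for feature in stream:
        ident = feature[key]
        if ident in seen:
            continue  # same feature again, differing only in its score
        seen.add(ident)
        yield feature


# Hypothetical rows: feature 8 appears twice because of two analysis scores.
rows = [
    {"feature_id": 8, "name": "match", "score": 0.01},
    {"feature_id": 8, "name": "match", "score": 0.05},
    {"feature_id": 9, "name": "part1", "score": None},
]
track_conf = {"track": "alignments", "dedupe": True}  # "dedupe" is made up

unique = list(dedupe_features(rows, enabled=track_conf["dedupe"]))
print([f["feature_id"] for f in unique])
```

This keeps the first row seen for each feature and drops the rest, so the surviving score is arbitrary. That is acceptable if the scores are not displayed; coalescing the scores into one feature would need a different merge step.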
Changing a significance value does indeed seem to duplicate the features!
But if the data for two different features is not the same, don't we want to treat them as two distinct features?
This is occurring when a single feature has multiple analysis values. There is only one feature, so the physical location (and feature_id) is identical in the database.
The bug causes feature duplication when multiple analysisfeatures associated with a single feature have distinct significance values (see my Feb 15th comment about Bio::DB::Das::Chado::Segment). This can happen when you have two statistics, e.g. p(HeterozygoteExcess) and p(HeterozygoteDeficit), for a single feature, in this case a SNP. I don't think adding analyses should translate to duplicating the features.