bin/biodb-to-json.pl creates duplicate features

Open astralarya opened this issue 7 years ago • 8 comments

When using biodb-to-json.pl to generate data from a Chado database, the script generates multiple features for a gene when that gene has several associated analysisfeatures. Is there a way to suppress the creation of these duplicate features? We do not wish to display these scores, or at the very least, would like to coalesce these entries into a single feature.

astralarya avatar Jan 10 '17 19:01 astralarya

I'm not sure I was able to reproduce this one. I got an email about it a while back; not sure whether there are any updates.

I made a test schema with some data like this:

chado=# select * from analysisfeature;
 analysisfeature_id | feature_id | analysis_id | rawscore | normscore | significance | identity
--------------------+------------+-------------+----------+-----------+--------------+----------
                  8 |          8 |           5 |          |           |              |
                  9 |          9 |           5 |          |           |              |
                 10 |         10 |           5 |          |           |              |
                 12 |          9 |           6 |          |           |              |
                 11 |          8 |           6 |          |           |              |
                 13 |         10 |           6 |          |           |              |
(6 rows)

chado=# select feature_id,dbxref_id,organism_id,name,uniquename,seqlen,md5checksum,type_id,is_analysis,is_obsolete,timeaccessioned,timelastmodified from feature;
 feature_id | dbxref_id | organism_id | name  | uniquename | seqlen | md5checksum | type_id | is_analysis | is_obsolete |      timeaccessioned       |      timelastmodified
------------+-----------+-------------+-------+------------+--------+-------------+---------+-------------+-------------+----------------------------+----------------------------
          5 |           |           5 | ctgB  | ctgB       |   6079 |             |     638 | t           | f           | 2017-01-19 21:08:22.238024 | 2017-01-19 21:08:22.238024
          6 |           |           5 | ctgA  | ctgA       |  50001 |             |     638 | t           | f           | 2017-01-19 21:08:22.238024 | 2017-01-19 21:08:22.238024
          8 |           |           5 | match | match      |        |             |    1319 | t           | f           | 2017-01-19 21:08:27.395593 | 2017-01-19 21:08:27.395593
          9 |           |           5 | part1 | part1      |        |             |    2278 | t           | f           | 2017-01-19 21:08:27.395593 | 2017-01-19 21:08:27.395593
         10 |           |           5 | part2 | part2      |        |             |    2278 | t           | f           | 2017-01-19 21:08:27.395593 | 2017-01-19 21:08:27.395593
(5 rows)

chado=# select * from analysis;
 analysis_id |   name    | description | program | programversion | algorithm | sourcename | sourceversion | sourceuri |        timeexecuted
-------------+-----------+-------------+---------+----------------+-----------+------------+---------------+-----------+----------------------------
           5 | analysis1 |             |         | null           |           |            |               |           | 2017-01-19 21:08:22.238024
           6 | analysis2 |             |         | null           |           |            |               |           | 2017-01-19 21:08:22.238024
(2 rows)


I then ran prepare-refseqs with volvox.fa, and biodb-to-json.pl with a config like this:

{
   "tracks" : [
      {
         "feature" : [
            "match"
         ],
         "track" : "alignments"
      }
   ],
   "TRACK DEFAULTS" : {
      "autocomplete" : "all",
      "class" : "feature"
   },
   "db_args" : {

       "-dsn":"dbi:Pg:dbname=chado;host=localhost",
       "-user":"yyyyyyyyyyyyyyyyyy",
       "-pass":"xxxxxxxxxxxxxxxx"

   },
   "description" : "Volvox Example Database",
   "db_adaptor" : "Bio::DB::Das::Chado"
}

The output didn't seem to contain any duplicate features, though.

cmdcolin avatar Jan 27 '17 03:01 cmdcolin

OK, after lots of digging, it appears that the features subroutine in Bio::DB::Das::Chado::Segment uses the following query to select distinct features:

  my $select_part = "select distinct f.name,fl.fmin,fl.fmax,fl.strand,fl.phase,"
                   ."fl.locgroup,fl.srcfeature_id,f.type_id,f.uniquename,"
                   ."f.feature_id, af.significance as score, "
                   ."fd.dbxref_id,f.is_obsolete ";

  my $order_by    = "order by f.type_id,fl.fmin ";

This leads me to believe that the important part here is the SELECT DISTINCT that includes af.significance: rows for the same feature that differ only in significance all survive the DISTINCT. Could you try changing your test data to have distinct values for significance? I think that would reproduce the duplicate-features problem.
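
To illustrate the suspected mechanism outside of Chado, here is a minimal sketch using Python's sqlite3 module. The two tables are a made-up, cut-down stand-in for Chado's feature and analysisfeature tables, and the query is a simplified version of the one above (the real query in Segment.pm selects more columns and joins more tables). It assumes the real query joins feature to analysisfeature, which the af.significance column suggests.

```python
import sqlite3

# Toy schema: a simplified stand-in for Chado's feature and analysisfeature
# tables (assumption: only the columns needed for the demo, not the real DDL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE analysisfeature (
        analysisfeature_id INTEGER PRIMARY KEY,
        feature_id INTEGER,
        analysis_id INTEGER,
        significance REAL
    );
    INSERT INTO feature VALUES (8, 'match');
    INSERT INTO analysisfeature VALUES (11, 8, 5, NULL), (12, 8, 6, NULL);
""")

# Simplified version of the query: DISTINCT is applied to the whole row,
# including the score column taken from analysisfeature.
QUERY = """
    SELECT DISTINCT f.name, f.feature_id, af.significance AS score
    FROM feature f
    LEFT JOIN analysisfeature af ON af.feature_id = f.feature_id
"""

# Both significance values are NULL, so the two joined rows are identical
# and DISTINCT collapses them into one.
rows_null = conn.execute(QUERY).fetchall()

# Give the two analyses different significance values: the joined rows now
# differ in the score column, so DISTINCT keeps both.
conn.execute("UPDATE analysisfeature SET significance = 0.01 WHERE analysis_id = 5")
conn.execute("UPDATE analysisfeature SET significance = 0.05 WHERE analysis_id = 6")
rows_distinct = conn.execute(QUERY).fetchall()

print(len(rows_null), len(rows_distinct))  # prints "1 2"
```

With empty significance values the feature comes back once; with two different values the same feature_id comes back twice, which matches the behaviour described in this issue.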

astralarya avatar Feb 15 '17 19:02 astralarya

For reference: http://cpansearch.perl.org/src/SCAIN/Bio-DB-Das-Chado-0.35a/lib/Bio/DB/Das/Chado/Segment.pm

astralarya avatar Feb 15 '17 19:02 astralarya

The relevant call to Bio::DB::Das::Chado::Segment is here: https://github.com/GMOD/jbrowse/blob/dd5e7a9e65b9cff9920d1857c677ecf89b009150/src/perl5/Bio/JBrowse/Cmd/BioDBToJson.pm#L106-L107

astralarya avatar Feb 15 '17 19:02 astralarya

A possible solution may be to filter out duplicate features present in $db_stream in Bio::JBrowse::Cmd::BioDBToJson. This should probably be controlled by a config option specific to each track.
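
The filtering idea can be sketched as follows. This is a Python illustration of the logic only, not the Perl code in BioDBToJson.pm; the function name dedupe_features, the enabled flag (standing in for a hypothetical per-track config option), and the dict-shaped features are all assumptions made for the example.

```python
def dedupe_features(stream, key="feature_id", enabled=True):
    """Yield each feature from `stream` once, keyed on `key`.

    `enabled` stands in for a hypothetical per-track config option;
    when it is False the stream is passed through unchanged.
    """
    if not enabled:
        yield from stream
        return
    seen = set()
    for feature in stream:
        ident = feature[key]
        if ident in seen:
            continue  # same feature_id already emitted: drop the repeat
        seen.add(ident)
        yield feature


# Example rows as they might come back when one feature has two
# analysisfeature records with different significance values.
rows = [
    {"feature_id": 8, "name": "match", "score": 0.01},
    {"feature_id": 8, "name": "match", "score": 0.05},
    {"feature_id": 9, "name": "part1", "score": None},
]

filtered = list(dedupe_features(rows))
unfiltered = list(dedupe_features(rows, enabled=False))
```

Note that this keeps only the first row seen for each feature_id, so whichever score arrives first is the one that survives. That may be acceptable when the scores are not displayed at all.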

astralarya avatar Feb 15 '17 19:02 astralarya

Changing a significance value does indeed seem to duplicate the features!

cmdcolin avatar Feb 17 '17 02:02 cmdcolin

But if the data for two different features is not the same, don't we want to treat them as two distinct features?

rbuels avatar Jan 30 '18 17:01 rbuels

This is occurring when a single feature has multiple analysis values. There is only one feature, so the physical location (and feature_id) is identical in the database.

The bug causes feature duplication when multiple analysisfeatures with distinct significance values are associated with a single feature (see my Feb 15th comment about Bio::DB::Das::Chado::Segment). This can happen when you have two statistics, e.g. p(HeterozygoteExcess) and p(HeterozygoteDeficit), for a single feature, in this case a SNP. I don't think adding analyses should translate to duplicating the features.
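
As a rough sketch of what coalescing could look like, the Python snippet below merges rows that share a feature_id into one record and collects their scores into a list. It is an illustration only: the function name coalesce_scores, the scores field, and the dict-shaped rows are assumptions, not part of JBrowse or Bio::DB::Das::Chado.

```python
def coalesce_scores(rows):
    """Merge rows sharing a feature_id into one record with a `scores` list."""
    merged = {}  # feature_id -> merged record, in first-seen order
    for row in rows:
        fid = row["feature_id"]
        if fid not in merged:
            # Copy everything except the per-analysis score.
            record = {k: v for k, v in row.items() if k != "score"}
            record["scores"] = []
            merged[fid] = record
        if row["score"] is not None:
            merged[fid]["scores"].append(row["score"])
    return list(merged.values())


# One SNP with two statistics, plus a feature with no score at all.
rows = [
    {"feature_id": 8, "name": "snp1", "fmin": 100, "fmax": 101, "score": 0.01},
    {"feature_id": 8, "name": "snp1", "fmin": 100, "fmax": 101, "score": 0.05},
    {"feature_id": 9, "name": "snp2", "fmin": 200, "fmax": 201, "score": None},
]

features = coalesce_scores(rows)
```

Here the SNP with two analysis values ends up as a single feature carrying both scores, rather than two features at the same location.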

astralarya avatar Feb 17 '18 07:02 astralarya