
biodb-to-json.pl: Tracks empty when chromosome Name and ID are different in GFF3 file

Open nisaea opened this issue 7 years ago • 9 comments

Hello,

When the chromosome Name and ID differ in a GFF3 file, using biodb-to-json.pl results in tracks containing no data in JBrowse.

We discovered this issue by trying to use GBrowse's yeast sample dataset in JBrowse. Here are the steps needed to reproduce the error:

  • Create a MySQL database and fill it with the data from both yeast directories here https://github.com/GMOD/GBrowse/tree/master/sample_data using bp_seqfeature_load.pl
  • Create a JSON config file with the database info and a few tracks (here's the script I use to do so http://pastebin.com/kKLvWdYd )
  • Use prepare-refseqs.pl --conf dbconf.json and biodb-to-json.pl --conf dbconf.json
  • Open in JBrowse

You should see the reference sequence properly, but none of the other tracks. A 404 error on every trackData.json file should appear in the logs. Our understanding is that despite the files being stored in the right place, the URL is generated using the chromosome Name instead of the ID (which is used to store the files on the filesystem). i.e. http://servername/jbrowse/path_to/data/tracks/CDS/ChrI/trackData.json with an uppercase C, which is the Name argument in the gff file, instead of http://servername/jbrowse/path_to/data/tracks/CDS/chrI/trackData.json as it is stored in the filesystem.
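To make the mismatch concrete, here is a minimal Python sketch (not JBrowse code). The helper `track_data_path` and the `feature` dict are hypothetical; only the path layout and the `chrI`/`ChrI` values come from the URLs above.

```python
from posixpath import join

# Placeholder data root, copied from the example URLs above.
DATA_ROOT = "path_to/data"


def track_data_path(track_label: str, refseq: str) -> str:
    """Build the relative location of a track's trackData.json for one refseq."""
    return join(DATA_ROOT, "tracks", track_label, refseq, "trackData.json")


# The chromosome feature as described in this report: ID and Name differ in case.
feature = {"ID": "chrI", "Name": "ChrI"}

written = track_data_path("CDS", feature["ID"])      # where the files end up on disk
requested = track_data_path("CDS", feature["Name"])  # what the web app asks for

print(written)
print(requested)
print(written == requested)
```

On a case-sensitive filesystem the two strings point at different directories, hence the 404.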

Thank you in advance for your help.

nisaea avatar Jan 25 '17 10:01 nisaea

Are you referring to chromosome features in the GFF (like a single line that has column three type being chromosome)? Why is there a mixture of Chr1 and chr1 in the data? Do you have ID=chr1;Name=Chr1? What would be the purpose of having both?

I'm not trying to be antagonistic but just to understand the problem. There are some functions in jbrowse that try to "normalize" refseq names so small differences like chr1 vs Chr1 don't matter, but it's not perfect and doesn't really cover a lot of cases, like maybe this one.

cmdcolin avatar Jan 27 '17 03:01 cmdcolin

Hello and thanks for the reply!

Are you referring to chromosome features in the GFF (like a single line that has column three type being chromosome)?

Yes, that's the only thing in the input data that contains an uppercase ChrX.

do you have ID=chr1;Name=Chr1?

Exactly. This is data from the GMOD GBrowse repo; you can check out the file there if you wish to see how it's built.

What would be the purpose of having both?

My understanding is that ID is a unique identifier and Name is a display name (see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md ), which is consistent with what's going on in that file, so I don't see anything weird here. That's why I'm still scratching my head over why Name was used as an identifier, which seems pretty dangerous to me. IMHO, no amount of normalization can ensure that display names will always be consistent with unique identifiers, since people are free to put whatever they want in there by definition.
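For illustration, a small Python sketch of how the two attributes come out of column 9. The GFF3 line is a made-up example shaped like the one discussed here (the coordinates and source are placeholders, not copied from the GBrowse file), and `parse_gff3_attributes` is a hypothetical helper, not part of JBrowse or BioPerl.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column into {tag: [values]}.

    Tags are separated by ';', tag and value by '=', multiple values by ','.
    Percent-encoded characters are decoded after splitting.
    """
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


# Hypothetical chromosome line: ID is the identifier, Name the display name.
line = "chrI\texample\tchromosome\t1\t1000\t.\t.\t.\tID=chrI;Name=ChrI"
attrs = parse_gff3_attributes(line.split("\t")[8])

print(attrs["ID"])
print(attrs["Name"])
```

Nothing in the format ties the two values together, so code that treats them as interchangeable only works when they happen to match.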

I hope this helps clarify the issue. :)

nisaea avatar Jan 27 '17 09:01 nisaea

I think the current behavior makes sense, and I imagine in the case where no Name is specified, that it just makes ID == Name. It seems in this case that jbrowse simply chooses one thing to identify the sequence by, and it chooses the 'name' attribute rather than the 'id' (or uniquename in chado land).

It is maybe useful to step back and just consider that jbrowse in general only identifies refseqs by a single name, so it cannot distinguish between a "sequence ID" and its "name"; that is why it just chooses one thing from the database.

Therefore, instead of this being a particular bug in biodb-to-json, it's just a symptom of the fact that only one refseq name is ever used in jbrowse. Of course, it would be great if multiple types of sequence names were supported, e.g. chr1, 1, or some ncbi ID for the chromosome, all at the same time, but that is not the case currently.

cmdcolin avatar Feb 17 '17 02:02 cmdcolin

Note that, if you really wanted to, you could also put the alternative name as an Alias in the gff, in which case generate-names will index that alternative chromosome name. This has many limitations, though, and the alias won't be treated as a proper chromosome identifier in jbrowse.

cmdcolin avatar Feb 17 '17 02:02 cmdcolin

Hello and thanks again for the reply!

I hear you on the single refseq not necessarily needing a robust identification system. However, I'm afraid you sort of missed the point here. Setting aside the debate over why using display names as unique identifiers may be a bad idea, JBrowse still inconsistently uses both "Name" and "ID" here, which results in a 404 error.

More precisely, it uses the ID to create the directories in which it writes the data, but uses the Name to build the URIs it uses to fetch that same data from the web app. As a result, the query fails and the tracks appear empty.
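A quick way to spot this on an existing data directory is to compare the refseq names the web app knows about with the directories actually present under each track. The sketch below is a hypothetical diagnostic, not part of JBrowse. It assumes the names are listed in `seq/refSeqs.json` as objects with a `name` key, and that tracks live under `tracks/<label>/<refseq>/` as in the URLs quoted earlier.

```python
import json
import os


def find_missing_refseq_dirs(data_dir: str) -> list:
    """Return (track_label, refseq_name) pairs that have no matching directory.

    Names are compared as exact strings, so 'ChrI' and 'chrI' count as
    different even on a case-insensitive filesystem.
    """
    # Assumed location and shape of the refseq list: [{"name": ...}, ...]
    with open(os.path.join(data_dir, "seq", "refSeqs.json")) as fh:
        names = [entry["name"] for entry in json.load(fh)]

    tracks_dir = os.path.join(data_dir, "tracks")
    missing = []
    for label in sorted(os.listdir(tracks_dir)):
        label_dir = os.path.join(tracks_dir, label)
        if not os.path.isdir(label_dir):
            continue
        on_disk = set(os.listdir(label_dir))
        for name in names:
            if name not in on_disk:
                missing.append((label, name))
    return missing


# Usage (the path is a placeholder):
# for label, name in find_missing_refseq_dirs("path_to/data"):
#     print(f"{label}: no directory named {name!r}")
```

A track may legitimately have no directory for a refseq without features, so a hit here is a hint to look closer rather than proof of the bug.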

Indeed, I agree it wouldn't be a bug if it used Name instead of ID everywhere, but the inconsistent use of both is where the problem lies.

nisaea avatar Feb 17 '17 13:02 nisaea

Ah ok. I guess it needs to standardize on one or the other

cmdcolin avatar Feb 17 '17 14:02 cmdcolin

Yes. And since it seemed safer in the long run, that's why I suggested using ID everywhere. Also, I believe it would require fewer changes, but I might be wrong here; I haven't looked into the code long enough to be sure.

nisaea avatar Feb 17 '17 16:02 nisaea

Update: Never mind, it's probably easier to change the paths on the filesystem. The mismatch also makes generate-names.pl fail, as the find_names_files subroutine in IndexNames.pm also uses Name to build the paths it looks into. I looked into the hashes it gets the Name from, and for some reason the IDs aren't even stored in there. Do you have an idea why?

nisaea avatar Feb 23 '17 09:02 nisaea

Any chance somebody can write a failing test for this in tests/perl_tests?

rbuels avatar Jan 30 '18 17:01 rbuels