BridgeDb
Massive duplication of datasources
I still can't get my head around why we have so much duplication of data sources in org.bridgedb.bio vs org.bridgedb.rdf:
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/BioDataSource.ttl
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/DataSource.ttl
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.txt
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/MiriamRegistry.ttl
Fixing a data source name, URI pattern or system code in any one of these files seems to require editing all the others.
This means it is basically impossible to edit.
What is the point of this duplication?
1. https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt is the historical source for non-OPS, non-RDF BridgeDb projects.
2. https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/MiriamRegistry.ttl is a download from Miriam / Identifiers.org.
3. https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/DataSource.ttl is for data source information NOT found in either of the two above.
These are the three read in!
The others are generated from the above purely for information and are never read by the system:
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/BioDataSource.ttl is 1 in ttl format.
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.txt is 2 in the format used for 1.
- https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl is 3 in the ttl format.
The general OPS rule has been:
- Never change *.txt, but use the settings there wherever applicable.
- Attempt to get as much as possible into Miriam and then grab a new copy of their registry.
- Only if info is not in 1, and is either not applicable for Miriam or would take too long to get in there, is it added to https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/DataSource.ttl
Also into 3 goes the case where the data source part of the URI set in 1 is too short. For example CHEBI, where not all URIs have the capital "CHEBI", so that cannot be part of the ID.
So if a data source exists in 1, use the BridgeDb short code.
If it does not exist in 1 but does in Miriam, use their code.
Only make up a new data source code if it is in neither 1 nor 2.
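The precedence above could be sketched roughly as follows. This is only an illustration, not BridgeDb code: the class SystemCodeChooser, its two maps, and the sample entries in the usage note are hypothetical stand-ins for what would be read from datasources.txt (1) and MiriamRegistry.ttl (2).

```java
import java.util.Map;

/** Hypothetical helper illustrating the code precedence; not part of BridgeDb. */
final class SystemCodeChooser {
    // Assumed stand-in for name -> system code as listed in datasources.txt (1).
    private final Map<String, String> bridgeDbCodes;
    // Assumed stand-in for name -> code as listed in MiriamRegistry.ttl (2).
    private final Map<String, String> miriamCodes;

    SystemCodeChooser(Map<String, String> bridgeDbCodes, Map<String, String> miriamCodes) {
        this.bridgeDbCodes = bridgeDbCodes;
        this.miriamCodes = miriamCodes;
    }

    /**
     * Returns the code to use for a data source name:
     * the BridgeDb short code if the name is in 1, else the Miriam code
     * if it is in 2, else the caller-supplied new code.
     */
    String choose(String name, String newCodeIfUnknown) {
        String code = bridgeDbCodes.get(name);
        if (code != null) {
            return code;
        }
        code = miriamCodes.get(name);
        if (code != null) {
            return code;
        }
        return newCodeIfUnknown;
    }
}
```

For example, with illustrative maps Map.of("Ensembl", "En") for 1 and Map.of("Ensembl", "ensembl", "ChEMBL compound", "chembl.compound") for 2, choose("Ensembl", "X") picks the code from 1, choose("ChEMBL compound", "X") falls through to the code from 2, and an unknown name gets the newly made-up code.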
Thanks for clarifying the historical aspect. I am not sure why we need to keep the "information" ones around, as they just give me misinformation :)
I don't know how to update from Miriam or Identifiers.org, or what would break if I did. The IdentifiersOrgDataSource.* files contain some manual patches from when identifiers.org was wrong or breaking stuff. (This was reported upstream, but as that takes a long time, the changes were also made locally here.)
I've tried to summarize this in the README, but I think I still don't quite get the implied or expected data flow here.
@Christian-B - do you think you could clarify:
- Which files are meant to be updated, and how?
- Which files are generated from what?
- Which files can be removed, if any?
- How can we avoid all these files?
DataSource init is done by the refreshUriPatterns() method in https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/src/org/bridgedb/rdf/UriPattern.java
No changing of DataSources should be done outside of code called by that method.
Ideal would be for all BridgeDb users to agree to use only the ttl method, which would allow merging the functionality and files from DataSourceTxt.init(); DataSourceMetaDataProvidor.assumeUnknownsAreBio(); BridgeDBRdfHandler.init();
This should now be possible as the RDF stuff is in the main branch!
I would not recommend removing or merging IdentifersOrgReader.init(); and the files it depends on, as this is an independent source of metadata which is a useful and near-free addition.
Either way, I would suggest that OPS developers and anyone wanting to do things the OPS way do all updates ONLY via BridgeDBRdfHandler.init(); and the files it uses, and grab a new Miriam registry on a regular basis.
Some more history.
- we need a small datasources.txt for use of BridgeDb on WikiPathways (or used to)
- we want a richer identifiers.org-based version
- we may want to support other files
- we currently support two formats, but in the future (near or later) we would use RDF if the overhead is not too large
That said, that should not inhibit us from cleaning things up, and, IMHO, now is the right time to do that in the master branch. After all, we have the bridgedb2.x branch for (current) stable releases.
A nice compromise could be:
Use the RDF as the official source of this information.
But include the required DataSource data in any Derby file.
Then for shipping the RDF code would not be required.
Also as the DataSource stuff is tightly packed with the links there is no danger of breaking the links with later dataSource changes.
we need a small datasources.txt for use of BridgeDb on WikiPathways (or used to)
For the WikiPathways frontend, bridgedbjs currently depends on datasources.txt, but that code could be easily changed to use another format, as long as the same information is available. Then there wouldn't be any frontend dependencies on datasources.txt.
Do you know whether PathVisio (Java) depends on datasources.txt?