BridgeDb Massive duplication of datasources

I still can't get my head around why we have so much duplication of data sources in org.bridgedb.bio vs org.bridgedb.rdf:

https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt
https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/BioDataSource.ttl
https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/DataSource.ttl
https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.txt
https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl
https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/MiriamRegistry.ttl

Fixing a data source name, URI pattern or system code in any one of these files seems to require editing all the others.

This means it is basically impossible to edit.

What is the point of this duplication?

Dec 09 '15 12:12 stain

1. https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt is the historical source for BridgeDb projects that are not OPS/RDF based.
2. https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/MiriamRegistry.ttl is a download from Miriam / Identifiers.org.
3. https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/DataSource.ttl is for datasource information NOT found in either of the two above.

These are the three files that are read in!

The others are generated from the above purely for information and are never read by the system:

https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/BioDataSource.ttl is 1 in the ttl format.

https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.txt is 2 in the format used for 1.

https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl is 3 in the ttl format.
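
To make the split between input files and generated files easier to see, here is a minimal sketch that records the roles exactly as described in this comment. The class `DataSourceFiles` and the enum `Role` are hypothetical names used only for illustration; they are not part of BridgeDb.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BridgeDb: it only restates the roles
// of the six resource files as described in the comment above.
final class DataSourceFiles {

    enum Role { READ_AT_STARTUP, GENERATED_FOR_INFORMATION }

    /** Returns the six files (paths relative to the repository root) with their roles. */
    static Map<String, Role> roles() {
        Map<String, Role> roles = new LinkedHashMap<>();
        // The three files that are actually read in.
        roles.put("org.bridgedb.bio/resources/org/bridgedb/bio/datasources.txt", Role.READ_AT_STARTUP);
        roles.put("org.bridgedb.rdf/resources/MiriamRegistry.ttl", Role.READ_AT_STARTUP);
        roles.put("org.bridgedb.rdf/resources/DataSource.ttl", Role.READ_AT_STARTUP);
        // The three files that are generated purely for information.
        roles.put("org.bridgedb.rdf/resources/BioDataSource.ttl", Role.GENERATED_FOR_INFORMATION);
        roles.put("org.bridgedb.rdf/resources/IdentifiersOrgDataSource.txt", Role.GENERATED_FOR_INFORMATION);
        roles.put("org.bridgedb.rdf/resources/IdentifiersOrgDataSource.ttl", Role.GENERATED_FOR_INFORMATION);
        return roles;
    }

    public static void main(String[] args) {
        // Print one line per file: role first, then path.
        roles().forEach((file, role) -> System.out.println(role + "  " + file));
    }
}
```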

Dec 09 '15 12:12 Christian-B

The general OPS rule has been:

Never change *.txt, but use the settings there wherever applicable.
Attempt to get as much as possible into Miriam and then grab a new copy of their registry.
Only if the info is not in 1, and is either not applicable for Miriam or would take too long to get in there, is it added to https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/resources/DataSource.ttl

Also into 3 goes the case where the datasource part of the URI set in 1 is too short. For example CHEBI, where not all URIs have the capital "CHEBI", so that cannot be part of the ID.

So if a datasource exists in 1, use the BridgeDb short code.

If it does not exist in 1 but does in Miriam, use their code.

Only make up a new datasource code if it is in neither 1 nor 2.
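
The precedence described above can be sketched as a simple lookup chain. This is only an illustration of the rule, not how BridgeDb implements it: the class `DataSourceCodeResolver`, the method `resolveCode`, and all names and codes in the example maps are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the rule above; not BridgeDb code.
final class DataSourceCodeResolver {

    /**
     * Picks the code for a data source name:
     * first the short code from datasources.txt (1), then the Miriam code (2),
     * and only if both are missing the newly made-up code.
     */
    static String resolveCode(String name,
                              Map<String, String> codesFromTxt,
                              Map<String, String> codesFromMiriam,
                              String newCode) {
        String code = codesFromTxt.get(name);
        if (code != null) {
            return code;            // exists in 1: use the BridgeDb short code
        }
        code = codesFromMiriam.get(name);
        if (code != null) {
            return code;            // not in 1 but in Miriam: use their code
        }
        return newCode;             // in neither: fall back to a new code
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> fromTxt = new HashMap<>();
        fromTxt.put("SourceA", "Sa");

        Map<String, String> fromMiriam = new HashMap<>();
        fromMiriam.put("SourceA", "source.a");
        fromMiriam.put("SourceB", "source.b");

        System.out.println(resolveCode("SourceA", fromTxt, fromMiriam, "Xa")); // Sa
        System.out.println(resolveCode("SourceB", fromTxt, fromMiriam, "Xb")); // source.b
        System.out.println(resolveCode("SourceC", fromTxt, fromMiriam, "Xc")); // Xc
    }
}
```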

Dec 09 '15 12:12 Christian-B

Thanks for clarifying the historical aspect. I am not sure why we need to keep the "information" ones around, as they just give me misinformation :)

I don't know how to update from Miriam or Identifiers.org, or what would break if I did. IdentifiersOrgDataSource.* contain some manual patches from when identifiers.org was wrong or breaking stuff (which was reported upstream, but as that takes a long time, the changes were also made locally here).

Dec 09 '15 13:12 stain

I've tried to summarize this in the README, but I think I still don't quite get the implied or expected data flow here.

@Christian-B - do you think you could clarify:

Which files are meant to be updated, and how?
Which files are generated from what?
Which files can be removed, if any?
How can we avoid all these files?

Dec 09 '15 14:12 stain

DataSource init is done by the refreshUriPatterns() method in https://github.com/bridgedb/BridgeDb/blob/master/org.bridgedb.rdf/src/org/bridgedb/rdf/UriPattern.java

NO changing of DataSources should be done outside of the code called by that method.

Dec 09 '15 14:12 Christian-B

Ideally, all BridgeDb users would agree to use only the ttl method.

That would allow merging the functionality and files from DataSourceTxt.init(), DataSourceMetaDataProvidor.assumeUnknownsAreBio() and BridgeDBRdfHandler.init().

This should now be possible as the RDF stuff is in the main branch!

I would not recommend removing or merging IdentifersOrgReader.init() and the files it depends on, as this is an independent source of metadata which is a useful and nearly free addition.

Either way, I would suggest that OPS developers, and anyone wanting to do things the OPS way, do all updates ONLY via BridgeDBRdfHandler.init() and the files it uses,

grabbing a new Miriam registry on a regular basis.

Dec 09 '15 14:12 Christian-B

Some more history:

we need a small datasources.txt for use of BridgeDb on WikiPathways (or used to)
we want a richer identifiers.org-based version
we may want to support other files
we currently support two formats, but in the future (near or later) we may use only RDF if the overhead is not too large

That said, this should not stop us from cleaning things up, and IMHO now is the right time to do that in the master branch. After all, we have the bridgedb2.x branch for (current) stable releases.

Dec 12 '15 13:12 egonw

A nice compromise could be:

Use the RDF as the official source of this information.

But include the required DataSource data in every Derby file.

Then the RDF code would not be required for shipping.

Also, as the DataSource data is packed together with the links, there is no danger of later DataSource changes breaking the links.

Dec 14 '15 09:12 brenninc

> we need a small datasources.txt for use of BridgeDb on WikiPathways (or used to)

For the WikiPathways frontend, bridgedbjs currently depends on datasources.txt, but that code could be easily changed to use another format, as long as the same information is available. Then there wouldn't be any frontend dependencies on datasources.txt.

Do you know whether PathVisio (Java) depends on datasources.txt?

Sep 15 '16 18:09 ariutta
