BridgeDb icon indicating copy to clipboard operation
BridgeDb copied to clipboard

Tolerate loading empty linkset files into IMS

Open randykerber opened this issue 6 years ago • 19 comments

Currently, when attempt to load a linkset file into IMS, if the linkset file is empty an Exception is thrown and the load terminates. emty means there are zero links in the linkset file. Here is contents of such a file:

randykerber avatar Nov 27 '17 17:11 randykerber

file: PDB/LINKSET_EXACT_PDB20171110.ttl

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ops: <http://chemistry.openphacts.org//> .
@prefix : <http://chemistry.openphacts.org/download/20171110/PDB/LINKSET_EXACT_PDB20171110.ttl#> .
<http://chemistry.openphacts.org/download/20171110/PDB/LINKSET_EXACT_PDB20171110.ttl> void:inDataset <http://chemistry.openphacts.org/download/20171110/void_2017-11-10.ttl#pdb_exactMatch> .

randykerber avatar Nov 27 '17 17:11 randykerber

The only triple in the file is a void:inDataset statement. There are no "link" triples.

randykerber avatar Nov 27 '17 17:11 randykerber

I remember running into this issue earlier in the year. When I looked into it I recall the Exception was caused by some sort of pre-check phase performed on each file, where some piece of code attempted to read the first link triple/line of the file and verify that it had the correct link-predicate (and perhaps other checks). However, when there was no "first record", it resulted in an Exception. I could not see a way to fix without possibly wrecking other functionality.

This was probably 6+ months ago, so memory is a bit fuzzy.

randykerber avatar Nov 27 '17 17:11 randykerber

The correct behavior is probably to issue some sort of WARNING message that an empty linkset file was encountered, but to otherwise allow the process to proceed. Basically, an empty linkset file should be allowed and proceed with the load.

randykerber avatar Nov 27 '17 17:11 randykerber

Can you justify why we should allow loading an empty linkset file?

It is a deliberate quality assurance step that the loader performs these test.

AlasdairGray avatar Nov 27 '17 17:11 AlasdairGray

We were thinking about automating things more, and in an automated process it's easy enough to have a data set where there are no links... they found three such link sets right now... is an empty link set semantically wrong?

egonw avatar Nov 27 '17 18:11 egonw

We took the view that it is wrong. You have metadata saying there are links between these two data sources but then there are no links.

As such, we wanted the automated loading process to fail so that we would have to fix the issue rather than just seeing a bunch of warnings that are easy to ignore and pass over.

AlasdairGray avatar Nov 28 '17 09:11 AlasdairGray

Agree with @AlasdairGray that a linkset with no links is not really a linkset (and so should not declare itself as that), and a warning would be easily ignored (as IMS has loads of them) - and so you would no longer detect the probably more common case that the void declares the wrong void:linkPredicate.

But perhaps it could be allowed if the void also says void:triples 0?

stain avatar Nov 28 '17 15:11 stain

I'm in favor of allowing empty sets. I don't think the IMS loading is the right place or time to complain about this. The IMS function does not break after loading an empty set (which is still a set). The predicate cannot be wrong either, with zero links, so an exception for that sounds semantically wrong to me.

I see a link set as the result of a calculation (process) to determine the links. In fact, for many of our link sets these are in fact calculated. For others, the process is an analysis of an external data set...

An explicit claim that given that process that there are zero links makes a lot of sense to me. Without any VoID we also loose the observation that there were no links.

egonw avatar Nov 28 '17 17:11 egonw

I think your final point is more about the outcome of the linking process rather than the use of the linksets.

I would argue that from the use of linksets, having a set with no links doesn't make sense and will have an impact on the performance of the system, as we could easily generate thousands of linksets with no links, e.g. UniProt:Protein to aers:AdverseEvent.

AlasdairGray avatar Nov 29 '17 10:11 AlasdairGray

I don't think performance will be an issue, but there might be an issue if the empty linkset causes a false transitive path possibility. Not tested!

But we can just add an --ignore-empty flag then and see how it goes? To the command line or loader XML file?

stain avatar Nov 30 '17 09:11 stain

I was not referring the performance of the loader, but of the computation of links.

@egonw can you give a concrete use case where you have a linkset with no links that should be loaded into the IMS?

AlasdairGray avatar Nov 30 '17 09:11 AlasdairGray

@AlasdairGray, Randy/Ian are trying to finish the OPS 2.2 API release and running into empty link sets. Since many link sets are autogenerated, as part of a update release flow, some end up empty. See the aforementioned example by Randy...

egonw avatar Dec 03 '17 12:12 egonw

From the data governance perspective, I think it is better that we weed out the empty linksets as part of the loading process. If they are truely meant to be empty then we don't need to load them, they would generate no query answers. If they are not meant to be empty then we need to find out why they are empty.

AlasdairGray avatar Dec 04 '17 21:12 AlasdairGray

@randykerber, what command line call did you make on the above (empty) link set that we can use to reproduce the exception? (And thus, test the --ignore-empty option to be implemented...)

egonw avatar Feb 05 '18 14:02 egonw

The load command line would have used a load.xml that loaded all openphacts linksets (more than 100). I'm guessing you want something more concise?

The second entry in this issue has a complete linkset file that triggered this exception. I can fetch the void file that goes along with this. That enough?

randykerber avatar Feb 06 '18 00:02 randykerber

I've attached the void file (below) that references the zero-link linkset listed in the second entry of this 'issues' page.

I had to rename the file to void_2017-11-10.ttl.txt, else GitHub would not accept it. Should be named void_2017-11-10.ttl.

void_2017-11-10.ttl.txt

randykerber avatar Feb 06 '18 00:02 randykerber

Stacktrace:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.simontuffs.onejar.Boot.run(Boot.java:340)
        at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: org.bridgedb.utils.BridgeDBException: Error loading 
        at uk.ac.manchester.cs.openphacts.ims.loader.RunLoader.main(RunLoader.java:241)
        ... 6 more
Caused by: org.bridgedb.utils.BridgeDBException: Found 0 where 1 expected
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.getObjectURI(LinksetLoader.java:169)
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.getParser(LinksetLoader.java:132)
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.load(LinksetLoader.java:91)
        at uk.ac.manchester.cs.openphacts.ims.loader.LinksetLoader.load(LinksetLoader.java:69)
        at uk.ac.manchester.cs.openphacts.ims.loader.RunLoader.loadLinkset(RunLoader.java:102)
        at uk.ac.manchester.cs.openphacts.ims.loader.RunLoader.main(RunLoader.java:225)
        ... 6 more

egonw avatar Apr 12 '18 12:04 egonw

OK, turns out that to test this, I need to bake at least a new .war and possibly a new Docker... I am working on this, but Tina will likely want to make a new PathVisio release and Anders a new pvjs release, so pushing this to the next release... which will, in all fairness, follow really soon... Oh, and Jonathan is waiting for a release too... #rsro

egonw avatar Apr 17 '18 17:04 egonw