Replace a time-consuming lookup for the ERD dataset
Problem
At the moment, the ERD dataset performs a time-consuming lookup operation to transform Freebase IRIs to DBpedia IRIs. See https://github.com/dice-group/gerbil/blob/master/src/main/java/org/aksw/gerbil/dataset/impl/erd/ERDDataset2.java#L100
Solution
There are two possible solutions:
- The easiest solution would be to run the lookup once and store the dataset with the retrieved DBpedia IRIs (e.g., as a NIF file).
- Make use of a sameAs lookup service.
Hello @MichaelRoeder, I'd like to work on this issue. Could you provide me with more details?
Hi 🙂
Let's go with solution 1 since solution 2 is more or less blocked at the moment.
(Note: the code below has not been tested... so expect some surprises on the way 🙂)
0. Setup GERBIL
First, you should clone the master branch and run the start.sh script once. It will ask you whether you want to download indexes; this is not necessary for this task. Once the script pauses and the GERBIL web service is running, the script should have downloaded all the data that is needed, including the ERD dataset. You can stop the web service or let it run, since we may want to use it later on to test the generated file.
1. Run the lookup once
That is quite easy, since the lookup itself is already implemented. However, because of the check that I would like to include in the workflow, the class that we are going to write has to go into the test part of the project. So, within the src/test/java part of the Maven project, we create a class org.aksw.gerbil.tools.Erd2NifTransformation with a main method that creates an instance of this class and calls a method run; the latter contains the majority of the implementation. We start by loading the ERD dataset, which we can easily do using the provided classes:
ERDDataset2 dataset = new ERDDataset2("gerbil_data/datasets/erd2014/Trec_beta.query.txt",
        "gerbil_data/datasets/erd2014/Trec_beta.annotation.txt");
dataset.init();
After that, the dataset object contains the complete ERD dataset including the retrieved DBpedia IRIs.
Please also add the @Ignore annotation to the class, since it won't implement any test case.
2. Store the dataset as NIF
NIFWriter writer = new TurtleNIFWriter();
String nifString = writer.writeNIF(dataset.getInstances());
The nifString contains the data as NIF and can be written to a file, e.g., erd2014.ttl.
This file should then be moved (or written directly) into the directory gerbil_data/datasets/erd2014/.
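Writing the string to disk only needs the standard library. Here is a minimal, untested sketch; the helper class NifFileWriter and its method name are made up for illustration, and only the target path comes from the instructions above:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NifFileWriter {

    // Writes the serialized NIF data to the given file using UTF-8,
    // creating parent directories if necessary.
    public static void writeNif(String nifString, String targetFile) throws Exception {
        Path target = Paths.get(targetFile);
        if (target.getParent() != null) {
            Files.createDirectories(target.getParent());
        }
        Files.write(target, nifString.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // In the transformation class, nifString would come from
        // writer.writeNIF(dataset.getInstances()).
        String nifString = "# serialized NIF data";
        writeNif(nifString, "gerbil_data/datasets/erd2014/erd2014.ttl");
    }
}
```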
3. Test the NIF data
We want to ensure that the generated file is correct. That is the reason why we put the class into the test part of the project 🙂
So, directly after generating the String containing the NIF data, we can parse it again to get back the list of documents representing the data we wrote.
NIFParser parser = new TurtleNIFParser();
List<Document> nifDocuments = parser.parseNIF(nifString);
Then, we let our newly implemented class extend the class AbstractExperimentTaskTest. Next, we define an experiment (which is exactly what GERBIL has been made for 🙂) and execute it, expecting an F1-score of 1.0.
// Create a new experiment configuration
ExperimentTaskConfiguration configuration = new ExperimentTaskConfiguration(
        // We use the created NIF documents as the results of an annotator
        new TestAnnotatorConfiguration(nifDocuments, ExperimentType.A2KB),
        // We use a new InstanceListBasedDataset instead of the "original" ERD dataset
        // object. The original dataset object would rerun the whole preprocessing.
        new InstanceListBasedDataset(dataset.getInstances(), ExperimentType.A2KB),
        // We run an A2KB experiment with strong annotation matching
        ExperimentType.A2KB, Matching.STRONG_ANNOTATION_MATCH);
int experimentTaskId = 1;
double[] expectedResults = new double[] { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0 };
SimpleLoggingResultStoringDAO4Debugging experimentDAO = new SimpleLoggingResultStoringDAO4Debugging();
runTest(experimentTaskId, experimentDAO, new EvaluatorFactory(), configuration,
        new F1MeasureTestingObserver(this, experimentTaskId, experimentDAO, expectedResults));
If this test passes, the majority of the work is already done 🙂
Hi @MichaelRoeder 🙂
I followed the suggested approach (running the lookup once, storing the ERD dataset as NIF using TurtleNIFWriter, and validating against the original dataset as described). The NIF file is being generated and parsed back correctly, but the validation step still gives an F1-score of 0.9765 instead of the expected 1.0.
I've checked the implementation against the instructions and confirmed that the lookup and writing steps are working as intended.
Could you please advise what additional steps or adjustments we should try to achieve a perfect match?
Thanks!
Can you please push your changes to a separate branch? It is much easier to make suggestions if the code is shared 🙂
Sure, I've pushed my changes to a separate branch here: https://github.com/dice-group/gerbil/tree/erd-nif-fix
The class seems to do what it should 🙂
If we take a look at the log, we can see the following interesting lines:
2025-08-19 11:44:15,381 [main] WARN [org.aksw.gerbil.io.nif.utils.NIFPositionHelper] - <Found an abnormal marking that has a letter directly behind it: "'condo's in florida">
2025-08-19 11:44:15,383 [main] WARN [org.aksw.gerbil.io.nif.utils.NIFPositionHelper] - <Found an abnormal marking that has a letter directly behind it: "'ritz carlto'n lake las vegas">
These two lines are not the exact source of the problem but show that the positions of two annotations are not fully correct. It could be worth investigating why this happens. I guess it is a problem in the original ERD dataset.
2025-08-19 11:45:01,415 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a true positive ((0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Kentucky)])).>
2025-08-19 11:45:01,415 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Florida)])).>
2025-08-19 11:45:01,415 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Minnesota)])).>
...
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a true positive ((0, 13, [http://dbpedia.org/resource/The_Music_Man])).>
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 13, [http://dbpedia.org/resource/The_Music_Man_(1962_film)])).>
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 13, [http://dbpedia.org/resource/The_Music_Man_(2003_film)])).>
...
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a true positive ((0, 17, [http://dbpedia.org/resource/Mary,_Mary,_Quite_Contrary])).>
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(musical)])).>
2025-08-19 11:45:01,422 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(1993_film)])).>
2025-08-19 11:45:01,422 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(1949_film)])).>
2025-08-19 11:45:01,422 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(1987_film)])).>
These are the three documents in which false negatives are identified. If you take a close look, you can see a pattern. For example, in the first set of messages, one East_Ridge_High_School IRI is found while the two other IRIs are not. Note that all three are located at the same position in the text (0, 22).
This already looks wrong, because it would mean that a single part of the text has three different meanings at the same time, which is not really allowed in GERBIL.
We can see that the NIF reader also reads it in this way in the following lines from the log:
2025-08-19 11:44:15,661 [Thread-1] DEBUG [org.aksw.gerbil.annotator.decorator.ErrorCountingAnnotatorDecorator] - <[Test-A2KB] result=[MeaningSpan(0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Minnesota), http://dbpedia.org/resource/East_Ridge_High_School_(Kentucky), http://dbpedia.org/resource/East_Ridge_High_School_(Florida)])]>
...
2025-08-19 11:44:15,663 [Thread-1] DEBUG [org.aksw.gerbil.annotator.decorator.ErrorCountingAnnotatorDecorator] - <[Test-A2KB] result=[MeaningSpan(0, 13, [http://dbpedia.org/resource/The_Music_Man, http://dbpedia.org/resource/The_Music_Man_(2003_film), http://dbpedia.org/resource/The_Music_Man_(1962_film)])]>
2025-08-19 11:44:15,663 [Thread-1] DEBUG [org.aksw.gerbil.annotator.decorator.ErrorCountingAnnotatorDecorator] - <[Test-A2KB] result=[MeaningSpan(0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(musical), http://dbpedia.org/resource/The_Secret_Garden_(1993_film), http://dbpedia.org/resource/Mary,_Mary,_Quite_Contrary, http://dbpedia.org/resource/The_Secret_Garden_(1987_film), http://dbpedia.org/resource/The_Secret_Garden_(1949_film)])]>
So for the NIF reader, there is only one marking with three or more IRIs, which all point to different named entities.
To summarize: the ERD dataset reader seems to have a problem that causes it to generate multiple annotations for a single position in the text. With your transformation, these faulty annotations are then transformed into valid RDF, and because NIF sees them as a single annotation (since they refer to the same part of the text), we get a single annotation with all the IRIs. Then, during the evaluation, this leads to a lower recall, because the original dataset contains more annotations than the one that we got from the NIF file.
Long story short: we should double-check the ERD dataset and the reader to fix the problem at the source 🙂
Okay, it is an issue in the dataset... 🙂
From that perspective, your NIF file is even better than the original ERD dataset.
So, the file that you have generated is correct as it is 🙂
Next steps:
4. Change the ERD dataset definition
You can find the definitions of the "well-known" datasets like ERD in the datasets.properties file.
You want to change the ERD definition to use the file-based NIF dataset classes instead of the old implementation. The OKE dataset definitions are a good example of how it should look.
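As a rough, hedged sketch of what such an entry could look like: the dataset ID, key names, class name, and path placeholder below are modeled on the OKE entries and must be checked against the actual datasets.properties before committing.

```properties
# Hypothetical ERD entry using the file-based NIF dataset class;
# verify every key and value against the existing OKE definitions.
org.aksw.gerbil.datasets.definition.ERD2014.name=ERD2014
org.aksw.gerbil.datasets.definition.ERD2014.class=org.aksw.gerbil.dataset.impl.nif.FileBasedNIFDataset
org.aksw.gerbil.datasets.definition.ERD2014.experimentType=A2KB
org.aksw.gerbil.datasets.definition.ERD2014.cacheable=true
org.aksw.gerbil.datasets.definition.ERD2014.constructorArgs=${org.aksw.gerbil.DataPath}/datasets/erd2014/erd2014.ttl
```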
5. Mark the old classes as deprecated
Mark the classes in the erd package with the @Deprecated annotation and add a Javadoc comment (using the @deprecated tag) to explain why they are outdated.
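A deprecation could look like the following sketch (the Javadoc wording is only a suggestion, and the class body is elided here; the real classes keep their existing implementation):

```java
/**
 * Old reader for the ERD 2014 dataset.
 *
 * @deprecated This reader ran a time-consuming Freebase-to-DBpedia lookup
 *             every time the dataset was loaded. The dataset is now read
 *             from a pregenerated NIF file, so this class is outdated.
 */
@Deprecated
public class ERDDataset2 {
    // existing implementation remains unchanged
}
```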
6. Add javadoc comments
Your newly created classes contain some comments in the code (good!), but the classes themselves and their methods should also have Javadoc comments (unless their meaning is very obvious, e.g., the main method or simple getter and setter methods).
Hello @MichaelRoeder,
I've updated the dataset definition to use the NIF file, deprecated the old ERD classes, and added Javadoc. Here's the updated branch: erd-nif-fix.