
Replace a time-consuming lookup for the ERD dataset

Open • MichaelRoeder opened this issue 1 year ago • 8 comments

Problem

At the moment, the ERD dataset has a time-consuming lookup operation to transform freebase IRIs to DBpedia IRIs. See https://github.com/dice-group/gerbil/blob/master/src/main/java/org/aksw/gerbil/dataset/impl/erd/ERDDataset2.java#L100

Solution

There are two possible solutions:

  1. The easiest solution would be to run the lookup once and store the dataset with the retrieved DBpedia IRIs (e.g., as NIF file).
  2. Make use of a sameAs lookup service.

MichaelRoeder avatar Aug 16 '24 10:08 MichaelRoeder

Hello @MichaelRoeder, I'd like to work on this issue. Could you provide me with more details?

Ajaykumarchawla avatar Aug 08 '25 08:08 Ajaykumarchawla

Hi 🙂

Let's go with solution 1 since solution 2 is more or less blocked at the moment.

(Note: the code below has not been tested... so expect some surprises on the way 😉)

0. Setup GERBIL

First, clone the master branch and run the start.sh script once. It will ask whether you want to download indexes; that is not necessary for this task. Once the script pauses and the GERBIL web service is running, it should have downloaded all the data that is needed, including the ERD dataset. You can stop the web service or let it run, since we may want to use it later on to test the generated file.

1. Run the lookup once

That is quite easy, since the lookup itself is already implemented. However, because of the check that I would like to include in the workflow, the class that we are going to write has to go into the test part of the project. So within the src/test/java part of the Maven project, we create a class org.aksw.gerbil.tools.Erd2NifTransformation with a main method that creates an instance of this class and calls a method run, which holds the majority of the implementation. We start by loading the ERD dataset, which is easy with the provided classes:

ERDDataset2 dataset = new ERDDataset2("gerbil_data/datasets/erd2014/Trec_beta.query.txt",
        "gerbil_data/datasets/erd2014/Trec_beta.annotation.txt");
dataset.init();

After that, the dataset object contains the complete ERD dataset including the retrieved DBpedia IRIs.

Please also add the @Ignore annotation to the class since it won't implement any test case.
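The overall shape of the class described above might look like the following sketch. The GERBIL-specific calls (and the JUnit @Ignore annotation) are left as comments, since they require the project classpath; in the real class they would replace those comments:

```java
// Sketch (untested) of the structure of the transformation class:
// a main method that creates an instance and calls run(), which will
// hold the actual implementation.
// @Ignore  // JUnit annotation, so the class is not picked up as a test
public class Erd2NifTransformation {

    public void run() throws Exception {
        // Load the ERD dataset with the already-implemented lookup:
        // ERDDataset2 dataset = new ERDDataset2(
        //         "gerbil_data/datasets/erd2014/Trec_beta.query.txt",
        //         "gerbil_data/datasets/erd2014/Trec_beta.annotation.txt");
        // dataset.init();
        // ... write the NIF file and run the check (see the next steps) ...
    }

    public static void main(String[] args) throws Exception {
        new Erd2NifTransformation().run();
    }
}
```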

2. Store the dataset as NIF

NIFWriter writer = new TurtleNIFWriter();
String nifString = writer.writeNIF(dataset.getInstances());

The nifString contains the data as NIF and can be written to a file, e.g., erd2014.ttl.

This file should then be moved (or directly written) into the directory gerbil_data/datasets/erd2014/
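Writing the string to the target file can be done with plain java.nio. A minimal sketch; the helper name writeNif and the demo content are mine, only the target path comes from this thread:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NifFileWriter {

    // Writes the serialized NIF data to the given file, creating parent
    // directories if necessary, and returns the written path.
    public static Path writeNif(String nifString, String targetFile) throws IOException {
        Path path = Paths.get(targetFile);
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        return Files.write(path, nifString.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // In the real tool, nifString would come from writer.writeNIF(...).
        String nifString = "# demo NIF content\n";
        System.out.println(writeNif(nifString, "gerbil_data/datasets/erd2014/erd2014.ttl"));
    }
}
```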

3. Test the NIF data

We want to ensure that the generated file is correct. That is the reason why we put the class into the test part of the project 😉

So directly after generating the String containing the NIF data, we can parse this data again to get back the list of documents representing our data.

NIFParser parser = new TurtleNIFParser();
List<Document> nifDocuments = parser.parseNIF(nifString);

Then, we have to let our newly implemented class extend the class AbstractExperimentTaskTest. Next, we define an Experiment (which is exactly what GERBIL has been made for 😉) and execute it, with the expectation that we get an F1-score of 1.0.

// Create new experiment configuration
ExperimentTaskConfiguration configuration = new ExperimentTaskConfiguration(
    // We use the created NIF documents as results of an annotator
    new TestAnnotatorConfiguration(nifDocuments, ExperimentType.A2KB),
    // We use a new InstanceListBasedDataset instead of the "original" ERD dataset object,
    // since the original dataset object would rerun the whole preprocessing.
    new InstanceListBasedDataset(dataset.getInstances(), ExperimentType.A2KB),
    // We look at an A2KB experiment and we want to have strong annotation matching
    ExperimentType.A2KB, Matching.STRONG_ANNOTATION_MATCH);

int experimentTaskId = 1;
double[] expectedResults = new double[] { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0 };
SimpleLoggingResultStoringDAO4Debugging experimentDAO = new SimpleLoggingResultStoringDAO4Debugging();
runTest(experimentTaskId, experimentDAO, new EvaluatorFactory(), configuration,
        new F1MeasureTestingObserver(this, experimentTaskId, experimentDAO, expectedResults));

If this test passes, the majority of the work is already done 🙂

MichaelRoeder avatar Aug 08 '25 12:08 MichaelRoeder

Hi @MichaelRoeder 🙂

I followed the suggested approach (running the lookup once, storing the ERD dataset as NIF using TurtleNIFWriter, and validating against the original dataset as described). The NIF file is being generated and parsed back correctly, but the validation step still gives an F1-score of 0.9765 instead of the expected 1.0.

I’ve checked the implementation against the instructions and confirmed that the lookup and writing steps are working as intended.

Could you please advise what additional steps or adjustments we should try to achieve the perfect match?

Thanks!

Ajaykumarchawla avatar Aug 14 '25 09:08 Ajaykumarchawla

Can you please push your changes into a separate branch? It is much easier to make suggestions if the code is shared 😉

MichaelRoeder avatar Aug 15 '25 08:08 MichaelRoeder

Sure, I’ve pushed my changes to a separate branch here: https://github.com/dice-group/gerbil/tree/erd-nif-fix

Ajaykumarchawla avatar Aug 15 '25 10:08 Ajaykumarchawla

The class seems to do what it should 👍

If we take a look at the log, we can see the following interesting lines:

2025-08-19 11:44:15,381 [main] WARN [org.aksw.gerbil.io.nif.utils.NIFPositionHelper] - <Found an abnormal marking that has a letter directly behind it: "'condo's in florida">
2025-08-19 11:44:15,383 [main] WARN [org.aksw.gerbil.io.nif.utils.NIFPositionHelper] - <Found an abnormal marking that has a letter directly behind it: "'ritz carlto'n lake las vegas">

These two lines are not the exact source of the problem but show that the positions of two annotations are not fully correct. It could be worth investigating why this happens. I guess it is a problem in the original ERD dataset.

2025-08-19 11:45:01,415 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a true positive ((0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Kentucky)])).>
2025-08-19 11:45:01,415 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Florida)])).>
2025-08-19 11:45:01,415 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Minnesota)])).>
...
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a true positive ((0, 13, [http://dbpedia.org/resource/The_Music_Man])).>
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 13, [http://dbpedia.org/resource/The_Music_Man_(1962_film)])).>
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 13, [http://dbpedia.org/resource/The_Music_Man_(2003_film)])).>
...
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a true positive ((0, 17, [http://dbpedia.org/resource/Mary,_Mary,_Quite_Contrary])).>
2025-08-19 11:45:01,421 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(musical)])).>
2025-08-19 11:45:01,422 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(1993_film)])).>
2025-08-19 11:45:01,422 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(1949_film)])).>
2025-08-19 11:45:01,422 [Thread-1] DEBUG [org.aksw.gerbil.matching.impl.MatchingsCounterImpl] - <Found a false negative ((0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(1987_film)])).>

These are the three documents in which false negatives are identified. If you take a close look, you can see a pattern. For example, in the first set of messages, one IRI with East_Ridge_High_School is found while two other IRIs are not. Note that all three should have been at the same position in the text (0, 22).

This already looks wrong, because it would mean that a single part of the text has three different meanings at the same time, which is not really allowed in GERBIL.

We can see that the NIF reader also reads it in this way in the following lines from the log:

2025-08-19 11:44:15,661 [Thread-1] DEBUG [org.aksw.gerbil.annotator.decorator.ErrorCountingAnnotatorDecorator] - <[Test-A2KB] result=[MeaningSpan(0, 22, [http://dbpedia.org/resource/East_Ridge_High_School_(Minnesota), http://dbpedia.org/resource/East_Ridge_High_School_(Kentucky), http://dbpedia.org/resource/East_Ridge_High_School_(Florida)])]>
...
2025-08-19 11:44:15,663 [Thread-1] DEBUG [org.aksw.gerbil.annotator.decorator.ErrorCountingAnnotatorDecorator] - <[Test-A2KB] result=[MeaningSpan(0, 13, [http://dbpedia.org/resource/The_Music_Man, http://dbpedia.org/resource/The_Music_Man_(2003_film), http://dbpedia.org/resource/The_Music_Man_(1962_film)])]>
2025-08-19 11:44:15,663 [Thread-1] DEBUG [org.aksw.gerbil.annotator.decorator.ErrorCountingAnnotatorDecorator] - <[Test-A2KB] result=[MeaningSpan(0, 17, [http://dbpedia.org/resource/The_Secret_Garden_(musical), http://dbpedia.org/resource/The_Secret_Garden_(1993_film), http://dbpedia.org/resource/Mary,_Mary,_Quite_Contrary, http://dbpedia.org/resource/The_Secret_Garden_(1987_film), http://dbpedia.org/resource/The_Secret_Garden_(1949_film)])]>

So for the NIF reader, there is only one marking with three or more IRIs, which all point to different named entities.

To summarize: the ERD dataset reader seems to have a problem that causes it to generate multiple annotations for a single position in the text. Your transformation turns these faulty annotations into valid RDF, and because NIF sees them as a single annotation (since they refer to the same part of the text), we get a single annotation with all the IRIs. During the evaluation, this leads to a lower recall, because the original dataset contains more annotations than the one that we got from the NIF file.

Long story short: we should double-check the ERD dataset and the reader to fix the problem at the source 😉
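One quick way to confirm the diagnosis is to scan the positions produced by the reader for duplicates. A minimal, self-contained sketch; Span is a simplified stand-in I made up for the (start, length) position of a GERBIL marking:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DuplicateSpanCheck {

    // Simplified stand-in for the (start, length) position of a marking.
    record Span(int start, int length) {}

    // Returns every span that occurs more than once, i.e. a position in
    // the text that carries several separate annotations.
    static Set<Span> findDuplicates(List<Span> spans) {
        Set<Span> seen = new HashSet<>();
        Set<Span> duplicates = new LinkedHashSet<>();
        for (Span s : spans) {
            if (!seen.add(s)) {
                duplicates.add(s);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Mirrors the log above: three annotations at position (0, 22).
        List<Span> spans = List.of(new Span(0, 22), new Span(0, 22), new Span(0, 22));
        System.out.println(findDuplicates(spans));
    }
}
```

Running such a check over all documents loaded by ERDDataset2 would pinpoint exactly which queries carry several annotations at the same position.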

MichaelRoeder avatar Aug 19 '25 10:08 MichaelRoeder

Okay, it is an issue of the dataset... 😕

From that perspective, your NIF file is even better than the original ERD dataset 🤔

So, the file that you have generated is correct as it is 🙂 👍

Next steps:

4. Change the ERD dataset definition

You can find the definitions of the "well-known" datasets like ERD in the datasets.properties file.

You want to change the ERD definition to use the file-based NIF dataset classes instead of the old implementation. The OKE dataset definitions are a good example of how it should look.
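An untested sketch of what the new entry might look like; the key names, the ERD2014 identifier, and the FileBasedNIFDataset class name are assumptions modeled on the OKE entries, so please check them against the actual datasets.properties before copying:

```properties
org.aksw.gerbil.datasets.definition.ERD2014.name=ERD 2014
org.aksw.gerbil.datasets.definition.ERD2014.class=org.aksw.gerbil.dataset.impl.nif.FileBasedNIFDataset
org.aksw.gerbil.datasets.definition.ERD2014.experimentType=A2KB
org.aksw.gerbil.datasets.definition.ERD2014.constructorArgs=gerbil_data/datasets/erd2014/erd2014.ttl
```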

5. Mark the old classes as deprecated

Mark the classes in the erd package with the @Deprecated annotation and add a Javadoc @deprecated comment to explain why they are outdated.
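For illustration (the class name here is invented), the usual pattern combines the Javadoc @deprecated tag, which documents the reason, with the @Deprecated annotation, which the compiler and tooling pick up:

```java
/**
 * Example of the deprecation pattern described above.
 *
 * @deprecated The ERD data is now shipped as a pre-transformed NIF file,
 *             so this reader is no longer used by the dataset definition.
 */
@Deprecated
public class OldErdReaderExample {

    public static void main(String[] args) {
        // @Deprecated has runtime retention, so the marker can be checked:
        System.out.println(OldErdReaderExample.class.isAnnotationPresent(Deprecated.class)); // prints true
    }
}
```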

6. Add javadoc comments

Your newly created classes contain some comments in the code (good!), but the classes themselves and their individual methods should have Javadoc comments (unless their meaning is very obvious, e.g., the main method or simple getter and setter methods).

MichaelRoeder avatar Aug 19 '25 10:08 MichaelRoeder

Hello @MichaelRoeder,

I’ve updated the dataset definition to use the NIF file, deprecated the old ERD classes, and added Javadoc. Here’s the updated branch: erd-nif-fix.

Ajaykumarchawla avatar Aug 22 '25 08:08 Ajaykumarchawla