File-based cache can slow down experiments

Open MichaelRoeder opened this issue 2 years ago • 3 comments

Problem

Once the file-based sameAs cache grows large, serializing it takes a noticeable amount of time. During serialization, the writing thread holds all semaphore permits of the cache, so no other thread can use the cache until the write has finished.

In the status dump below, the first thread is blocked in Semaphore.acquire because the second thread is writing the cache to a file.

eTConfig(XXX,XXX,"QA","STRONG_ENTITY_MATCH")
state=WAITING
progress=null
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever.retrieveSameURIs(FileBasedCachingSameAsRetriever.java:116)
org.aksw.gerbil.semantic.sameas.impl.AbstractSameAsRetrieverDecorator.addSameURIs(AbstractSameAsRetrieverDecorator.java:43)
org.aksw.gerbil.semantic.sameas.SameAsRetrieverUtils.addSameURIsToMeanings(SameAsRetrieverUtils.java:39)
org.aksw.gerbil.semantic.sameas.SameAsRetrieverUtils.addSameURIsToMarkings(SameAsRetrieverUtils.java:32)
org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getPreparedDataset(AbstractDatasetConfiguration.java:100)
org.aksw.gerbil.dataset.AbstractDatasetConfiguration.getDataset(AbstractDatasetConfiguration.java:74)
org.aksw.gerbil.execute.ExperimentTask.run(ExperimentTask.java:122)
org.aksw.simba.topicmodeling.concurrent.workers.WorkerImpl.run(WorkerImpl.java:44)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

eTConfig(XXX,XXX,"QA","STRONG_ENTITY_MATCH")
state=RUNNABLE
progress=100.0% of dataset
java.io.FileOutputStream.writeBytes(Native Method)
java.io.FileOutputStream.write(FileOutputStream.java:326)
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1890)
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1108)
java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever.performCacheStorage(FileBasedCachingSameAsRetriever.java:258)
org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever.requestUri(FileBasedCachingSameAsRetriever.java:184)
org.aksw.gerbil.semantic.sameas.impl.cache.FileBasedCachingSameAsRetriever.retrieveSameURIs(FileBasedCachingSameAsRetriever.java:135)
org.aksw.gerbil.semantic.sameas.impl.AbstractSameAsRetrieverDecorator.addSameURIs(AbstractSameAsRetrieverDecorator.java:43)
org.aksw.gerbil.execute.ExperimentTask.runExperiment(ExperimentTask.java:560)
org.aksw.gerbil.execute.ExperimentTask.run(ExperimentTask.java:167)
org.aksw.simba.topicmodeling.concurrent.workers.WorkerImpl.run(WorkerImpl.java:44)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
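
The blocking visible in the two traces can be reproduced in miniature. The following is only a sketch of the locking pattern described above, not the code of FileBasedCachingSameAsRetriever: the class name, the permit count and the method names are assumptions made for illustration. It also uses the non-blocking tryAcquire() so the effect can be observed, whereas the real retriever waits in acquire().

```java
import java.util.concurrent.Semaphore;

class AllPermitsSketch {
    // Hypothetical permit count; the value used by GERBIL is not shown in this issue.
    private static final int MAX_PERMITS = 4;
    private final Semaphore permits = new Semaphore(MAX_PERMITS);

    /** A reader needs one permit; returns false if none is available right now. */
    boolean tryRead() {
        if (!permits.tryAcquire()) {
            return false;
        }
        try {
            return true; // a cache lookup would happen here
        } finally {
            permits.release();
        }
    }

    /** The writer takes every permit, so all readers are locked out until it finishes. */
    void store(Runnable slowWrite) {
        permits.acquireUninterruptibly(MAX_PERMITS);
        try {
            slowWrite.run(); // stands in for the serialization to disk
        } finally {
            permits.release(MAX_PERMITS);
        }
    }
}
```

While store is running, every call to tryRead fails; with a blocking acquire() the reader would instead park, as in the first trace.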

Solution

  • [ ] Improve the writing speed (if possible)
  • [ ] Allow reading operations while writing the cache to file
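
One possible way to approach the second item is to copy the cache while holding a lock only briefly and then serialize the copy without holding any lock. The sketch below assumes this snapshot approach; it is not taken from GERBIL, and the class SnapshotCacheSketch, its methods and the use of a ReentrantReadWriteLock instead of the existing semaphore are all hypothetical.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SnapshotCacheSketch {
    private final Map<String, Set<String>> cache = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Lookup under the read lock; many readers may hold it at the same time. */
    Set<String> get(String uri) {
        lock.readLock().lock();
        try {
            return cache.get(uri);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Stores an immutable copy, so a snapshot can safely share the value sets. */
    void put(String uri, Set<String> sameAs) {
        lock.writeLock().lock();
        try {
            cache.put(uri, Collections.unmodifiableSet(new HashSet<>(sameAs)));
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Copies the map under the read lock, then serializes the copy without any lock. */
    void store(OutputStream out) {
        HashMap<String, Set<String>> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new HashMap<>(cache); // shallow copy; values are immutable
        } finally {
            lock.readLock().unlock();
        }
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(snapshot); // the slow part no longer blocks readers or writers
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is an extra in-memory copy of the map for every write to disk, which costs time and memory proportional to the number of cached URIs. Whether that is acceptable for a large sameAs cache would have to be measured.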

MichaelRoeder, Oct 26 '22 14:10

https://github.com/RuedigerMoeller/fast-serialization might be an option

MichaelRoeder, Oct 26 '22 16:10

It seems that the low threshold for the number of changes allowed before the cache is written to disk caused GERBIL to write the cache at least once per minute. Making the threshold configurable and raising it to 100k substantially improved the runtime of GERBIL QA (https://github.com/dice-group/gerbil/commit/b2778732f3e8ed83eb5f8055150987a5bacf5ba5).
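
The effect of such a threshold can be sketched as a simple change counter. This is an illustration only and not the code of the linked commit: the class ChangeThresholdSketch, its constructor parameter and the Runnable that stands in for the cache serialization are hypothetical.

```java
class ChangeThresholdSketch {
    private final int maxChanges;        // hypothetical configurable threshold
    private final Runnable storeAction;  // stands in for writing the cache to disk
    private int changes = 0;

    ChangeThresholdSketch(int maxChanges, Runnable storeAction) {
        if (maxChanges < 1) {
            throw new IllegalArgumentException("maxChanges must be at least 1");
        }
        this.maxChanges = maxChanges;
        this.storeAction = storeAction;
    }

    /** Counts one cache change and triggers a write once the threshold is reached. */
    synchronized void recordChange() {
        changes++;
        if (changes >= maxChanges) {
            storeAction.run();
            changes = 0; // start counting towards the next write
        }
    }
}
```

With a threshold of 3, seven changes trigger two writes. A larger threshold means fewer writes, at the price of losing more unsaved entries if the process stops before the next write.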

MichaelRoeder, Oct 27 '22 16:10

We applied the same change to the master branch.

MichaelRoeder, May 15 '23 10:05