
aye

Open MparkG opened this issue 2 years ago • 5 comments

I joyfully began to update OSS from 21.01.03 to 21.12.26 via the deb package. It instantly breaks the Apache setup, because I happen to have another configuration for the web interface; I don't use the opensemanticsearch-django-webapps.conf in Apache's conf dir. Right, I can bypass that by manually disabling it again each time I update. Now the update continues, and the next step is to break the opensemanticsearch-entities index:

Caused by: org.apache.lucene.index.CorruptIndexException: codec mismatch: actual codec=Lucene50PostingsWriterDoc vs expected codec=Lucene84PostingsWriterDoc (resource=MMapIndexInput(path="/media/data_2/solr_data/opensemanticsearch-entities/data/index/_4_FST50_0.doc"))

Gladly I don't have a clue about Solr, and the suggestions on the internet for checking the index don't work, because the CheckIndex class that all the web posts refer to is nowhere to be found.

So I move the opensemanticsearch-entities core aside and attempt to let the installer deb create it anew. The script continues... and stops again:

{
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"SolrCore is loading",
    "code":503}}

The installer did not wait for the core to finish loading, so Solr answers with a 503 error while the core is still loading.
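A workaround for this kind of race is to poll Solr until it stops answering 503 before running the next step. The following is only a rough sketch of that idea, not what the installer does: `http_status` and `wait_until_loaded` are hypothetical helper names, and the limits are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for a GET on url (error codes included)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is what we want here.
        return err.code


def wait_until_loaded(probe, max_wait=300.0, interval=5.0, sleep=time.sleep):
    """Call probe() until it returns something other than 503.

    probe is any zero-argument callable returning an HTTP status code.
    Raises TimeoutError once max_wait seconds of sleeping have passed.
    """
    waited = 0.0
    while True:
        status = probe()
        if status != 503:
            return status
        if waited >= max_wait:
            raise TimeoutError(f"still 503 after {waited:.0f} seconds")
        sleep(interval)
        waited += interval
```

Assuming Solr listens on its default port, a call could look like `wait_until_loaded(lambda: http_status("http://localhost:8983/solr/opensemanticsearch-entities/select?q=*:*&rows=0"))`. The host, port and query are assumptions, and a refused connection (Solr not started at all) is not handled here.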

meh.

MparkG avatar Jan 24 '22 13:01 MparkG

Now that I have taken the deb apart and finished the installation manually, I get this:

java[1950828]: ERROR [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher Timeout task PARSE, millis elapsed 300091, timeoutMillis 300000, file id b'World History.pdf'consider increasing the allowable time with the <taskTimeoutMillis/> parameter or the X-Tika-Timeout-Millis header
Jan 27 22:22:34 mgp java[1950828]: WARN  [Thread-22] 22:22:34,199 org.apache.tika.server.core.ServerStatusWatcher forked process observed TIMEOUT and is shutting down.
Jan 27 22:22:34 mgp java[1950828]: INFO  [Thread-22] 22:22:34,214 org.apache.tika.server.core.ServerStatusWatcher Shutting down forked process with status: TIMEOUT
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Connection to Tika server (will retry in 120 seconds) failed. Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Jan 27 22:22:34 mgp etl_tasks[2349205]: [2022-01-27 22:22:34,677: WARNING/ForkPoolWorker-3] Retrying to connect to Tika server in 120 second(s).
Jan 27 22:22:34 mgp java[1929662]: INFO  [pool-2-thread-1] 22:22:34,678 org.apache.tika.server.core.TikaServerWatchDog forked process exited with exit value 3
Jan 27 22:22:36 mgp java[1961770]: INFO  [main] 22:22:36,867 org.apache.tika.server.core.TikaServerProcess Starting Apache Tika 2.2.1 server
Jan 27 22:22:37 mgp java[1961770]: INFO  [main] 22:22:37,014 org.apache.tika.server.core.TikaServerProcess Using custom config: /etc/tika/tika-config-fakecache.xml
Jan 27 22:22:37 mgp java[1961770]: INFO  [main] 22:22:37,897 org.apache.cxf.endpoint.ServerImpl Setting the server's publish address to be http://localhost:9999/

See https://github.com/opensemanticsearch/tika-server.deb/issues/10. I suppose the 300000 ms is hardcoded in the jar.

So adding headers['X-Tika-Timeout-Millis'] = "172800000" in /usr/lib/python3/dist-packages/opensemanticetl/enhance_extract_text_tika_server.py does not work; Tika refuses it because it is longer than what is in the server config.
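Given that observation, a per-request timeout only makes sense up to the server's own limit. A tiny sketch of that constraint follows; `tika_timeout_headers` is a made-up helper, not part of opensemanticetl, and the server limit has to be supplied by hand.

```python
def tika_timeout_headers(requested_ms, server_limit_ms):
    """Build the timeout header, capped at the server-side limit.

    Assumption (from the behaviour described above): the server rejects
    a header value larger than its configured task timeout.
    """
    effective = min(requested_ms, server_limit_ms)
    return {"X-Tika-Timeout-Millis": str(effective)}
```

To actually get a longer timeout, the server-side limit itself has to be raised.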

MparkG avatar Jan 27 '22 21:01 MparkG

While I'm at it, I also get:

java[2296792]: ERROR [qtp1803093683-30] 21:47:54,777 org.apache.pdfbox.contentstream.PDFStreamEngine Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

MparkG avatar Jan 28 '22 20:01 MparkG

Found where the Java Tika server is configured by looking at the example configuration from Apache. Set the timeout to 2 days. Now it works.

Still more errors, though:

Feb 23 19:12:35 mgp java[324631]: Feb 23, 2022 7:12:35 PM org.apache.pdfbox.jbig2.util.log.JDKLogger error
Feb 23 19:12:35 mgp java[324631]: SCHWERWIEGEND: No global segment added so far. Use JBIG2ImageReader.setGlobals().

MparkG avatar Feb 23 '22 18:02 MparkG

@MparkG Where did you find the file to set the timeout?

bsher-osi avatar May 04 '23 00:05 bsher-osi

@bsher-osi /etc/tika/tika-config-fakecache.xml:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <server>
    <params>
      <!-- maximum time to allow per parse before shutting down and restarting
          the forked parser. Not allowed if nofork=true. -->
      <taskTimeoutMillis>500000</taskTimeoutMillis>
    </params>
  </server>
  <parsers>
     ..... etcetera
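For anyone who prefers to script the change, here is a minimal sketch that rewrites `taskTimeoutMillis` in a config shaped like the one above, using the 2-day value mentioned earlier in the thread. `set_task_timeout` is a hypothetical helper; it assumes the `<server><params>` layout shown here and Python 3.8+ (needed to keep the XML comments).

```python
import xml.etree.ElementTree as ET

# 2 days in milliseconds: 2 * 24 h * 60 min * 60 s * 1000 ms
TWO_DAYS_MS = 2 * 24 * 60 * 60 * 1000


def set_task_timeout(xml_text, millis):
    """Return xml_text with <taskTimeoutMillis> set to millis."""
    # insert_comments=True keeps the <!-- ... --> notes inside the root element.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    params = root.find("./server/params")
    if params is None:
        raise ValueError("no <server><params> section found")
    node = params.find("taskTimeoutMillis")
    if node is None:
        # Element missing: add it rather than fail.
        node = ET.SubElement(params, "taskTimeoutMillis")
    node.text = str(millis)
    return ET.tostring(root, encoding="unicode")
```

The function works on a string, so reading and writing /etc/tika/tika-config-fakecache.xml is left to the caller. The returned text has no XML declaration, and the Tika server presumably needs a restart to pick up the new value.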

vsessink avatar Apr 19 '24 13:04 vsessink