indextank-engine Embedded API raises exception during search after restart

If I start the embedded API and index some documents, I can query them. But after I stop and restart the embedded API, querying for any document that was previously in the index makes it throw the IndextankException below. Searching for a term that was never in the index returns a correct JSON result with zero matches.

I am using the default sample-engine-config and running on OS X.

Is there something I'm doing wrong here? Do I have to do something to trigger a reload of the previously indexed documents?

/var/www/indextank/indextank-engine$ java -cp target/indextank-engine-1.0.0-jar-with-dependencies.jar com.flaptor.indextank.api.Launcher 
WARN  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [log4j.properties not found on classpath!] 2012-01-15 12:56:30,351
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'environment-prefix' set to TEST] 2012-01-15 12:56:30,359
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'facets' set to true] 2012-01-15 12:56:30,359
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'index-code' set to dbajo] 2012-01-15 12:56:30,359
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Command line option 'conf-file' set to sample-engine-config] 2012-01-15 12:56:30,365
INFO  [main] com.flaptor.indextank.suggest.NewPopularityIndex - [Loading popularity index terms from disk.] 2012-01-15 12:56:30,724
INFO  [main] com.flaptor.indextank.suggest.NewPopularityIndex - [Terms loaded] 2012-01-15 12:56:30,725
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Index recovery configuration set to recover index from simpleDB] 2012-01-15 12:56:30,725
INFO  [main] com.flaptor.indextank.index.storage.InMemoryStorage - [Starting a new(empty) InMemoryStorage.] 2012-01-15 12:56:30,726
INFO  [main] com.flaptor.indextank.api.EmbeddedIndexEngine - [Using in-memory storage] 2012-01-15 12:56:30,727
INFO  [main] org.eclipse.jetty.util.log - [jetty-7.x.y-SNAPSHOT] 2012-01-15 12:56:30,790
INFO  [main] org.eclipse.jetty.util.log - [started o.e.j.s.ServletContextHandler{/,null}] 2012-01-15 12:56:30,821
INFO  [main] org.eclipse.jetty.util.log - [Started [email protected]:20220 STARTING] 2012-01-15 12:56:30,849
IndextankException(message:null)
    at com.flaptor.indextank.api.IndexEngineApi.search(IndexEngineApi.java:94)
    at com.flaptor.indextank.api.resources.Search.run(Search.java:79)
    at com.ghosthack.turismo.servlet.Servlet.service(Servlet.java:55)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:538)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:478)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:937)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:183)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:871)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
    at org.eclipse.jetty.server.Server.handle(Server.java:346)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:589)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:601)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:214)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:411)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529)
    at java.lang.Thread.run(Thread.java:637)

Jan 15 '12 17:01 lsemel

I have a similar problem. After a restart, a plain search for previously indexed documents, e.g. curl "http://localhost:20220/v1/indexes/idx/search?q=ipsum", returns a correct result. But if I try an advanced search, such as curl "http://localhost:20220/v1/indexes/idx/search?q=ipsum&snippet=text", I get "Service unavailable" and the following exception in IndexTank:

com.flaptor.indextank.api.IndexEngineApiException: java.lang.NullPointerException
    at com.flaptor.indextank.api.IndexEngineApi.search(IndexEngineApi.java:90)
    at com.flaptor.indextank.api.resources.Search.run(Search.java:76)
    at com.ghosthack.turismo.servlet.Servlet.service(Servlet.java:55)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:538)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:478)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:937)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:183)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:871)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110)
    at org.eclipse.jetty.server.Server.handle(Server.java:346)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:589)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:601)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:214)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:411)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.NullPointerException
    at java.io.ByteArrayInputStream.<init>(ByteArrayInputStream.java:106)
    at com.flaptor.indextank.index.storage.DocumentBinaryStorage.decompress(DocumentBinaryStorage.java:98)
    at com.flaptor.indextank.index.storage.DocumentBinaryStorage.getDocument(DocumentBinaryStorage.java:70)
    at com.flaptor.indextank.search.SnippetSearcher.search(SnippetSearcher.java:96)
    at com.flaptor.indextank.api.IndexEngineApi.search(IndexEngineApi.java:83)
    ... 22 more

Jun 08 '12 09:06 MikeQG

Does anyone have a workaround for this issue? Reindexing the whole dataset after every restart is a waste of resources and hurts availability.

Jun 08 '12 09:06 myroslav

I think I have tracked down the cause of this.

The problem occurs because IndexTank expects the InMemoryStorage instance to be in a particular state after startup. Depending on how the engine is bootstrapped, however, it may not have been initialized correctly.

When starting an instance of the EmbeddedIndexEngine you MUST specify the parameter:

--load-state true

for example:

final String base = realPath + "/indextank/";
final String [] params = new String[]{
        "--facets", 
        "--rti-size", "500", 
        "--conf-file", realPath + "/sample-engine-config", 
        "--port", Configuration.port + "", // indexer port+1, searcher port+2, suggestor port+3
        "--environment-prefix", "UTOPIO", 
        "--recover", 
        "--dir", base, 
        "--load-state", "true", 
        "--snippets", 
        "--suggest", "documents", 
        "--boosts", "3", 
        "--index-code", Configuration.indexCode, 
        "--functions", "0:-age", 
        };
new File(base).mkdirs();
engine = EmbeddedIndexEngine.instantiate(params);

However, I did notice that if I had an index that was already in a "bad" state, providing this parameter resulted in a lot of other errors. It seems that if you start from scratch with this parameter, it's all good.

Unfortunately, this seems like an incredibly flaky and unreliable situation. If for any reason the InMemoryStorage instance fails to load successfully, your whole index is basically useless.

I'm still tracing through the code to try to work out how this can be made more robust. Of course, it may be that it IS a robust solution and I'm just missing the point.

Jun 20 '12 02:06 jasonpolites

Follow up...

The default implementation of the EmbeddedIndexEngine seems to only allow the use of this InMemoryStorage instance:

Snippet from EmbeddedIndexEngine

switch (storageValue) {
    case RAM:
        storage = new InMemoryStorage(baseDir, load);
        logger.info("Using in-memory storage");
        break;
    case NO:
        storage = null;
        logger.info("NOT Using storage");
        break;
}

I'm assuming IndexTank uses this document storage to keep a complete copy of each original document that was indexed, presumably because the underlying Lucene instance has been instructed to index document fields without storing them. Index-only seems a sensible option, but I would also assume that in almost all cases the user of IndexTank already has a document storage system and does not need IndexTank to manage one itself.

Unfortunately, there also does not seem to be an easy way to instruct the engine NOT to use storage. Despite the snippet above, the engine also has this:

StorageValues storageValue = StorageValues.RAM;
int bdbCache = 0;
if (line.hasOption("storage")){
    String storageType = line.getOptionValue("storage");
    if ("bdb".equals(storageType)) {
        storageValue = StorageValues.BDB;
        bdbCache = Integer.parseInt(line.getOptionValue("bdb-cache", String.valueOf(DEFAULT_BDB_CACHE)));
    } else if ("cassandra".equals(storageType)) {
        storageValue = StorageValues.CASSANDRA;
    } else if ("ram".equals(storageType)) {
        storageValue = StorageValues.RAM;
    } else {
        throw new IllegalArgumentException("storage has to be 'cassandra', 'bdb' or 'ram'. '" + storageType + "' given.");
    }
}

Of course, none of these other values will ever actually work, because the switch in the first snippet only handles RAM and NO.
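
To make the mismatch concrete, here is a minimal, self-contained sketch of the two snippets side by side. It is a simplified stand-in, not the real EmbeddedIndexEngine code: the class name StorageSelectionSketch, the methods parse and select, and the use of a String in place of the storage object are all made up for illustration, and I'm assuming the storage variable starts out as null.

```java
// Simplified stand-in for the two snippets quoted above.
// NOT the real EmbeddedIndexEngine code; all names here are hypothetical.
class StorageSelectionSketch {

    enum StorageValues { RAM, NO, BDB, CASSANDRA }

    // Mirrors the option parsing: "bdb", "cassandra" and "ram" are accepted,
    // anything else is rejected. Note that nothing here ever yields NO.
    static StorageValues parse(String storageType) {
        if ("bdb".equals(storageType)) {
            return StorageValues.BDB;
        } else if ("cassandra".equals(storageType)) {
            return StorageValues.CASSANDRA;
        } else if ("ram".equals(storageType)) {
            return StorageValues.RAM;
        }
        throw new IllegalArgumentException(
                "storage has to be 'cassandra', 'bdb' or 'ram'. '" + storageType + "' given.");
    }

    // Mirrors the switch: only RAM and NO have a case, so BDB and CASSANDRA
    // match nothing and leave the variable at its initial value.
    // Assumption: the initial value is null; a String stands in for the storage object.
    static String select(StorageValues value) {
        String storage = null;
        switch (value) {
            case RAM:
                storage = "InMemoryStorage";
                break;
            case NO:
                storage = null;
                break;
        }
        return storage;
    }

    public static void main(String[] args) {
        for (String option : new String[] {"ram", "bdb", "cassandra"}) {
            System.out.println(option + " -> " + select(parse(option)));
        }
    }
}
```

In this sketch "bdb" and "cassandra" pass validation but end up with the same result as NO, while NO itself can never be requested through the option.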

Confusing...

Jun 20 '12 03:06 jasonpolites

I need help with this issue. I edited the file and added --load-state true, but when I start the service I get this message: Starting a new(empty) InMemoryStorage. Load was requested but no file was found.

I just need to start my index, add documents, stop it, start it again, and be able to search the previously added documents. Help!

Oct 01 '12 22:10 santiagovillegasg

Please keep in mind the IndexTank-engine was created as a way to make IndexTank easy to use as it was open-sourced. Originally it was part of IndexTank-service, and as such the recovery was provided by the "LogStorage" (a component of the service which is absent in the stand-alone engine) and indexes were killed and respawned routinely by the "Nebu" component, transparently for the user. You can still use this setup if you want to venture in that direction.

So the stand-alone engine needs to be fed all the documents again after restart. The recovery time typically depends on the speed of the data source, as the engine can take documents much faster than a normal disk-based source can spew them. But unless you have a really large number of documents, it should take only a few seconds.
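
A rough sketch of what re-feeding after a restart could look like from a Java client, using only the JDK. The host and port come from the curl examples earlier in this thread; the PUT /v1/indexes/<name>/docs endpoint and the {"docid": ..., "fields": {...}} body are assumptions based on the public IndexTank REST API, so check them against your build. The class ReindexOnStartupSketch and its methods are made-up names, and where the documents come from is up to your own data source.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for re-feeding documents to a freshly restarted engine.
class ReindexOnStartupSketch {

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the request body. The {"docid": ..., "fields": {...}} shape is an assumption.
    static String toJson(String docId, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"docid\":\"").append(escape(docId)).append("\",\"fields\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}}").toString();
    }

    // Sends one document and returns the HTTP status code.
    // The /docs path is an assumption modelled on the /search URL used in this thread.
    static int putDocument(String baseUrl, String indexName, String docId,
                           Map<String, String> fields) throws IOException {
        URL url = new URL(baseUrl + "/v1/indexes/" + indexName + "/docs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            byte[] body = toJson(docId, fields).getBytes(StandardCharsets.UTF_8);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only prints the payload; no request is sent from here.
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("text", "lorem ipsum");
        System.out.println(toJson("one", fields));
    }
}
```

After the engine is up, you would loop over your own data source and call something like putDocument("http://localhost:20220", "idx", id, fields) for each document before sending searches to it.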

Oct 01 '12 23:10 jhandl