nrtsearch

Different behaviors for start index in primary and replica

Open sarthakn7 opened this issue 4 years ago • 4 comments

These are the results when nrtsearch is started with restored state and start index is called:

  1. Primary: start index fails with a "no segments file found" exception (corrected from my original description, "index not saved or committed"). A subsequent start index with restore also fails, since the directories were already created.
  2. Replica: start index works and the index is started with 0 segments. The replica also did not appear to retrieve the segments from the primary after this.

sarthakn7, Jun 09 '20 18:06

@sarthakn7 I was not able to reproduce either of the behaviors you mention above.

These are the steps I tried to reproduce the scenarios you mention:

Replica

Start JVM:
    JAVA_OPTS="-Xms16g -Xmx16g -Xss256k -XX:+UseG1GC -XX:+UseCompressedOops -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5006" ./build/install/nrtsearch/bin/lucene-server ~/scratch/nrtsearch/talk_v1/nrtsearch_generic_replica.yaml

Start index in restore mode as replica:
    curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/talk/startIndexReplicaRestore.json

Stop index:
    curl -XPOST localhost:9900/v1/stop_index -d '{"indexName": "talk_v1"}'

Start index without restore:
    curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/talk/startIndexReplica.json

Logs (also verified using v1/indices):
    Jun 09, 2020 1:27:46 PM com.yelp.nrtsearch.server.grpc.LuceneServer$LuceneServerImpl startIndex
    INFO: StartIndexHandler returned maxDoc: 22562973 numDocs: 22562973

Primary

Start JVM:
    JAVA_OPTS="-Xms16g -Xmx16g -Xss256k -XX:+UseG1GC -XX:+UseCompressedOops -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5006" ./build/install/nrtsearch/bin/lucene-server ~/scratch/platypus/talk_v1/nrtsearch_generic_primary.yaml

Start index in restore mode as primary:
    curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/lists_v1/startIndexPrimaryRestore.json

Stop index:
    curl -XPOST localhost:9900/v1/stop_index -d '{"indexName": "talk_v1"}'

Start index in non-restore mode as primary:
    umesh@dev24-uswest1cdevc:~/scratch/platypus/lists_v1$ curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/lists_v1/startIndexPrimary.json

Logs:
    {"maxDoc":22562973,"numDocs":22562973,"segments":"StandardDirectoryReader(segments_2:501397:nrt

umeshdangat, Jun 09 '20 20:06

More detailed steps to reproduce:

  1. Delete all state, index and archiver directories
  2. Start JVM with restoreState: true

For primary:

  3. Start index without restore - fails with index not saved or committed message in exception (correction - no segments file found)
  4. Start index with restore - fails with directory already present exception

For replica:

  3. Start index without restore - works fine, index is started with 0 segments

sarthakn7, Jun 09 '20 21:06

Step 1 ensures that we delete all local state and index data. Step 2 ensures we get the state back (the names of indexes previously backed up/committed).

Step 3 is bad input for both primary and replica, since we are essentially saying "I have my previous state; use that and start the indexes I know of." We assume the index dir is present at this point.

  • Primary tries to create an IndexWriter and fails since there is no segments file (note: this is still not the error you reported above).
  • Replica does not try to create an IndexWriter and thus simply creates the stub dirs (which primary also does before it fails on creation of the IndexWriter).

Stack trace on failure to create IndexWriter (for primary step 3 above):

Jun 09, 2020 2:48:54 PM com.yelp.nrtsearch.server.luceneserver.StartIndexHandler handle
SEVERE: Cannot start IndexState/ShardState
org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MMapDirectory@/nail/home/umesh/nrtsearch/primary_index/talk_v1/shard0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@32a9cbe8): files: [write.lock]
        at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:841)
        at com.yelp.nrtsearch.server.luceneserver.ShardState.startPrimary(ShardState.java:654)
        at com.yelp.nrtsearch.server.luceneserver.StartIndexHandler.handle(StartIndexHandler.java:91)
        at com.yelp.nrtsearch.server.grpc.LuceneServer$LuceneServerImpl.startIndex(LuceneServer.java:363)
        at com.yelp.nrtsearch.server.grpc.LuceneServerGrpc$MethodHandlers.invoke(LuceneServerGrpc.java:2352)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
        at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)

So I think we can treat this as bad user input. That is, if we

  • already have a state dir
  • and do not have an index dir

then the only valid start_index operation in this state is restore, and any other start_index (without restore) should be rejected earlier. @sarthakn7 Let me know if this approach makes sense and I will code it up.
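
For illustration, the check proposed above could look roughly like the sketch below. This is not the nrtsearch implementation: `StartIndexPrecheck`, `hasCommittedIndex`, `validate`, and the error message are hypothetical names made up for this example, and it uses only `java.nio.file` rather than Lucene. It treats an index dir that is missing, or that holds no `segments*` file (for example only `write.lock`, as in the stack trace), as "no index dir".

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical pre-check for start_index; an assumption-laden sketch, not nrtsearch code. */
final class StartIndexPrecheck {

  private StartIndexPrecheck() {}

  /** Returns true if indexDir is a directory containing at least one segments* file. */
  static boolean hasCommittedIndex(Path indexDir) throws IOException {
    if (!Files.isDirectory(indexDir)) {
      return false;
    }
    // The glob matches commit files such as segments_2; write.lock alone does not count.
    try (DirectoryStream<Path> segments = Files.newDirectoryStream(indexDir, "segments*")) {
      return segments.iterator().hasNext();
    }
  }

  /**
   * Rejects a non-restore start_index when state exists but no index data does.
   * In that situation only a start_index with restore can succeed.
   */
  static void validate(Path stateDir, Path indexDir, boolean restore) throws IOException {
    boolean haveState = Files.isDirectory(stateDir);
    if (haveState && !restore && !hasCommittedIndex(indexDir)) {
      throw new IllegalArgumentException(
          "State exists but no index data found in "
              + indexDir
              + "; start_index must be called with restore");
    }
  }
}
```

Running such a check before any directories are created would also avoid leaving stub dirs behind, which is what makes the follow-up start_index with restore fail for the primary.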

umeshdangat, Jun 09 '20 21:06

@umeshdangat yes that makes sense 👍

sarthakn7, Jun 09 '20 22:06