nrtsearch
Different behaviors for start index in primary and replica
Following are the results when nrtsearch is started with restored state and start index is called:
- Primary: start index fails with an "index not saved or committed" message in the exception (correction: the message is "no segments file found"). A subsequent start index with restore also fails, since the directories were already created.
- Replica: start index works and index is started with 0 segments. It also didn't seem like the replica was retrieving the segments from primary after this.
@sarthakn7 I was not able to reproduce either of the behaviors you mention above.
These are the steps I tried to reproduce the scenarios you mention:
Replica:
1. Start JVM:
   JAVA_OPTS="-Xms16g -Xmx16g -Xss256k -XX:+UseG1GC -XX:+UseCompressedOops -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5006" ./build/install/nrtsearch/bin/lucene-server ~/scratch/nrtsearch/talk_v1/nrtsearch_generic_replica.yaml
2. Start index in restore mode as replica:
   curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/talk/startIndexReplicaRestore.json
3. Stop index:
   curl -XPOST localhost:9900/v1/stop_index -d '{"indexName": "talk_v1"}'
4. Start index without restore:
   curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/talk/startIndexReplica.json
5. Logs (also verified using v1/indices):
   Jun 09, 2020 1:27:46 PM com.yelp.nrtsearch.server.grpc.LuceneServer$LuceneServerImpl startIndex
   INFO: StartIndexHandler returned maxDoc: 22562973 numDocs: 22562973
Primary:
1. Start JVM:
   JAVA_OPTS="-Xms16g -Xmx16g -Xss256k -XX:+UseG1GC -XX:+UseCompressedOops -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5006" ./build/install/nrtsearch/bin/lucene-server ~/scratch/platypus/talk_v1/nrtsearch_generic_primary.yaml
2. Start index in restore mode as primary:
   curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/lists_v1/startIndexPrimaryRestore.json
3. Stop index:
   curl -XPOST localhost:9900/v1/stop_index -d '{"indexName": "talk_v1"}'
4. Start index in non-restore mode as primary:
   umesh@dev24-uswest1cdevc:~/scratch/platypus/lists_v1$ curl -XPOST localhost:9900/v1/start_index -d @/nail/home/umesh/scratch/platypus/lists_v1/startIndexPrimary.json
5. Logs:
   {"maxDoc":22562973,"numDocs":22562973,"segments":"StandardDirectoryReader(segments_2:501397:nrt
More detailed steps to reproduce:
1. Delete all state, index and archiver directories
2. Start JVM with restoreState: true

For primary:
3. Start index without restore - fails with index not saved or committed message in exception (correction - no segments file found)
4. Start index with restore - fails with directory already present exception

For replica:
3. Start index without restore - works fine, index is started with 0 segments
Step 1 ensures that we delete all local state and index data. Step 2 ensures we get the state back (the names of indexes previously backed up/committed).
Step 3 is bad input for both primary and replica, since we are essentially saying "I have my previous state, use that and start the indexes I know of." We assume the index dir is present at this time.
- Primary tries to create an IndexWriter and fails since we have no segments file (Note: this is still not the error you reported above)
- Replica does not try to create an IndexWriter and thus simply creates the stub dirs (which primary also does before it fails on creation of the IndexWriter).
Stack Trace on failure to create IndexWriter (for primary:3 above)
Jun 09, 2020 2:48:54 PM com.yelp.nrtsearch.server.luceneserver.StartIndexHandler handle
SEVERE: Cannot start IndexState/ShardState
org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(MMapDirectory@/nail/home/umesh/nrtsearch/primary_index/talk_v1/shard0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@32a9cbe8): files: [write.lock]
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:841)
at com.yelp.nrtsearch.server.luceneserver.ShardState.startPrimary(ShardState.java:654)
at com.yelp.nrtsearch.server.luceneserver.StartIndexHandler.handle(StartIndexHandler.java:91)
at com.yelp.nrtsearch.server.grpc.LuceneServer$LuceneServerImpl.startIndex(LuceneServer.java:363)
at com.yelp.nrtsearch.server.grpc.LuceneServerGrpc$MethodHandlers.invoke(LuceneServerGrpc.java:2352)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
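The exception message above lists the contents of the index directory as only write.lock, which is why no commit point could be found. As a rough illustration (not Lucene's actual implementation), the condition can be approximated by looking for a file whose name starts with "segments_" in a directory listing. The class and method names here are hypothetical, and the prefix check is a simplifying assumption:

```java
import java.util.Collection;
import java.util.List;

/** Hypothetical helper for illustration; not part of nrtsearch or Lucene. */
final class SegmentsCheck {
  private SegmentsCheck() {}

  // Simplifying assumption: a committed index directory contains at least
  // one file whose name starts with "segments_" (for example segments_2).
  static boolean hasSegmentsFile(Collection<String> fileNames) {
    for (String name : fileNames) {
      if (name.startsWith("segments_")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Listing from the exception message: only the lock file is present.
    System.out.println(hasSegmentsFile(List.of("write.lock")));
    // Listing that also contains a commit point.
    System.out.println(hasSegmentsFile(List.of("write.lock", "segments_2")));
  }
}
```

With the listing from the trace, the check returns false, which matches the primary failing while the stub directories already exist.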
So I think we can deal with this as bad user input. That is, if we:
- already have a state dir
- and we do not have an index dir
then the only valid start_index operation in this state is a restore, and any other start_index (without restore) should be rejected sooner. @sarthakn7 Let me know if this approach makes sense and I will code it up.
@umeshdangat yes that makes sense 👍
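A minimal sketch of the early rejection agreed on above, assuming the decision depends only on whether the state dir exists, whether the index dir exists, and whether the request asks for a restore. StartIndexGuard and rejectionReason are hypothetical names, not existing nrtsearch APIs, and the error text is illustrative:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical guard for illustration; not an existing nrtsearch class. */
final class StartIndexGuard {
  private StartIndexGuard() {}

  // Pure decision: returns a rejection reason when state exists, index data
  // is missing, and the request is not a restore. Otherwise returns empty.
  static Optional<String> rejectionReason(
      boolean stateDirExists, boolean indexDirExists, boolean restore) {
    if (stateDirExists && !indexDirExists && !restore) {
      return Optional.of(
          "state dir exists but index dir is missing: start_index must be called with restore");
    }
    return Optional.empty();
  }

  // Convenience overload that inspects the file system.
  static Optional<String> rejectionReason(Path stateDir, Path indexDir, boolean restore) {
    return rejectionReason(Files.isDirectory(stateDir), Files.isDirectory(indexDir), restore);
  }

  public static void main(String[] args) {
    // State restored, no index data, start without restore: rejected.
    System.out.println(rejectionReason(true, false, false).isPresent());
    // Same situation, but the request asks for a restore: accepted.
    System.out.println(rejectionReason(true, false, true).isPresent());
  }
}
```

Keeping the decision in a pure function makes it easy to call at the top of the start_index handler, before any stub directories are created, so that a later start_index with restore does not hit the "directory already present" failure.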