                        [BUG] Unrecoverable translog / NPE exception on index/bulk/write
Describe the bug
This behavior is not predictable. At times, while performing a series of bulk write operations on an index, we observe the following translog exception with no recovery. Once it is observed for one index, the cluster begins returning the same error for all API calls; even GET _cluster/health starts consistently returning null_pointer_exception. API queries that had executed successfully earlier also return the same error consistently.
[2023-11-16T21:28:31,752][WARN ][r.suppressed             ] [es-data-1] path: /agent-jobs/_doc/d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50, params: {index=agent-jobs, id=d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50}
org.opensearch.transport.RemoteTransportException: [es-data-1][172.62.0.91:9300][indices:data/write/bulk[s][p]]
Caused by: org.opensearch.index.translog.TranslogException: Failed to write operation [Index{id='d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50', type='_doc', seqNo=2397194, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}]
	at org.opensearch.index.translog.Translog.add(Translog.java:628) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.NullPointerException
	at org.opensearch.index.translog.Translog$Index.write(Translog.java:1315) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Index.access$300(Translog.java:1196) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Operation.writeOperation(Translog.java:1169) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.writeOperationNoSize(Translog.java:1706) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.add(Translog.java:596) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
Upon observing this issue consistently with no recovery, we disabled all application requests to OpenSearch and attempted a rollout restart of the StatefulSet, after which many shards went into an unassigned state. GET _cluster/allocation/explain also returned a null_pointer_exception, and allocation exhausted all 5 retries. The cluster does not recover even after resetting the allocation retries.
The only resolution we have found is to completely delete the StatefulSet (without deleting the PVCs) and recreate it.
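For reference, the recovery attempts above map onto the standard cluster APIs. A minimal sketch of those calls using the Python requests library follows; the endpoint and credentials are placeholders, not our actual deployment code:

```python
# Sketch of the recovery checks described above.
# Assumptions: local endpoint, basic auth, self-signed certs -- all placeholders.
import requests

OS = "https://localhost:9200"
AUTH = ("admin", "admin")

# Ask the cluster why shards are unassigned; in our case this call itself
# returned a null_pointer_exception.
explain = requests.get(f"{OS}/_cluster/allocation/explain", auth=AUTH, verify=False)
print(explain.status_code, explain.text)

# Reset the failed-allocation counters and retry assigning unassigned shards;
# this did not recover the cluster either.
reroute = requests.post(f"{OS}/_cluster/reroute?retry_failed=true", auth=AUTH, verify=False)
print(reroute.status_code, reroute.text)
```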
To Reproduce
- Deploy OpenSearch as a 3-replica StatefulSet in a Kubernetes environment, with each node holding the master, data, and ingest roles
- Create many indices, each with 3 primary and 2 replica shards (3+2, p+r)
- Perform bulk writes into one-third of the indices at a rate that exhausts the active bulk thread pool (while applying exponential backoff whenever OpenSearch rejects a write request because a node's bulk queue is full; see the sketch after this list)
- Once the translog exception and null_pointer_exception appear in all API calls, including bulk/write and _cluster/health, attempt a rollout restart of the StatefulSet; the _cluster/allocation/explain response will then also return an NPE
- Deleting and recreating the StatefulSet (without deleting the PVCs) resolves the issue
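A minimal sketch of the ingest pattern from the reproduction steps, assuming the Python requests library; the endpoint, credentials, document shape, and ID scheme are placeholders rather than our actual application code:

```python
# Sketch of sustained bulk ingest with exponential backoff on rejections.
# Placeholders: endpoint, credentials, document shape, and ID scheme.
import json
import time
import uuid

import requests

OS = "https://localhost:9200"
AUTH = ("admin", "admin")
INDEX = "agent-jobs"  # index name taken from the log above


def bulk_body(num_docs):
    """Build an NDJSON _bulk body of index actions with explicit doc IDs."""
    lines = []
    for _ in range(num_docs):
        doc_id = str(uuid.uuid4())  # explicit (non-autogenerated) IDs
        lines.append(json.dumps({"index": {"_index": INDEX, "_id": doc_id}}))
        lines.append(json.dumps({"status": "running"}))
    return "\n".join(lines) + "\n"


def bulk_with_backoff(num_docs, max_retries=5):
    """Send one _bulk request, backing off exponentially when the node rejects it."""
    body = bulk_body(num_docs)
    for attempt in range(max_retries):
        resp = requests.post(
            f"{OS}/_bulk",
            data=body,
            headers={"Content-Type": "application/x-ndjson"},
            auth=AUTH,
            verify=False,
        )
        # 429 means the node's bulk queue is full; wait and retry.
        # (A fuller client would also inspect per-item rejections in the response.)
        if resp.status_code == 429:
            time.sleep(2 ** attempt)
            continue
        return resp.json()
    raise RuntimeError("bulk request kept being rejected")


if __name__ == "__main__":
    while True:  # sustained load, as in the reproduction steps
        bulk_with_backoff(500)
```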
Expected behavior
Bulk writes should succeed. Even if the OpenSearch cluster returns a failure for a faulty request, it should not fail every subsequent API call with the translog exception and no path to recovery.
Plugins
opensearch-security
Host/Environment:
- OS: Ubuntu
- Version: 20.04
@ankitkala @macohen could you please update the issue with the RCA here and an ETA for the fix? Thanks.
Please let us know the RCA; we want to understand why this issue is happening. Once it happens, the cluster never recovers, and we have to restart the affected nodes or sometimes the entire cluster.
@macohen @ankitkala Any help would be appreciated; we would like to know what we can do to avoid this issue.
Any update on this would be appreciated, @macohen @ankitkala.
It looks like we're seeing an NPE while writing to the translog here.
The doc ID here seems to be null. Are you using autogenerated IDs? If not, can you check the doc IDs for the docs ingested on your end?
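For example, something along these lines (Python requests; the endpoint and credentials are placeholders, and the index name is taken from the log above) would dump the stored IDs of a few documents for a quick check:

```python
# Spot-check the _id values actually stored in the affected index.
# Placeholders: endpoint and credentials; index name comes from the log above.
import requests

OS = "https://localhost:9200"
AUTH = ("admin", "admin")

resp = requests.get(
    f"{OS}/agent-jobs/_search",
    json={"size": 20, "_source": False},
    auth=AUTH,
    verify=False,
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"])
```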
I'm wondering where the NPE is actually originating from.
- I've also verified the stack trace, and the code lines map correctly to the code in 1.2, so the code link I've shared in the comment above is correct.
- Looking at the stack trace, I can see that id shouldn't be null in this case, as the trace is logging the index operation, which does have a valid doc ID. Also, if id were null, the NPE should originate within the out.writeString method.
- The only possible source of the NPE in this case would be out being null, which also shouldn't be the case, given that the out stream is used in the line above as well.