
[BUG] Unrecoverable translog / NPE exception on index/bulk/write

Open kushb5 opened this issue 1 year ago • 6 comments

Describe the bug This behavior is not predictably reproducible. At times, when we run a series of bulk write operations against an index, we observe the following translog exception with no recovery. Once it appears for one index, the cluster begins returning the same error for all API calls; even GET _cluster/health starts returning null_pointer_exception consistently. API queries that had executed successfully earlier also return the same error consistently.

[2023-11-16T21:28:31,752][WARN ][r.suppressed             ] [es-data-1] path: /agent-jobs/_doc/d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50, params: {index=agent-jobs, id=d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50}
org.opensearch.transport.RemoteTransportException: [es-data-1][172.62.0.91:9300][indices:data/write/bulk[s][p]]
Caused by: org.opensearch.index.translog.TranslogException: Failed to write operation [Index{id='d7044380-8c65-4191-8kd6-0aef8259da2c+0+agent-1_deca0d6f-176a-400b-4d79-44b833e44a50', type='_doc', seqNo=2397194, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}]
	at org.opensearch.index.translog.Translog.add(Translog.java:628) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) [opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.NullPointerException
	at org.opensearch.index.translog.Translog$Index.write(Translog.java:1315) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Index.access$300(Translog.java:1196) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog$Operation.writeOperation(Translog.java:1169) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.writeOperationNoSize(Translog.java:1706) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.translog.Translog.add(Translog.java:596) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.engine.InternalEngine.index(InternalEngine.java:1073) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.index(IndexShard.java:970) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:914) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:815) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:351) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:212) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:276) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:165) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:98) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:220) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
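For reference, a minimal sketch (not our exact tooling; it assumes the Python opensearch-py client and a cluster reachable on localhost:9200, so adjust hosts/auth to your setup) of the kind of probe that surfaces the stuck state: once the issue occurs, even this plain _cluster/health call fails consistently.

# Minimal sketch, assuming opensearch-py and a cluster on localhost:9200.
# Probes _cluster/health and reports when the call itself starts failing
# (in our case with the null_pointer_exception described above).
from opensearchpy import OpenSearch
from opensearchpy.exceptions import TransportError

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

try:
    health = client.cluster.health()
    print("cluster status:", health["status"])
except TransportError as e:
    # In the failure mode reported here, this branch fires for every call.
    print("cluster health call failed:", e)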

Upon observing this issue consistently with no recovery, we disabled all application requests sent to OpenSearch and attempted a rollout restart of the StatefulSet, after which many shards went into the unassigned state. GET _cluster/allocation/explain also returned a null pointer exception, and allocation exhausted all 5 retries. The cluster does not recover even after resetting the allocation retries.
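For anyone hitting the same state, a rough sketch of the recovery attempts described above, again assuming the opensearch-py client and a locally reachable cluster; both calls are standard cluster APIs, but in our case allocation_explain itself returned the null pointer exception.

# Sketch of the recovery attempts described above (assumes opensearch-py;
# host and parameters are illustrative).
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Ask why a shard is unassigned (the cluster picks an arbitrary unassigned
# shard when no body is given).
explanation = client.cluster.allocation_explain()
print(explanation)

# Reset the exhausted allocation retries and ask the cluster to try again.
client.cluster.reroute(params={"retry_failed": "true"})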

The final resolution was to delete the StatefulSet completely (without deleting the PVCs) and recreate it.

To Reproduce

  1. Deploy OpenSearch as a 3-replica StatefulSet in a Kubernetes environment, with each node holding the master, data, and ingest roles.
  2. Create many indices, each with 3+2 (primary + replica) shards.
  3. Perform bulk writes into one-third of the indices at a rate that exhausts the active bulk thread pool (we do apply exponential backoff whenever OpenSearch rejects a write because a node's bulk queue is full; see the sketch after this list).
  4. Once the translog exception and the null pointer exception appear on all API calls, including bulk/write and _cluster/health, attempt a rollout restart of the StatefulSet; the allocation/explain response will also return the NPE.
  5. Deleting and recreating the StatefulSet (without deleting the PVCs) resolves the issue.
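A minimal sketch of the write pattern in step 3, assuming the Python opensearch-py client; the index name, the documents, and the bulk_with_backoff helper are illustrative placeholders rather than our actual ingestion code.

# Sketch of the bulk-write-with-backoff pattern from step 3 (assumes
# opensearch-py; index name and documents are placeholders).
import time
from opensearchpy import OpenSearch, helpers
from opensearchpy.exceptions import TransportError

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def bulk_with_backoff(actions, max_retries=5):
    delay = 1
    for attempt in range(max_retries):
        try:
            return helpers.bulk(client, actions)
        except TransportError:
            # The node's bulk queue is full (429) or the write failed;
            # back off exponentially and retry the batch.
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("bulk request failed after retries")

actions = [
    {"_index": "agent-jobs", "_id": f"doc-{i}", "_source": {"n": i}}
    for i in range(1000)
]
bulk_with_backoff(actions)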

Expected behavior Bulk writes should succeed. Even if the OpenSearch cluster returns a failure for a faulty request, it should not fail entirely with the translog exception on every API call, with no way to recover.

Plugins opensearch-security


Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version: 20.04


kushb5 avatar Nov 27 '23 22:11 kushb5

@ankitkala @macohen could you please update the issue with the RCA here and an ETA for the fix? Thanks.

kushb5 avatar Jan 16 '24 11:01 kushb5

Please let us know the RCA; we want to understand why this issue is happening. Once it happens, it never recovers, and we have to restart the affected nodes or sometimes the entire cluster.

nitish2787 avatar Jan 18 '24 20:01 nitish2787

@macohen @ankitkala any help would be appreciated. We would like to know what we can do to avoid this issue.

nitish2787 avatar Mar 06 '24 18:03 nitish2787

Any update on this would be appreciated, @macohen @ankitkala.

kushb5 avatar Apr 08 '24 21:04 kushb5

It looks like we're seeing an NPE while writing to the translog here.

The doc ID here seems to be null. Are you using autogenerated IDs? If not, can you check the doc IDs for the documents ingested on your end?
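For example, a rough check along these lines (plain Python with placeholder data, not tied to any particular client) would confirm that every bulk action carries a non-empty ID before it is sent:

# Rough sketch of the suggested check (placeholder data): make sure every
# document going into the bulk request carries a non-empty string ID.
def check_doc_ids(actions):
    bad = [a for a in actions
           if not isinstance(a.get("_id"), str) or not a["_id"]]
    if bad:
        raise ValueError(f"{len(bad)} bulk action(s) have a missing/empty _id")

actions = [
    {"_index": "agent-jobs", "_id": "d7044380+0+agent-1", "_source": {"ok": True}},
    {"_index": "agent-jobs", "_source": {"ok": False}},  # missing _id -> flagged
]
check_doc_ids(actions)  # raises ValueError for the second action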

ankitkala avatar Apr 09 '24 04:04 ankitkala

I'm wondering where the NPE is actually originating from.

  • I've verified that the stack trace and the code lines map correctly to the code in 1.2, so the code link I shared in the comment above is correct.
  • Looking at the stack trace, I can see that the id shouldn't be null in this case, since the trace logs the index operation, which does contain a valid doc ID. Also, if the id were null, the NPE should originate from within the out.writeString method.
  • The only other possible source of the NPE here is out being null, which also shouldn't be the case, given that the out stream is used in the line above as well.

ankitkala avatar Apr 09 '24 07:04 ankitkala