HDDS-10488. Datanode OOM due to running out of mmap handles
What changes were proposed in this pull request?
The DN hit an OOM with the following stack trace, observed in a test cluster:
6:52:03.601 AM WARN KeyValueHandler Operation: ReadChunk , Trace ID: , Message: java.io.IOException: Map failed , Result: IO_EXCEPTION , StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.io.IOException: Map failed
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:471)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:226)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:260)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:194)
at org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:197)
at org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:112)
at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:773)
at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:262)
at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:225)
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:335)
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:183)
at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:182)
at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112)
at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105)
at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:938)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$5(ChunkUtils.java:264)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$4(ChunkUtils.java:218)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:411)
at org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:215)
... 24 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:935)
... 28 more
The root cause is a platform limit on the maximum number of mapped regions per process. On Linux this limit (vm.max_map_count) defaults to 65530. Every FileChannel.map() call consumes one entry of that quota, so once the DN exhausts max_map_count, further map attempts fail with this OOM exception.
A mapped buffer is only released by Java GC after the data has been sent out, so under a heavy read workload there is a chance that the DN exceeds max_map_count at some point.
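For illustration only (not Ozone code), the failure mode boils down to something like the following toy loop: each FileChannel.map() call adds one region to the process's mapping table, and nothing removes it until the MappedByteBuffer becomes unreachable and is garbage-collected.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Toy reproduction of the failure mode: every map() call adds one entry to
// the process's mapped-region table. If the buffers stay reachable (or GC
// simply has not run yet), the count keeps growing until the native map0
// call fails with "OutOfMemoryError: Map failed" once vm.max_map_count is
// reached.
public class MapExhaustionDemo {
  public static void main(String[] args) throws Exception {
    Path file = Path.of(args[0]);
    List<MappedByteBuffer> live = new ArrayList<>();
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long len = Math.min(4096, ch.size());           // stay within the file
      for (int i = 0; i < 100_000; i++) {             // > default 65530 limit
        live.add(ch.map(MapMode.READ_ONLY, 0, len));  // one region per call
      }
    }
  }
}
```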
This task adds an upper limit, "ozone.chunk.read.mapped.buffer.max.count", which defaults to 0. Since the max_map_count setting can vary from platform to platform, or even between hosts on the same platform, it is better to let the admin/user pick an appropriate maximum count themselves.
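The DN logs in the test section below show permits being decreased when a chunk is mapped and increased again on the Finalizer thread once the buffer is reclaimed. The following is only a minimal sketch of that kind of permit-bounded mapping, assuming a Semaphore sized from ozone.chunk.read.mapped.buffer.max.count and a GC-triggered release (modelled here with java.lang.ref.Cleaner); the class name, the fallback to a plain buffered read, and the exact release point are illustrative assumptions, not necessarily what ChunkUtils does.

```java
import java.io.IOException;
import java.lang.ref.Cleaner;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Semaphore;

// Sketch of a permit-bounded chunk reader: at most maxCount MappedByteBuffers
// may be outstanding at any time, so the process cannot exhaust the kernel's
// vm.max_map_count no matter how heavy the read workload gets.
public final class BoundedMappedReader {
  private static final Cleaner CLEANER = Cleaner.create();
  // Sized from ozone.chunk.read.mapped.buffer.max.count.
  private final Semaphore permits;

  public BoundedMappedReader(int maxCount) {
    this.permits = new Semaphore(maxCount);
  }

  // Returns a mapped view of [offset, offset + len) if a permit is available,
  // otherwise falls back to an ordinary heap read so the request still succeeds.
  public ByteBuffer read(Path file, long offset, int len) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      if (permits.tryAcquire()) {
        try {
          MappedByteBuffer mapped = channel.map(MapMode.READ_ONLY, offset, len);
          // Return the permit once the buffer is reclaimed by GC; the real
          // release point in Ozone may differ (the logs below show the
          // Finalizer thread doing it).
          CLEANER.register(mapped, permits::release);
          return mapped;
        } catch (IOException e) {
          permits.release();   // do not leak the permit if map() fails
          throw e;
        }
      }
      // No permit available: copy through a regular buffer instead of mapping.
      ByteBuffer copy = ByteBuffer.allocate(len);
      long pos = offset;
      while (copy.hasRemaining()) {
        int n = channel.read(copy, pos);
        if (n < 0) {
          break;               // hit EOF before filling the buffer
        }
        pos += n;
      }
      copy.flip();
      return copy;
    }
  }
}
```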
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10488
How was this patch tested?
Manual test
- set up a docker cluster
- put a 56MB file
- get the same file
- check DN logs
2024-05-17 07:25:26,649 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-4] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 127
2024-05-17 07:25:26,653 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-4] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:25:26,713 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-9] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 126
2024-05-17 07:25:26,714 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-9] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:25:26,747 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 125
2024-05-17 07:25:26,748 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:25:26,765 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 124
2024-05-17 07:25:26,765 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:25:26,889 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 123
2024-05-17 07:25:26,889 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-5] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:25:26,905 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-8] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 122
2024-05-17 07:25:26,905 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-8] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:25:26,929 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: memmap semaphore permits decreased by 1 to total 121
2024-05-17 07:25:26,930 [490a03d2-63e7-45c0-be99-2c534a32c625-ChunkReader-1] INFO helpers.ChunkUtils: mapped: offset=0, readLen=0, n=1048576, class java.nio.DirectByteBufferR
2024-05-17 07:26:05,652 [490a03d2-63e7-45c0-be99-2c534a32c625-BlockDeletingService#0] INFO interfaces.ContainerDeletionChoosingPolicyTemplate: Chosen 0/5000 blocks from 0 candidate containers.
2024-05-17 07:27:05,654 [490a03d2-63e7-45c0-be99-2c534a32c625-BlockDeletingService#2] INFO interfaces.ContainerDeletionChoosingPolicyTemplate: Chosen 0/5000 blocks from 0 candidate containers.
2024-05-17 07:28:05,655 [490a03d2-63e7-45c0-be99-2c534a32c625-BlockDeletingService#1] INFO interfaces.ContainerDeletionChoosingPolicyTemplate: Chosen 0/5000 blocks from 0 candidate containers.
2024-05-17 07:28:51,294 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 122
2024-05-17 07:28:51,294 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 123
2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 124
2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 125
2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 126
2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 127
2024-05-17 07:28:51,295 [Finalizer] INFO helpers.ChunkUtils: memmap semaphore permits increased by 1 to total 128
pmap was used to verify the mapped-region status. After the mapped-buffer semaphore permits were released, the corresponding mapped regions disappeared from the pmap output.