hadoop_exporter
[Feature request] add a datanode-exporter
Building on the latest updates to https://github.com/Datatamer/hadoop_exporter, which already contains an extra exporter for the 'journalnode' (added by @laferrieren) 👍
@lhoss are there any specific stats you are hoping to get out of a datanode exporter?
I might have some time this week to add a basic exporter that maps 1:1 to a datanode (i.e. you would install the exporter on each server that runs a datanode).
Hi @laferrieren, having the base JVM metrics (as is currently done for the journalnode) is already a good start. (PS: the generic metrics collection code could be extracted and reused across all the exporters, to avoid too much code duplication.)
I went through all the available metrics (see below for reference), and here are the ones that would be especially useful, mostly around disk usage plus some VolumeFailures error/block metrics (the commented-out ones are less important):
"name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",
"VolumeFailures" : 0,
"DatanodeNetworkErrors" : 7,
# "ReadBlockOpNumOps" : 4,
# "ReadBlockOpAvgTime" : 155.25,
# "WriteBlockOpNumOps" : 2777,
# "WriteBlockOpAvgTime" : 11609.653943104076,
"name" : "Hadoop:service=DataNode,name=FSDatasetState-null",
"Remaining" : 34914304,
"Capacity" : 298764926976,
"DfsUsed" : 282915610624,
"NumFailedVolumes" : 0,
"EstimatedCapacityLostTotal" : 0,
# "NumBlocksCached" : 0,
# "NumBlocksFailedToCache" : 0,
# "NumBlocksFailedToUncache" : 869
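A minimal sketch of how an exporter could pick these values out of the `/jmx` response, assuming the payload wraps the beans in a top-level `beans` list. The `WANTED` mapping and the `select_metrics` helper are hypothetical names used for illustration, not part of hadoop_exporter. Bean names are matched by prefix because the suffix (`-dev4-50010`, `-null`) differs per host.

```python
import json

# Hypothetical selection: bean-name prefix -> attributes to export.
WANTED = {
    "Hadoop:service=DataNode,name=DataNodeActivity": [
        "VolumeFailures",
        "DatanodeNetworkErrors",
    ],
    "Hadoop:service=DataNode,name=FSDatasetState": [
        "Remaining",
        "Capacity",
        "DfsUsed",
        "NumFailedVolumes",
        "EstimatedCapacityLostTotal",
    ],
}


def select_metrics(jmx_payload):
    """Return {attribute: value} for the wanted attributes found in the payload."""
    selected = {}
    for bean in jmx_payload.get("beans", []):
        name = bean.get("name", "")
        for prefix, attributes in WANTED.items():
            if not name.startswith(prefix):
                continue
            for attribute in attributes:
                if attribute in bean:
                    selected[attribute] = bean[attribute]
    return selected


# Trimmed sample built from the values quoted above.
SAMPLE = json.loads("""
{"beans": [
  {"name": "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",
   "VolumeFailures": 0, "DatanodeNetworkErrors": 7},
  {"name": "Hadoop:service=DataNode,name=FSDatasetState-null",
   "Remaining": 34914304, "Capacity": 298764926976, "DfsUsed": 282915610624,
   "NumFailedVolumes": 0, "EstimatedCapacityLostTotal": 0}
]}
""")

if __name__ == "__main__":
    for key, value in sorted(select_metrics(SAMPLE).items()):
        print(key, value)
```

Keeping the selection in a plain mapping means the less important (commented-out) attributes could be enabled later by adding their names to the list.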
Datanode JMX metrics reference
For reference, here are all the DataNode-specific metrics theoretically available (the generic JVM metrics are omitted here):
$ curl -i http://localhost:50075/jmx | less
}, {
"name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",
"modelerType" : "DataNodeActivity-dev4-50010",
"tag.SessionId" : null,
"tag.Context" : "dfs",
"tag.Hostname" : "dev4",
"BytesWritten" : 371616975022,
"TotalWriteTime" : 2512619,
"BytesRead" : 2780495,
"TotalReadTime" : 229,
"BlocksWritten" : 2777,
"BlocksRead" : 4,
"BlocksReplicated" : 0,
"BlocksRemoved" : 869,
"BlocksVerified" : 0,
"BlockVerificationFailures" : 0,
"BlocksCached" : 0,
"BlocksUncached" : 0,
"ReadsFromLocalClient" : 0,
"ReadsFromRemoteClient" : 4,
"WritesFromLocalClient" : 1169,
"WritesFromRemoteClient" : 1608,
"BlocksGetLocalPathInfo" : 109,
"RemoteBytesRead" : 2780495,
"RemoteBytesWritten" : 214491758202,
"RamDiskBlocksWrite" : 0,
"RamDiskBlocksWriteFallback" : 0,
"RamDiskBytesWrite" : 0,
"RamDiskBlocksReadHits" : 0,
"RamDiskBlocksEvicted" : 0,
"RamDiskBlocksEvictedWithoutRead" : 0,
"RamDiskBlocksEvictionWindowMsNumOps" : 0,
"RamDiskBlocksEvictionWindowMsAvgTime" : 0.0,
"RamDiskBlocksLazyPersisted" : 0,
"RamDiskBlocksDeletedBeforeLazyPersisted" : 0,
"RamDiskBytesLazyPersisted" : 0,
"RamDiskBlocksLazyPersistWindowMsNumOps" : 0,
"RamDiskBlocksLazyPersistWindowMsAvgTime" : 0.0,
"FsyncCount" : 0,
"VolumeFailures" : 0,
"DatanodeNetworkErrors" : 7,
"ReadBlockOpNumOps" : 4,
"ReadBlockOpAvgTime" : 155.25,
"WriteBlockOpNumOps" : 2777,
"WriteBlockOpAvgTime" : 11609.653943104076,
"BlockChecksumOpNumOps" : 0,
"BlockChecksumOpAvgTime" : 0.0,
"CopyBlockOpNumOps" : 0,
"CopyBlockOpAvgTime" : 0.0,
"ReplaceBlockOpNumOps" : 0,
"ReplaceBlockOpAvgTime" : 0.0,
"HeartbeatsNumOps" : 914867,
"HeartbeatsAvgTime" : 2.2573554407361267,
"BlockReportsNumOps" : 130,
"BlockReportsAvgTime" : 19.200000000000006,
"IncrementalBlockReportsNumOps" : 8363,
"IncrementalBlockReportsAvgTime" : 4.572641396627999,
"CacheReportsNumOps" : 0,
"CacheReportsAvgTime" : 0.0,
"PacketAckRoundTripTimeNanosNumOps" : 4072945,
"PacketAckRoundTripTimeNanosAvgTime" : 2.6202025743151914E7,
"FlushNanosNumOps" : 5770250,
"FlushNanosAvgTime" : 20419.89643637526,
"FsyncNanosNumOps" : 0,
"FsyncNanosAvgTime" : 0.0,
"SendDataPacketBlockedOnNetworkNanosNumOps" : 55,
"SendDataPacketBlockedOnNetworkNanosAvgTime" : 3365114.454545455,
"SendDataPacketTransferNanosNumOps" : 55,
"SendDataPacketTransferNanosAvgTime" : 229675.47272727254
}, {
"name" : "Hadoop:service=DataNode,name=DataNodeInfo",
"modelerType" : "org.apache.hadoop.hdfs.server.datanode.DataNode",
"XceiverCount" : 2,
"DatanodeNetworkCounts" : [ {
"key" : "/",
"value" : [ {
"key" : "networkErrors",
"value" : 3
} ]
}, {
"key" : "/",
"value" : [ {
"key" : "networkErrors",
"value" : 1
} ]
}, {
"key" : "/",
"value" : [ {
"key" : "networkErrors",
"value" : 3
} ]
} ],
"Version" : "2.7.3",
"RpcPort" : "50020",
"HttpPort" : null,
"NamenodeAddresses" : "{\"dev1\":\"BP-843475092-\",\"dev4\":\"BP-843475092-\"}",
"VolumeInfo" : "{\"/mnt/hdfs01/hdfs-slave-datadir/current\":{\"usedSpace\":282915610624,\"freeSpace\":34914304,\"reservedSpace\":10737418240}}",
"ClusterId" : "CID-xxxxxx"
}, {
"name" : "Hadoop:service=DataNode,name=FSDatasetState-null",
"modelerType" : "org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl",
"Remaining" : 34914304,
"StorageInfo" : "FSDataset{dirpath='[/mnt/hdfs01/hdfs-slave-datadir/current]'}",
"Capacity" : 298764926976,
"DfsUsed" : 282915610624,
"CacheCapacity" : 0,
"CacheUsed" : 0,
"NumFailedVolumes" : 0,
"FailedStorageLocations" : [ ],
"LastVolumeFailureDate" : 0,
"EstimatedCapacityLostTotal" : 0,
"NumBlocksCached" : 0,
"NumBlocksFailedToCache" : 0,
"NumBlocksFailedToUncache" : 869
}, {
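Note that `VolumeInfo` in the `DataNodeInfo` bean is itself a JSON-encoded string, so per-volume numbers need a second decode step. A small sketch of that, plus a used-capacity ratio derived from `Capacity` and `DfsUsed`; the helper names `parse_volume_info` and `used_ratio` are hypothetical and only for illustration.

```python
import json


def parse_volume_info(raw):
    """Decode the JSON string held in the VolumeInfo attribute.

    Returns {volume_path: {"usedSpace": ..., "freeSpace": ..., "reservedSpace": ...}},
    or an empty dict when the attribute is missing or empty.
    """
    return json.loads(raw) if raw else {}


def used_ratio(capacity, dfs_used):
    """Fraction of the capacity consumed by HDFS data (0.0 if capacity is 0)."""
    return dfs_used / capacity if capacity else 0.0


# Values copied from the dump above.
VOLUME_INFO = (
    '{"/mnt/hdfs01/hdfs-slave-datadir/current":'
    '{"usedSpace":282915610624,"freeSpace":34914304,"reservedSpace":10737418240}}'
)

if __name__ == "__main__":
    for path, stats in parse_volume_info(VOLUME_INFO).items():
        print(path, stats["usedSpace"], stats["freeSpace"], stats["reservedSpace"])
    print(round(used_ratio(298764926976, 282915610624), 3))
```

Exposing each volume path as a label value would let disk-usage alerts fire per disk rather than per node, at the cost of one time series per volume.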
@laferrieren any progress?
@lhoss Sorry, I was a little absent-minded and forgot to push it up to GitHub. Here is a branch with a datanode exporter (https://github.com/laferrieren/hadoop_exporter/tree/datanode); I'm working on getting it merged upstream.
Awesome @laferrieren 👍 (PR https://github.com/Datatamer/hadoop_exporter/pull/4/files). We will gladly help with testing it!
Quick heads-up: I ran some initial tests, having already deployed the datanode exporter (from this PR commit) on our test cluster, and it is already collecting metrics. Working well so far 👍
On a side note, I found a small bug in the (duplicated) code shared by all the exporters, and thus also in the new datanode exporter: https://github.com/wyukawa/hadoop_exporter/issues/5 (PS: it would definitely be a great idea to refactor out the common logic of those exporters.)
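One possible shape for that refactoring, sketched with the standard library only. This is not how hadoop_exporter is currently structured: `fetch_beans`, `JmxCollector`, and `DataNodeDiskCollector` are hypothetical names, and the idea is just that the fetch-and-filter logic lives in one place while each exporter declares the beans and attributes it cares about.

```python
import json
from urllib.request import urlopen


def fetch_beans(url, timeout=5):
    """Fetch a Hadoop /jmx endpoint and return its list of beans."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response).get("beans", [])


class JmxCollector:
    """Shared logic: subclasses only declare which bean and attributes they want."""

    bean_prefix = ""   # e.g. "Hadoop:service=DataNode,name=FSDatasetState"
    attributes = ()

    def __init__(self, fetch=fetch_beans):
        # The fetch function is injectable so it can be replaced in tests.
        self._fetch = fetch

    def collect(self, url):
        """Return {attribute: value} for matching beans at the given URL."""
        values = {}
        for bean in self._fetch(url):
            if not bean.get("name", "").startswith(self.bean_prefix):
                continue
            for attribute in self.attributes:
                if attribute in bean:
                    values[attribute] = bean[attribute]
        return values


class DataNodeDiskCollector(JmxCollector):
    """Hypothetical datanode-specific subclass: configuration only, no logic."""

    bean_prefix = "Hadoop:service=DataNode,name=FSDatasetState"
    attributes = ("Remaining", "Capacity", "DfsUsed", "NumFailedVolumes")
```

With this layout, a bug in the shared fetch or filter code would be fixed once in `JmxCollector` instead of separately in every exporter.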