hadoop_exporter

[Feature request] add a datanode-exporter

Open lhoss opened this issue 7 years ago • 6 comments

This would build on the latest updates to https://github.com/Datatamer/hadoop_exporter, which already contains an extra exporter for the 'journalnode' (added by @laferrieren) 👍

lhoss avatar Sep 19 '17 16:09 lhoss

@lhoss are there any specific stats you are looking for or hoping to get out of a datanode exporter?

I might have some time this week to add a basic exporter that maps 1:1 to a datanode (i.e., you would install the exporter on each server that runs a datanode).

laferrieren avatar Sep 19 '17 17:09 laferrieren

Hi @laferrieren. Having the base JVM metrics (as is currently done for the journalnode) is already a good start. (PS: the generic metrics collection code could be extracted and reused across all the exporters, to avoid too much code duplication.)
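As a rough illustration of the "extract and reuse" idea, here is a minimal sketch of a shared helper that any of the exporters could call. The function names (`find_bean`, `jvm_memory_metrics`) are hypothetical and not taken from the repository, and the sample payload is made up; it only assumes that the /jmx output wraps its entries in a top-level "beans" list and includes the standard `java.lang:type=Memory` bean.

```python
import json


def find_bean(beans, name):
    """Return the first JMX bean whose "name" equals `name`, or None."""
    for bean in beans:
        if bean.get("name") == name:
            return bean
    return None


def jvm_memory_metrics(beans):
    """Extract heap / non-heap usage from the java.lang Memory bean.

    Hypothetical shared helper: every exporter (namenode, journalnode,
    datanode, ...) could call this instead of duplicating the logic.
    """
    memory = find_bean(beans, "java.lang:type=Memory") or {}
    metrics = {}
    for area, key in (("heap", "HeapMemoryUsage"),
                      ("nonheap", "NonHeapMemoryUsage")):
        usage = memory.get(key) or {}
        for field in ("used", "committed", "max"):
            if field in usage:
                metrics[(area, field)] = usage[field]
    return metrics


# Made-up sample payload in the assumed shape of the /jmx response.
sample = json.loads("""
{"beans": [{
  "name": "java.lang:type=Memory",
  "HeapMemoryUsage": {"committed": 1000, "init": 500, "max": 2000, "used": 750},
  "NonHeapMemoryUsage": {"committed": 300, "init": 100, "max": -1, "used": 250}
}]}
""")

jvm = jvm_memory_metrics(sample["beans"])
print(jvm[("heap", "used")])
```

Each service-specific exporter would then only need to add its own bean handling on top of a helper like this.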

I went through all the available metrics (see below for reference). Here are the ones that would be especially useful, mostly around disk usage plus some VolumeFailures error and block metrics (the commented-out ones are less important):

    "name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",

    "VolumeFailures" : 0,
    "DatanodeNetworkErrors" : 7,
#  "ReadBlockOpNumOps" : 4,
#  "ReadBlockOpAvgTime" : 155.25,
#  "WriteBlockOpNumOps" : 2777,
#  "WriteBlockOpAvgTime" : 11609.653943104076,

    "name" : "Hadoop:service=DataNode,name=FSDatasetState-null",

    "Remaining" : 34914304,
    "Capacity" : 298764926976,
    "DfsUsed" : 282915610624,
    "NumFailedVolumes" : 0,
    "EstimatedCapacityLostTotal" : 0,
#  "NumBlocksCached" : 0,
#  "NumBlocksFailedToCache" : 0,
#  "NumBlocksFailedToUncache" : 869
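To make the selection above concrete, here is a minimal sketch (not the exporter's actual code) that picks the uncommented metrics out of a list of JMX beans and renders them as Prometheus text lines. The `WANTED` table, the function names and the `hadoop_datanode_` prefix are assumptions for illustration. Bean names are matched by prefix because the suffix varies per host (`DataNodeActivity-dev4-50010`, `FSDatasetState-null`).

```python
BEAN_PREFIX = "Hadoop:service=DataNode,name="

# Metrics requested above; the commented-out (less important) ones are left out.
WANTED = {
    "DataNodeActivity": ["VolumeFailures", "DatanodeNetworkErrors"],
    "FSDatasetState": ["Remaining", "Capacity", "DfsUsed",
                       "NumFailedVolumes", "EstimatedCapacityLostTotal"],
}


def select_metrics(beans):
    """Return {metric_name: value} for the wanted DataNode metrics."""
    selected = {}
    for bean in beans:
        name = bean.get("name", "")
        if not name.startswith(BEAN_PREFIX):
            continue
        short = name[len(BEAN_PREFIX):]
        for group, keys in WANTED.items():
            if short.startswith(group):
                for key in keys:
                    if key in bean:
                        selected[key] = bean[key]
    return selected


def to_prometheus_text(metrics, prefix="hadoop_datanode_"):
    """Render metrics as 'name value' lines, sorted by metric name."""
    return "\n".join("%s%s %s" % (prefix, key, metrics[key])
                     for key in sorted(metrics))


# Values copied from the excerpt above.
beans = [
    {"name": "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",
     "VolumeFailures": 0, "DatanodeNetworkErrors": 7,
     "ReadBlockOpNumOps": 4},
    {"name": "Hadoop:service=DataNode,name=FSDatasetState-null",
     "Remaining": 34914304, "Capacity": 298764926976,
     "DfsUsed": 282915610624, "NumFailedVolumes": 0,
     "EstimatedCapacityLostTotal": 0, "NumBlocksCached": 0},
]

selected = select_metrics(beans)
text = to_prometheus_text(selected)
print(text)
```

Keeping the selection in a small table like `WANTED` would make it easy to enable the commented-out metrics later without touching the parsing logic.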

Datanode JMX metrics reference

For reference, here are all the DataNode-specific metrics that are theoretically available (generic JVM metrics are omitted here):

$ curl -i http://localhost:50075/jmx | less

...
}, {
    "name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",
    "modelerType" : "DataNodeActivity-dev4-50010",
    "tag.SessionId" : null,
    "tag.Context" : "dfs",
    "tag.Hostname" : "dev4",
    "BytesWritten" : 371616975022,
    "TotalWriteTime" : 2512619,
    "BytesRead" : 2780495,
    "TotalReadTime" : 229,
    "BlocksWritten" : 2777,
    "BlocksRead" : 4,
    "BlocksReplicated" : 0,
    "BlocksRemoved" : 869,
    "BlocksVerified" : 0,
    "BlockVerificationFailures" : 0,
    "BlocksCached" : 0,
    "BlocksUncached" : 0,
    "ReadsFromLocalClient" : 0,
    "ReadsFromRemoteClient" : 4,
    "WritesFromLocalClient" : 1169,
    "WritesFromRemoteClient" : 1608,
    "BlocksGetLocalPathInfo" : 109,
    "RemoteBytesRead" : 2780495,
    "RemoteBytesWritten" : 214491758202,
    "RamDiskBlocksWrite" : 0,
    "RamDiskBlocksWriteFallback" : 0,
    "RamDiskBytesWrite" : 0,
    "RamDiskBlocksReadHits" : 0,
    "RamDiskBlocksEvicted" : 0,
    "RamDiskBlocksEvictedWithoutRead" : 0,
    "RamDiskBlocksEvictionWindowMsNumOps" : 0,
    "RamDiskBlocksEvictionWindowMsAvgTime" : 0.0,
    "RamDiskBlocksLazyPersisted" : 0,
    "RamDiskBlocksDeletedBeforeLazyPersisted" : 0,
    "RamDiskBytesLazyPersisted" : 0,
    "RamDiskBlocksLazyPersistWindowMsNumOps" : 0,
    "RamDiskBlocksLazyPersistWindowMsAvgTime" : 0.0,
    "FsyncCount" : 0,
    "VolumeFailures" : 0,
    "DatanodeNetworkErrors" : 7,
    "ReadBlockOpNumOps" : 4,
    "ReadBlockOpAvgTime" : 155.25,
    "WriteBlockOpNumOps" : 2777,
    "WriteBlockOpAvgTime" : 11609.653943104076,
    "BlockChecksumOpNumOps" : 0,
    "BlockChecksumOpAvgTime" : 0.0,
    "CopyBlockOpNumOps" : 0,
    "CopyBlockOpAvgTime" : 0.0,
    "ReplaceBlockOpNumOps" : 0,
    "ReplaceBlockOpAvgTime" : 0.0,
    "HeartbeatsNumOps" : 914867,
    "HeartbeatsAvgTime" : 2.2573554407361267,
    "BlockReportsNumOps" : 130,
    "BlockReportsAvgTime" : 19.200000000000006,
    "IncrementalBlockReportsNumOps" : 8363,
    "IncrementalBlockReportsAvgTime" : 4.572641396627999,
    "CacheReportsNumOps" : 0,
    "CacheReportsAvgTime" : 0.0,
    "PacketAckRoundTripTimeNanosNumOps" : 4072945,
    "PacketAckRoundTripTimeNanosAvgTime" : 2.6202025743151914E7,
    "FlushNanosNumOps" : 5770250,
    "FlushNanosAvgTime" : 20419.89643637526,
    "FsyncNanosNumOps" : 0,
    "FsyncNanosAvgTime" : 0.0,
    "SendDataPacketBlockedOnNetworkNanosNumOps" : 55,
    "SendDataPacketBlockedOnNetworkNanosAvgTime" : 3365114.454545455,
    "SendDataPacketTransferNanosNumOps" : 55,
    "SendDataPacketTransferNanosAvgTime" : 229675.47272727254
 }, {
    "name" : "Hadoop:service=DataNode,name=DataNodeInfo",
    "modelerType" : "org.apache.hadoop.hdfs.server.datanode.DataNode",
    "XceiverCount" : 2,
    "DatanodeNetworkCounts" : [ {
      "key" : "/192.168.161.103",
      "value" : [ {
        "key" : "networkErrors",
        "value" : 3
      } ]
    }, {
      "key" : "/192.168.161.101",
      "value" : [ {
        "key" : "networkErrors",
        "value" : 1
      } ]
    }, {
      "key" : "/192.168.161.104",
      "value" : [ {
        "key" : "networkErrors",
        "value" : 3
      } ]
    } ],
    "Version" : "2.7.3",
    "RpcPort" : "50020",
    "HttpPort" : null,
    "NamenodeAddresses" : "{\"dev1\":\"BP-843475092-192.168.161.101-1483356716950\",\"dev4\":\"BP-843475092-192.168.161.101-1483356716950\"}",
    "VolumeInfo" : "{\"/mnt/hdfs01/hdfs-slave-datadir/current\":{\"usedSpace\":282915610624,\"freeSpace\":34914304,\"reservedSpace\":10737418240}}",
    "ClusterId" : "CID-xxxxxx"
  }, {
    "name" : "Hadoop:service=DataNode,name=FSDatasetState-null",
    "modelerType" : "org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl",
    "Remaining" : 34914304,
    "StorageInfo" : "FSDataset{dirpath='[/mnt/hdfs01/hdfs-slave-datadir/current]'}",
    "Capacity" : 298764926976,
    "DfsUsed" : 282915610624,
    "CacheCapacity" : 0,
    "CacheUsed" : 0,
    "NumFailedVolumes" : 0,
    "FailedStorageLocations" : [ ],
    "LastVolumeFailureDate" : 0,
    "EstimatedCapacityLostTotal" : 0,
    "NumBlocksCached" : 0,
    "NumBlocksFailedToCache" : 0,
    "NumBlocksFailedToUncache" : 869
  }, {
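Two fields in the `DataNodeInfo` bean above need extra handling before they can be exported: `VolumeInfo` is a JSON document encoded as a string, and `DatanodeNetworkCounts` is a nested key/value list. The following is a minimal sketch of how they might be flattened into per-volume and per-peer samples. The function names are hypothetical, and the sample values are copied from the dump above.

```python
import json


def volume_samples(datanode_info):
    """Return (volume_path, field, value) tuples from the VolumeInfo string."""
    volumes = json.loads(datanode_info.get("VolumeInfo") or "{}")
    return [(path, field, value)
            for path, stats in sorted(volumes.items())
            for field, value in sorted(stats.items())]


def network_error_samples(datanode_info):
    """Return {peer_address: networkErrors} from DatanodeNetworkCounts."""
    samples = {}
    for peer in datanode_info.get("DatanodeNetworkCounts") or []:
        host = peer["key"].lstrip("/")  # keys look like "/192.168.161.103"
        for counter in peer["value"]:
            if counter["key"] == "networkErrors":
                samples[host] = counter["value"]
    return samples


# Sample bean built from the values in the dump above.
info = {
    "name": "Hadoop:service=DataNode,name=DataNodeInfo",
    "DatanodeNetworkCounts": [
        {"key": "/192.168.161.103",
         "value": [{"key": "networkErrors", "value": 3}]},
        {"key": "/192.168.161.101",
         "value": [{"key": "networkErrors", "value": 1}]},
        {"key": "/192.168.161.104",
         "value": [{"key": "networkErrors", "value": 3}]},
    ],
    "VolumeInfo": json.dumps({
        "/mnt/hdfs01/hdfs-slave-datadir/current": {
            "usedSpace": 282915610624,
            "freeSpace": 34914304,
            "reservedSpace": 10737418240,
        }
    }),
}

errors = network_error_samples(info)
volumes = volume_samples(info)
print(sum(errors.values()))
```

In this dump the per-peer error counts add up to the `DatanodeNetworkErrors` value of 7 reported by the `DataNodeActivity` bean, so the per-peer breakdown could be exposed as a label rather than as a separate total.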

lhoss avatar Sep 20 '17 08:09 lhoss

@laferrieren any progress?

lhoss avatar Oct 17 '17 11:10 lhoss

@lhoss Sorry, I was a little absent-minded and forgot to push it to GitHub. Here is a branch with a datanode exporter (https://github.com/laferrieren/hadoop_exporter/tree/datanode); I'm working on getting it merged back upstream.

laferrieren avatar Oct 17 '17 13:10 laferrieren

Awesome @laferrieren 👍 (PR: https://github.com/Datatamer/hadoop_exporter/pull/4/files). We will gladly help with testing it!

lhoss avatar Oct 17 '17 13:10 lhoss

Quick heads-up: I did some initial tests. I deployed the datanode exporter (from this PR commit) on our test cluster and have already collected metrics. It is working well so far 👍

On a side note, I found a small bug in the code that is duplicated across all exporters, and thus also in the new datanode exporter: https://github.com/wyukawa/hadoop_exporter/issues/5 (PS: it would definitely be a great idea to factor out the common logic of those exporters).

lhoss avatar Oct 23 '17 15:10 lhoss