
ldms plugin component_id design change

Open morrone opened this issue 5 years ago • 6 comments

All ldms plugins are currently required to honor a "component_id" option and store it in a metadata metric in all of its metric sets. The data type of component_id is currently an LDMS_V_U64.

As I understand it, the intention is that this number will uniquely identify the component on which the ldmsd is running, e.g. a node. So one common use appears to be that if one has a compute cluster of 1000 nodes, each of the nodes will be assigned a unique integer which is passed to every ldms plugin and in turn stored in the component_id of each metric set on a node.

This probably works fine when there is only a single cluster in one's center, or if every cluster in a center has its own separate monitoring system. However it begins to become an undesirable burden when scaled up.

Consider a site that has 15 clusters and all of the monitoring data from each of the clusters is combined into a single central monitoring database. One might imagine that the configuration of such a system becomes a challenge in general. Now we try to add ldms to that system, and suddenly the system administrators have a new configuration burden to uniquely assign integer values to all nodes across all of the clusters, even though those clusters may be maintained by many different people at various times.

The sysadmins might rightfully argue that they already have a unique way to identify nodes: hostnames. The additional integer component ids just for ldms are an added configuration burden that they would not want.

So I would recommend that we address this issue by changing the type of component_id from an integer to a string. People who wish to assign unique integers to all nodes can still do that, because a number can be stored in string form (and possibly converted into a real integer before insertion into the final monitoring database). But a hostname cannot be stored in an integer.
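To illustrate the "convert before insertion" step, here is a minimal sketch in Python. The function name `to_store_value` is hypothetical and is not part of ldms; it only assumes that a string component_id arrives at the storage side and that purely numeric ids should become integers again.

```python
def to_store_value(component_id: str):
    """Return an int for purely numeric ids, otherwise the original string.

    Hypothetical helper for illustration only; not an ldms API.
    """
    text = component_id.strip()
    # Only ASCII digit strings are converted, so a hostname such as
    # "chama23" is passed through unchanged.
    if text.isascii() and text.isdigit():
        return int(text)
    return text
```

With this, a site that uses integer ids and a site that uses hostnames can share the same string-typed field.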

A string component_id also accommodates other components that might not look exactly like a linux node (network switches?).

I would also recommend that the default value of the component_id string be the hostname on systems that support an API like gethostname(). This too is not in conflict with those who wish to use integer component_id values, because those need to be set manually anyway. But for those who wish to use hostnames, it eliminates another configuration burden.

Also, it would probably be best to have the default value determined in ldmsd proper rather than requiring each plugin to individually develop methods to determine the default.
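The defaulting rule could be sketched as follows. This is only an illustration in Python of the proposed behavior, not how ldmsd does or would implement it; `resolve_component_id` is a hypothetical name, and `socket.gethostname()` stands in for the C gethostname() call.

```python
import socket
from typing import Optional


def resolve_component_id(configured: Optional[str] = None) -> str:
    """Pick the component_id string once, centrally, for all plugins.

    Hypothetical helper: an explicitly configured value (including an
    integer written as a string) wins; otherwise fall back to the hostname.
    """
    if configured:
        return configured
    return socket.gethostname()
```

Doing this in one place means individual plugins never need their own fallback logic.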

For complex systems, it should always be our goal to make configuration be as simple as reasonable. It is possible that in many cases, with careful schema design, a plugin can operate with little or no configuration decisions on the part of the user.

morrone avatar Apr 30 '19 17:04 morrone

component_id can be specified as (or defaults to) 0 for sites that don't want to use it. producer= (ProducerName in csv store data) is the string identifier you seek. For component_id, we at Sandia trivially maintain a Confluence page tabulating the component_ids site-wide across multiple admin groups.

In v4 ldms, component_id is handled in the sampler_base implementation, so revisions to component_id handling (in the set or out) can be managed easily. We could conceivably take a specification of component_id=never to cause component_id to be omitted.

baallan avatar May 01 '19 19:05 baallan

You have mentioned elsewhere that there may be multiple ldmsd running on a node in the future. In that case each ldmsd needs to have a different ProducerName, does it not? And then it no longer serves to singularly identify the component on which it is running. I suppose if the multiple ldmsd on a node could use the same ProducerName and don't need to be differentiated, that would not be a concern.

morrone avatar May 01 '19 22:05 morrone

there is no relationship among daemons on a single node. so same prodname is fine where it makes sense. ldms supports multiple instances ala systemd @.service files today.

baallan avatar May 02 '19 00:05 baallan

> All ldms plugins are currently required to honor a "component_id" option and store it in a metadata metric in all of its metric sets. The data type of component_id is currently an LDMS_V_U64.
>
> As I understand it, the intention is that this number will uniquely identify the component on which the ldmsd is running, e.g. a node. So one common use appears to be that if one has a compute cluster of 1000 nodes, each of the nodes will be assigned a unique integer which is passed to every ldms plugin and in turn stored in the component_id of each metric set on a node.

Actually, this is not the intention of the component_id. The component_id identifies the component being monitored, not who/what is monitoring that component. A node may have tens to hundreds of components being monitored: CPU, mount points, etc.

> This probably works fine when there is only a single cluster in one's center, or if every cluster in a center has its own separate monitoring system. However it begins to become an undesirable burden when scaled up.
>
> Consider a site that has 15 clusters and all of the monitoring data from each of the clusters is combined into a single central monitoring database. One might imagine that the configuration of such a system becomes a challenge in general. Now we try to add ldms to that system, and suddenly the system administrators have a new configuration burden to uniquely assign integer values to all nodes across all of the clusters, even though those clusters may be maintained by many different people at various times.

I think the assignment of component_id is a pain. Absolutely. But if you intend to associate the collected data with the entity which is being monitored, then it seems like a necessary evil.

> The sysadmins might rightfully argue that they already have a unique way to identify nodes: hostnames. The additional integer component ids just for ldms are an added configuration burden that they would not want.
>
> So I would recommend that we address this issue by changing the type of component_id from an integer to a string. People who wish to assign unique integers to all nodes can still do that, because a number can be stored in string form (and possibly converted into a real integer before insertion into the final monitoring database). But a hostname cannot be stored in an integer.

I don't think it really matters what the data type is. The problem is what the heck the key means and how you use it.

> A string component_id also accommodates other components that might not look exactly like a linux node (network switches?).

If the scope of an entity is the containing node, then this makes complete sense. I just don't think we're ready to concede that coarse level of correspondence yet.

> I would also recommend that the default value of the component_id string be the hostname on systems that support an API like gethostname(). This too is not in conflict with those who wish to use integer component_id values, because those need to be set manually anyway. But for those who wish to use hostnames, it eliminates another configuration burden.
>
> Also, it would probably be best to have the default value determined in ldmsd proper rather than requiring each plugin to individually develop methods to determine the default.
>
> For complex systems, it should always be our goal to make configuration be as simple as reasonable. It is possible that in many cases, with careful schema design, a plugin can operate with little or no configuration decisions on the part of the user.

tom95858 avatar May 08 '19 23:05 tom95858

> Actually, this is not the intention of the component_id. The component_id identifies the component being monitored, not who/what is monitoring that component. A node may have tens to hundreds of components being monitored: CPU, mount points, etc.

Ah, I see. Are those supposed to be globally unique as well? Unique within a schema type?

In any event, that would seem to make it even more of a configuration burden for the sysadmins if every single little component needs to be given an integer id in addition to, and independent from, all other normal methods of identifying components.

> I think the assignment of component_id is a pain. Absolutely. But if you intend to associate the collected data with the entity which is being monitored, then it seems like a necessary evil.

I don't think that integer-only identifiers are necessary. A string would allow each plugin to choose identifiers that are both unique and human readable, and do it dynamically without requiring human intervention to configure.

> If the scope of an entity is the containing node, then this makes complete sense. I just don't think we're ready to concede that coarse level of correspondence yet.

I think it makes equal sense for more granular components as well. A plugin that creates a metric set for each DIMM on a node might have component names that identify the node and dimm slot, e.g. "hostname14/dimm2". Then again, that might already be what is in the metric set instance name. So maybe component name is the dimm serial number string or something. Maybe something other than a well chosen instance name is not needed for most things... but maybe there are other uses of the instance name that I am not yet familiar with.

morrone avatar May 10 '19 22:05 morrone

In some snl discussions, one notion of what to do with component ID that came up was:

Define component_ids as an integer (8 bytes) with subsections as follows:

- bytes 1-2: network number, assigned by a community registry
- byte 3: cluster number, assigned by the owner
- byte 4: component type, i.e. the machine type where the daemon runs (0: compute, 1: admin, 2: login, 3: gateway, 4: top of cluster, 5-255: tbd)
- bytes 5-8: site device number, assigned by the owner (possibly derivable from hardware uuids?)
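A rough sketch of that layout in Python, to make the field widths concrete. This assumes byte 1 is the most significant byte, which the description does not specify, and the names `ComponentId`, `pack_component_id`, and `unpack_component_id` are hypothetical rather than anything in ldms.

```python
from typing import NamedTuple


class ComponentId(NamedTuple):
    network: int         # bytes 1-2, assigned by a community registry
    cluster: int         # byte 3, assigned by the owner
    component_type: int  # byte 4, e.g. 0: compute, 1: admin, 2: login
    device: int          # bytes 5-8, assigned by the owner


def pack_component_id(cid: ComponentId) -> int:
    """Pack the four fields into one unsigned 64-bit value.

    Assumption: byte 1 is the most significant byte.
    """
    limits = (0xFFFF, 0xFF, 0xFF, 0xFFFFFFFF)
    for value, limit in zip(cid, limits):
        if not 0 <= value <= limit:
            raise ValueError("component_id field out of range")
    return ((cid.network << 48) | (cid.cluster << 40)
            | (cid.component_type << 32) | cid.device)


def unpack_component_id(value: int) -> ComponentId:
    """Split a 64-bit component_id back into its four fields."""
    return ComponentId(
        network=(value >> 48) & 0xFFFF,
        cluster=(value >> 40) & 0xFF,
        component_type=(value >> 32) & 0xFF,
        device=value & 0xFFFFFFFF,
    )
```

The packed value still fits the existing unsigned 64-bit component_id, so a scheme like this would not require a type change, only agreement on who assigns each field.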

But it's fair to say that no single particularly satisfying scheme has been discovered.

But the thing I find myself in want of a lot over the last few years of analytics is to be able to tag a metric (usually a metadata metric) as a device identifier (for strings) or as a device numerator (for numbers). Or in some cases, to tag more than one metric in a schema as a device identifier, for the cases where devices have subdevices. I'm inclined toward a metadata metric naming convention to achieve this effect, since dragging along the "userdata" field on all metrics is just wasteful for most metrics (and also not human readable).

As examples:

- dev0_producername {meta, identifier} chama23
- dev0_component_id {meta, numerator} 7000012
- dev1_raid_id {meta, identifier} md0
- dev2_raid_member {meta, numerator} 2
- dev1_mem_id {meta, identifier} dimm_a

devN is the indicator of a device naming metric at the Nth sublevel.

With this, storage and analysis tools can automatically determine multipart keys from flat data. There are other ways to get the same effect (like defining a hierarchical description mapping the flat set of columns). This description would then allow a store access to both the linear schema names and a hierarchical naming that could map to json, etc.
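As a sketch of how a tool might derive a multipart key from such names, the Python below scans a flat row for the devN_ prefix. It is illustrative only: `device_key` is a hypothetical function, the row is modeled as a plain dict, the `read_bytes` metric in the usage example is made up, and the identifier/numerator tags are not modeled.

```python
import re

# Matches names like "dev0_producername" or "dev2_raid_member".
_DEV_METRIC = re.compile(r"^dev(\d+)_(.+)$")


def device_key(row: dict) -> tuple:
    """Collect (level, name, value) for every devN_ metric in a flat row.

    The result is ordered by sublevel N, then by metric name, so it can
    serve as a hierarchical key for the remaining metrics in the row.
    """
    parts = []
    for name, value in row.items():
        match = _DEV_METRIC.match(name)
        if match:
            parts.append((int(match.group(1)), match.group(2), value))
    # Sort on level and name only; values may have mixed types.
    parts.sort(key=lambda part: (part[0], part[1]))
    return tuple(parts)


# Usage with names taken from the examples in this comment.
row = {
    "dev0_producername": "chama23",
    "dev0_component_id": 7000012,
    "dev1_raid_id": "md0",
    "dev2_raid_member": 2,
    "read_bytes": 1024,  # hypothetical ordinary metric, ignored by the key
}
key = device_key(row)
```

Because the convention lives entirely in the metric names, a store needs no extra per-metric field to find the key columns.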

baallan avatar May 13 '19 15:05 baallan