Issues with Hardware Metrics semantic conventions
What are you trying to achieve?
Follow recommendations on: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/
What did you expect to see?
Consistent and implementable specification.
Additional context.
Ran across the following issues when trying to map GPU metrics to the semantic conventions...
Non-implementable / unspecified items:
- Single `firmware_version` won't work:
  - Devices have multiple types of firmware (display, media, scheduling/power management etc)
- `hw.errors` have just `hw.error.type` attribute, although:
  - Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost, at least reset needed to recover)
  - Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)
  - Errors can originate from different parts of the device HW (display, media, compute, 3d etc)
  - => I suggest adding `.category` attribute, similarly to Level-Zero spec:
    - https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-cat-t
Inconsistencies:
- `hw.gpu.power` vs. `hw.power{hw.type="gpu"}` confusion
  - If both are valid, why there's no `hw.gpu.energy` to match `hw.energy{hw.type="gpu"}`?
- Common HW `name` & `id` attributes vs. GPU `model` & `serial` attributes
  - Should all of these be provided despite overlap?
- Why `vendor` attribute is used for GPU devices, but `manufacturer` for (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/

Inconsistent attribute examples for GPU metrics missing from spec:
- `system.cpu.frequency` vs. `hw.cpu.speed`
- `.utilization` vs. `.*_ratio` [1] suffix for things like (frequency) throttling
  - whether in addition to base metric, one should provide `.utilization` / `.*_ratio`, or `.limit` value?
[1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics
PS. Summing errors is not very meaningful (rate is more interesting), but maybe an additional `all` category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-grained categories are missing, and/or in addition to them.
FYI @bertysentry
@eero-t Thank you for the feedback, give me a few hours to answer your questions. We will address all issues!
Thank you for the feedback, @eero-t!
I will try to answer your questions, and we will discuss the parts that will require an update of the conventions to cover your use cases.
> - Single `firmware_version` won't work:
>   - Devices have multiple types of firmware (display, media, scheduling/power management etc)

Unfortunately, we cannot include all cases of firmware sub-components in the specifications for metrics semantic conventions. However, I know the group is working on a different type of entity (like a Resource) where one could put these kinds of attributes describing the monitored component, without polluting the timeseries with too many attributes.
In the meantime, we recommend using the `firmware_version` attribute to put all necessary information. It's a free format string, it doesn't need to be exactly `x.y.z`. You can set this attribute to: `display 23.1; media 1.0.00; scheduling 2.13A`.
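
As a rough illustration of that workaround, here is a minimal sketch that joins per-component firmware versions into one free-form string. The helper name `format_firmware_version` and the `"; "` separator are assumptions taken from the example value above, not something the conventions prescribe.

```python
def format_firmware_version(components):
    """Join (component, version) pairs into one free-form string.

    Hypothetical helper: the "name version; name version" layout simply
    mirrors the example value given above and is not defined by the spec.
    """
    return "; ".join(f"{name} {version}" for name, version in components)


firmware_version = format_firmware_version(
    [("display", "23.1"), ("media", "1.0.00"), ("scheduling", "2.13A")]
)
print(firmware_version)
```

Because the value is free-form, a consumer has to know this layout in advance to split it back into components.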
> - `hw.errors` have just `hw.error.type` attribute, although:
>   - Errors can be correctable, uncorrectable (data lost), or fatal (functionality lost, at least reset needed to recover)
>   - Errors can originate from different parts of the SW stack (FW, kernel, userspace driver)
>   - Errors can originate from different parts of the device HW (display, media, compute, 3d etc)
>   - => I suggest adding `.category` attribute, similarly to Level-Zero spec:
>     - https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-ras-error-cat-t

The hw.error.type is a free format string that you can set to any applicable value.
> [!TIP]
> As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with `hw.error.type="correctable"` and once with `hw.error.type="display"`). Most users will first display the total number of errors as `sum(hw.error{hw.type="gpu"})` before breaking down by error type.

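
To make the double-counting concern concrete, here is a small sketch using plain Python counters in place of a real metrics backend. The `(hw.type, hw.error.type)` tuple keys and the `total_errors` helper are illustrative assumptions, not OpenTelemetry API calls.

```python
from collections import Counter


def total_errors(series):
    """Sum all time series, like a query that aggregates over hw.error.type."""
    return sum(series.values())


# One physical error (a correctable error in the display block),
# recorded under two different hw.error.type values.
double_counted = Counter()
double_counted[("gpu", "correctable")] += 1
double_counted[("gpu", "display")] += 1

# The same error recorded under exactly one hw.error.type value.
counted_once = Counter()
counted_once[("gpu", "correctable")] += 1

print(total_errors(double_counted), total_errors(counted_once))
```

The first total reports two errors although only one occurred, which is why each error should map to a single attribute value.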
> - `hw.gpu.power` vs. `hw.power{hw.type="gpu"}` confusion

This is totally an oversight. The hw.gpu.power metric must be removed from the semantic conventions. Please use hw.power{hw.type="gpu"}.
> - Why `vendor` attribute is used for GPU devices, but `manufacturer` for (host) device: https://opentelemetry.io/docs/specs/semconv/resource/device/

That's a very good point. In hardware metrics, we've preferred vendor over manufacturer, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.
We're totally open to discussion on this point. Again, this may be better implemented as part of the new "entity" thing (I can't retrieve the link to the proposed spec for this).
> - `system.cpu.frequency` vs. `hw.cpu.speed`

Are you suggesting we add hw.gpu.speed? It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)
WRT frequency or speed, I agree we should use the same terminology everywhere and use the term that the industry commonly uses.
> - `.utilization` vs. `.*_ratio` [1] suffix for things like (frequency) throttling
>   - whether in addition to base metric, one should provide `.utilization` / `.*_ratio`, or `.limit` value?
>
> [1] Used e.g. in: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/#hwfan---fan-metrics

In OpenTelemetry, *.utilization metrics are usually used in conjunction with corresponding *.usage and *.limit. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case hw.fan.speed.utilization was really improper.
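
A minimal sketch of how the three values relate, assuming utilization is the ratio of usage to limit (that formula and the `utilization` helper are assumptions for illustration). It also shows why the ratio cannot be derived when a device reports neither value:

```python
from typing import Optional


def utilization(usage: Optional[float], limit: Optional[float]) -> Optional[float]:
    """Return usage / limit as a fraction, or None when it cannot be derived.

    Hypothetical helper: a device that does not report its current value
    or its maximum (as some fans do not) yields None.
    """
    if usage is None or limit is None or limit <= 0:
        return None
    return usage / limit


print(utilization(1200.0, 4800.0))  # both values known
print(utilization(None, 4800.0))    # current value not reported
```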
> PS. Summing errors is not very meaningful (rate is more interesting), but maybe additional `all` category could be provided just for indicating whether there are any errors (within query period) from given HW? It could be useful both when more fine-grained categories are missing, and in addition to them.

For proper aggregation and rate calculation, it is important (when possible) to store the total number of errors and then let the timeseries database calculate the rate for you on a period of time chosen at query time.
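
As a sketch of that idea, the function below derives a rate from stored cumulative totals for whatever window is chosen at query time. The `error_rate` name and the sample data are made up for illustration, and counter resets are deliberately not handled.

```python
def error_rate(samples, window_start, window_end):
    """Average errors per second over [window_start, window_end].

    samples: list of (timestamp_seconds, cumulative_error_count), sorted by time.
    Returns None when fewer than two samples fall inside the window.
    Assumption: the counter never resets within the window.
    """
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    if t1 == t0:
        return None
    return (v1 - v0) / (t1 - t0)


# Illustrative cumulative totals scraped once a minute.
samples = [(0, 10), (60, 10), (120, 13), (180, 19)]

print(error_rate(samples, 0, 180))   # rate over the whole period
print(error_rate(samples, 60, 180))  # rate over a shorter window, same data
```

Both windows are answered from the same stored totals, which would not be possible if only a precomputed rate had been stored.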
Thank you again for the thorough review. We will start by fixing the hw.gpu.power which is not supposed to be here. For other suggestions, we will need the input from the rest of the team:
- `system.cpu.frequency` vs. `hw.cpu.speed`
- `vendor` vs `manufacturer`
- Add `hw.gpu.speed`?
> In the meantime, we recommend using the `firmware_version` attribute to put all necessary information. It's a free format string, it doesn't need to be exactly `x.y.z`. You set this attribute to: `display 23.1; media 1.0.00; scheduling 2.13A`.

In that case its name should be changed to either `firmware_versions` or e.g. just `firmware`, to indicate that its value is not semver. The description should be updated accordingly ("free-form list of FW types & their versions").
> The `hw.error.type` is a free format string that you can set to any applicable value.

Not really the answer I was expecting... :-/
> As this is an UpDownCounter metric, we recommend to make sure that one error is not counted several times (e.g. once with `hw.error.type="correctable"` and once with `hw.error.type="display"`). Most users will first display the total number of errors as `sum(hw.error{hw.type="gpu"})` before breaking down by error type.

Good to know, thanks!
> The `hw.gpu.power` metric must be removed from the semantic conventions. Please use `hw.power{hw.type="gpu"}`.

OK.
> In hardware metrics, we've preferred `vendor` over `manufacturer`, because it's easier for the end-user to identify the vendor. Example: the vendor of a disk may be "Dell-EMC", while the manufacturer is "WD". Both have value, but in case of a failure, the end user is more likely to contact the vendor rather than the real manufacturer.

I see, there are cases where it would be better to use both. However, if vendor and manufacturer are the same (which is also often the case), it's inconsistent that "vendor" should be used for some things, and "manufacturer" for others.
IMHO the spec could specify both attributes for everything, and state that either one can be dropped if it's not relevant.
Btw. Sysman API spec lists brand & vendor instead of vendor & manufacturer: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-device-properties-t
> - `system.cpu.frequency` vs. `hw.cpu.speed`
>
> Are you suggesting we add `hw.gpu.speed`?

No, that sounds really odd. I think it should be `hw.gpu.frequency`.
"Speed" could be interpreted e.g. to be FLOPS, whereas I think the base unit for "frequency" is clear (Hz).
> It's probably a good idea! What speed would we be reporting? (per core, memory, etc.)

The OneAPI Level-Zero Sysman API specification lists the following potential metrics for devices like GPUs (and FPGAs etc):
- Engine utilization: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-engine-stats-t
  - For large number of engine types: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-engine-group-t
- Fabric throughput: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-fabric-port-state-t
- Fan speed: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-fan-speed-t
- Frequency: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-freq-state-t
  - For GPU, memory and media: https://spec.oneapi.io/level-zero/latest/sysman/api.html#frequency-enums
- Memory usage: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-mem-state-t
  - Either system (shared with CPU) or device local, and with large number of different types
    - https://spec.oneapi.io/level-zero/latest/sysman/api.html#memory-enums
- Memory bandwidth: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-mem-bandwidth-t
- Energy/power usage: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-power-energy-counter-t
  - With separate limits for peak, burst, and sustained usage (device specific time intervals)
- PSU current, temperature etc: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zes-psu-state-t
- Error counters of different types and categories: https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras-enums
- Temperature: https://spec.oneapi.io/level-zero/latest/sysman/api.html#zestemperaturegetstate
> In OpenTelemetry, `*.utilization` metrics are usually used in conjunction with corresponding `*.usage` and `*.limit`. In the case of fan speeds, some fans don't indicate their real speed in rpm and their maximum speed. Also in this case `hw.fan.speed.utilization` was really improper.

Sounds reasonable.
(I need to think how it applies in my case as some of the metrics are viewed also directly, not just through OpenTelemetry tooling / queries etc.)
> > The `hw.error.type` is a free format string that you can set to any applicable value.
>
> Not really the answer I was expecting... :-/

@bertysentry I mean, different categories of errors require different responses, and from different people:
- User-space (driver) errors: automatically blacklist given deployment / image version => until service developer updates it
- Kernel & FW errors: Taint device, drain node, update kernel/FW and reboot => cluster admin can do it remotely
- HW & some FW errors: Taint/drain node, replace HW, power up again => needs HW admin local access
Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.
Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).
And please comment also on this:
> - Common HW `name` & `id` attributes vs. GPU `model` & `serial` attributes
>   - Should all of these be provided despite overlap?

> > The `hw.error.type` is a free format string that you can set to any applicable value.
>
> @bertysentry I mean, different categories of errors require different responses, and from different people:
> - User-space (driver) errors: automatically blacklist given deployment / image version => until service developer updates it
> - Kernel & FW errors: Taint device, drain node, update kernel/FW and reboot => cluster admin can do it remotely
> - HW & some FW errors: Taint/drain node, replace HW, power up again => needs HW admin local access
>
> Semantics for such differences should be explicitly specified, and separate from whether given error is correctable or uncorrectable (i.e. what's the severity of given error category), not included in some "freeform" string.
>
> Otherwise one cannot group the errors meaningfully over different device types (are there e.g. errors that require on-site admin presence).

@eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason for the `hw.error.type` attribute: to cover different types (or categories) of errors. Error types are not just "correctable" or "non-correctable"; these are mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.
Actually, we may change hw.error.type to simply error.type in the future. error.type is the only required attribute for errors, in general, in Otel semantic conventions.
> - Common HW `name` & `id` attributes vs. GPU `model` & `serial` attributes
>   - Should all of these be provided despite overlap?

- `serial_number` and `id` are not necessarily the same. If they are the same in your instrumentation, you can drop `serial_number`, as it's not a required attribute (only "recommended")
- Same for `name` and `model`: you can drop `model` if it's always the same as the `name` attribute in your instrumentation. Simply make sure to always follow the same logic for all devices.
> @eero-t Don't worry, we totally understand there are different types of errors that need different responses. That's the reason of the `hw.error.type` attribute, to cover different types (or categories) of errors. Error types are not just "correctable" or "non-correctable", these are just mere examples of error types for GPUs. The semantic conventions allow you to extend to any type or category of errors as required. This allows the instrumentation to be exhaustive, while the spec remains flexible and future-proof.

@bertysentry My point was that those items are semantically orthogonal and should therefore have separate attributes in the semantics spec. The current spec seems to be confused about the role of the error `.type` attribute(s), which I think is also a good indication that it should be split into multiple ones.
For example:
- `.severity`: correctable, uncorrectable (data loss), fatal (functionality loss)
- `.type`: type / category of the error; timeout, memory-parity-error, hw-programming-error etc
- `.location` / `.origin` / `.cause` / `.whatever`: e.g. http-endpoint, display (part of GPU with some memory), compute (GPU pipeline setting)
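
A minimal sketch of what separate attributes would enable, using a plain `Counter` rather than a real metrics backend. The field names `severity`, `type` and `origin` follow the list above; the counts and the `group_by` helper are made up for illustration.

```python
from collections import Counter

# Attribute names proposed above; "origin" stands in for the undecided
# .location / .origin / .cause name.
FIELDS = ("severity", "type", "origin")

# Illustrative counts keyed by (severity, type, origin).
errors = Counter({
    ("correctable", "memory-parity-error", "display"): 4,
    ("fatal", "timeout", "compute"): 1,
    ("uncorrectable", "memory-parity-error", "compute"): 2,
})


def group_by(counts, field):
    """Aggregate error counts over one attribute, ignoring the others."""
    idx = FIELDS.index(field)
    grouped = Counter()
    for key, n in counts.items():
        grouped[key[idx]] += n
    return dict(grouped)


print(group_by(errors, "severity"))
print(group_by(errors, "origin"))
```

Each error is stored once, so the total stays the same whichever attribute is used for the breakdown, which a single free-form string cannot guarantee.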
> Actually, we may change `hw.error.type` to simply `error.type` in the future.

Yes, please! Using already existing, shorter common ones makes both the spec and the metrics output compliant with it more readable.
> `error.type` is the only required attribute for errors, in general, in Otel semantic conventions.

Hm. Instead of stuffing semantically different types of error attributes as free-form text into a single `.type` attribute (your recommendation), this page recommends to:
- Use a domain-specific attribute
- Set `error.type` to capture all errors, regardless of whether they are defined within the domain-specific set or not.
?
Apart from the above, rpm for fan speed is also not a proper UCUM unit expression. There is some discussion on what the proper expression should be, and I think the candidates are /min (or annotated as {rev}/min, {revolution}/min, or {rotation}/min) and circ/min.