Dapr OpenTelemetry Semantic Convention and Metric Naming
In what area(s)?
/area runtime
Describe the feature
Keep only the trace SDK and the metric SDK from the OpenTelemetry API, and standardize observability reporting.
First, we need to unify the semantics of observability.
OpenTelemetry Semantic Convention
current name | proposed name | value | description |
---|---|---|---|
app_id | service.name | - | custom value |
namespace | service.namespace | - | custom value, such as default or dapr-system |
OpenTelemetry Dapr Extension Semantic Convention
current name | proposed name | value | description |
---|---|---|---|
process_status | dapr.component.pubsub.process_status | SUCCESS;RETRY;DROP | pub/sub component message return status |
success | dapr.component.success | true;false | component handler return status |
topic | dapr.component.pubsub.topic | custom | pub/sub component topic |
componentName | dapr.component.name | custom | component name |
component | dapr.component.type | state;binding;configuration;pubsub | component type |
operation | dapr.component.operation | custom | component operation |
- | rpc.method | - | HTTP method: GET, POST, ... |
dapr.api | rpc.api | - | RPC={RPCName} |
- | rpc.type | Server;Client | RPC type |
- | rpc.status_code | Unset;Ok;Error | RPC status, for both gRPC and HTTP |
reason | dapr.reason | custom | failure reason for component init, mTLS, etc. |
trustDomain | dapr.trust_domain | - | - |
policyAction | dapr.policy_action | true;false | request policy (ACL, RBAC) |
dapr.invoke_method | dapr.invoke_method | custom | service invocation method |
dapr.status_code | dapr.status_code | defined by the protocol itself | Dapr RPC response status |
dapr.protocol | dapr.protocol | http;grpc | RPC protocol in the Dapr ecosystem |
grpc_server_method | Deprecated | - | grpc proto API |
grpc_server_status | Deprecated | - | grpc status code |
grpc_client_method | Deprecated | - | grpc proto API |
grpc_client_status | Deprecated | - | - |
name | dapr.resiliency.name | custom value | resiliency name |
policy | dapr.resiliency.policy | circuitbreaker, retry, timeout | resiliency policy |
actor_type | dapr.actor.type | - | actor type |
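To make the renaming concrete, here is a minimal sketch of how the two tables above could be expressed as a lookup in Go. The helper `RenameLabel` and both maps are hypothetical names introduced only for illustration; they are not part of Dapr, and only a subset of the rows is listed.

```go
package main

import "fmt"

// labelRenames maps a subset of the current label names to the attribute
// names proposed in the tables above. The map itself is illustrative.
var labelRenames = map[string]string{
	"app_id":         "service.name",
	"namespace":      "service.namespace",
	"process_status": "dapr.component.pubsub.process_status",
	"success":        "dapr.component.success",
	"topic":          "dapr.component.pubsub.topic",
	"componentName":  "dapr.component.name",
	"component":      "dapr.component.type",
	"operation":      "dapr.component.operation",
	"trustDomain":    "dapr.trust_domain",
	"policyAction":   "dapr.policy_action",
	"actor_type":     "dapr.actor.type",
}

// deprecatedLabels lists the labels the proposal marks as Deprecated.
var deprecatedLabels = map[string]bool{
	"grpc_server_method": true,
	"grpc_server_status": true,
	"grpc_client_method": true,
	"grpc_client_status": true,
}

// RenameLabel (hypothetical helper) returns the proposed attribute name for
// a current label. ok is false for deprecated or unknown labels.
func RenameLabel(current string) (name string, ok bool) {
	if deprecatedLabels[current] {
		return "", false
	}
	name, ok = labelRenames[current]
	return name, ok
}

func main() {
	for _, label := range []string{"app_id", "topic", "grpc_server_method"} {
		name, ok := RenameLabel(label)
		fmt.Printf("%s -> %q (kept=%v)\n", label, name, ok)
	}
}
```

A lookup like this could be applied once at the point where labels are attached, so the individual call sites would not need to change.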
Metrics
HTTP
current name | proposed name | type | description |
---|---|---|---|
http_server_request_count | dapr_http_server_request_count | counter(int64) | Number of HTTP requests started in server. |
http_server_request_bytes | dapr_http_server_request_bytes | histogram(int64) | HTTP request body size if set as ContentLength (uncompressed) in server. |
http_server_response_bytes | dapr_http_server_response_bytes | histogram(int64) | HTTP response body size (uncompressed) in server. |
http_server_latency | dapr_http_server_latency | histogram(float64) | HTTP request end to end latency in server. |
http_server_response_count | dapr_http_server_response_count | counter(int64) | The number of HTTP responses. |
http_client_sent_bytes | dapr_http_client_sent_bytes | histogram(int64) | Total bytes sent in request body (not including headers). |
http_client_received_bytes | dapr_http_client_received_bytes | histogram(int64) | Total bytes received in response bodies (not including headers but including error responses with bodies). |
http_client_roundtrip_latency | dapr_http_client_roundtrip_latency | histogram(float64) | Time between first byte of request headers sent to last byte of response received, or terminal error. |
http_client_completed_count | dapr_http_client_completed_count | counter(int64) | Count of completed requests. |
http_healthprobes_completed_count | dapr_http_healthprobes_completed_count | counter(int64) | Count of completed health probes. |
http_healthprobes_roundtrip_latency | dapr_http_healthprobes_roundtrip_latency | histogram(float64) | Time between first byte of health probe headers sent to last byte of response received, or terminal error. |
gRPC
current name | proposed name | type | description |
---|---|---|---|
grpc_io_server_received_bytes_per_rpc | dapr_grpc_io_server_received_bytes_per_rpc | histogram(int64) | Total bytes received across all messages per RPC. |
grpc_io_server_sent_bytes_per_rpc | dapr_grpc_io_server_sent_bytes_per_rpc | histogram(int64) | Total bytes sent across all response messages per RPC. |
grpc_io_server_server_latency | dapr_grpc_io_server_server_latency | histogram(float64) | Time between first byte of request received to last byte of response sent, or terminal error. |
grpc_io_server_completed_rpcs | dapr_grpc_io_server_completed_rpcs | counter(int64) | Count of RPCs by method and status. |
grpc_io_client_sent_bytes_per_rpc | dapr_grpc_io_client_sent_bytes_per_rpc | histogram(int64) | Total bytes sent across all request messages per RPC. |
grpc_io_client_received_bytes_per_rpc | dapr_grpc_io_client_received_bytes_per_rpc | histogram(int64) | Total bytes received across all response messages per RPC. |
grpc_io_client_roundtrip_latency | dapr_grpc_io_client_roundtrip_latency | histogram(float64) | Time between first byte of request sent to last byte of response received, or terminal error. |
grpc_io_client_completed_rpcs | dapr_grpc_io_client_completed_rpcs | counter(int64) | Count of RPCs by method and status. |
Components
current name | proposed name | type | description |
---|---|---|---|
component_pubsub_ingress_count | dapr_component_pubsub_ingress_count | counter(int64) | The number of incoming messages arriving from the pub/sub component. |
component_pubsub_ingress_latencies | dapr_component_pubsub_ingress_latencies | histogram(float64) | The consuming app event processing latency. |
component_pubsub_egress_count | dapr_component_pubsub_egress_count | counter(int64) | The number of outgoing messages published to the pub/sub component. |
component_pubsub_egress_latencies | dapr_component_pubsub_egress_latencies | histogram(float64) | The latency of the response from the pub/sub component. |
component_input_binding_count | dapr_component_input_binding_count | counter(int64) | The number of incoming events arriving from the input binding component. |
component_input_binding_latencies | dapr_component_input_binding_latencies | histogram(float64) | The triggered app event processing latency. |
component_output_binding_count | dapr_component_output_binding_count | counter(int64) | The number of operations invoked on the output binding component. |
component_output_binding_latencies | dapr_component_output_binding_latencies | histogram(float64) | The latency of the response from the output binding component. |
component_state_count | dapr_component_state_count | counter(int64) | The number of operations performed on the state component. |
component_state_latencies | dapr_component_state_latencies | histogram(float64) | The latency of the response from the state component. |
component_configuration_count | dapr_component_configuration_count | counter(int64) | The number of operations performed on the configuration component. |
component_configuration_latencies | dapr_component_configuration_latencies | histogram(float64) | The latency of the response from the configuration component. |
component_secret_count | dapr_component_secret_count | counter(int64) | The number of operations performed on the secret component. |
component_secret_latencies | dapr_component_secret_latencies | histogram(float64) | The latency of the response from the secret component. |
Runtime
current name | proposed name | type | description |
---|---|---|---|
runtime_component_loaded | dapr_runtime_component_loaded | counter(int64) | The number of successfully loaded components. |
runtime_component_init_total | dapr_runtime_component_init_total | counter(int64) | The number of initialized components. |
runtime_component_init_fail_total | dapr_runtime_component_init_fail_total | counter(int64) | The number of component initialization failures. |
runtime_mtls_init_total | dapr_runtime_mtls_init_total | counter(int64) | The number of successful mTLS authenticator initializations. |
runtime_mtls_init_fail_total | dapr_runtime_mtls_init_fail_total | counter(int64) | The number of mTLS authenticator init failures. |
runtime_mtls_workload_cert_rotated_total | dapr_runtime_mtls_workload_cert_rotated_total | counter(int64) | The number of successful workload certificate rotations. |
runtime_mtls_workload_cert_rotated_fail_total | dapr_runtime_mtls_workload_cert_rotated_fail_total | counter(int64) | The number of failed workload certificate rotations. |
runtime_actor_status_report_total | dapr_runtime_actor_status_report_total | counter(int64) | The number of successful status reports to the placement service. |
runtime_actor_status_report_fail_total | dapr_runtime_actor_status_report_fail_total | counter(int64) | The number of the failed status reports to placement service. |
runtime_actor_table_operation_recv_total | dapr_runtime_actor_table_operation_recv_total | counter(int64) | The number of the received actor placement table operations. |
runtime_actor_rebalanced_total | dapr_runtime_actor_rebalanced_total | counter(int64) | The number of the actor rebalance requests. |
runtime_actor_deactivated_total | dapr_runtime_actor_deactivated_total | counter(int64) | The number of successful actor deactivations. |
runtime_actor_pending_actor_calls | dapr_runtime_actor_pending_actor_calls | counter(int64) | The number of pending actor calls waiting to acquire the per-actor lock. |
runtime_acl_app_policy_action_allowed_total | dapr_runtime_acl_app_policy_action_allowed_total | counter(int64) | The number of requests allowed by the app specific action specified in the access control policy. |
runtime_acl_global_policy_action_allowed_total | dapr_runtime_acl_global_policy_action_allowed_total | counter(int64) | The number of requests allowed by the global action specified in the access control policy. |
runtime_acl_app_policy_action_blocked_total | dapr_runtime_acl_app_policy_action_blocked_total | counter(int64) | The number of requests blocked by the app specific action specified in the access control policy. |
runtime_acl_global_policy_action_blocked_total | dapr_runtime_acl_global_policy_action_blocked_total | counter(int64) | The number of requests blocked by the global action specified in the access control policy. |
Resiliency
current name | proposed name | type | description |
---|---|---|---|
resiliency_loaded | dapr_resiliency_loaded | counter(int64) | Number of resiliency policies loaded. |
resiliency_count | dapr_resiliency_count | counter(int64) | Number of times a resiliency policyKey has been executed. |
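Every row in the metric tables above follows the same rule: the proposed name is the current name with a `dapr_` prefix. A minimal sketch of that rule in Go follows; `PrefixedName` and `metricPrefix` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// metricPrefix is the prefix the proposal adds to every metric name.
const metricPrefix = "dapr_"

// PrefixedName (hypothetical helper) returns the proposed metric name for a
// current one. It is idempotent: already-prefixed names are left unchanged.
func PrefixedName(current string) string {
	if strings.HasPrefix(current, metricPrefix) {
		return current
	}
	return metricPrefix + current
}

func main() {
	for _, name := range []string{
		"http_server_request_count",
		"grpc_io_client_roundtrip_latency",
		"resiliency_count",
	} {
		fmt.Println(name, "->", PrefixedName(name))
	}
}
```

Making the helper idempotent means it could be applied at instrument creation without risk of producing a doubled prefix.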
OpenTelemetry Init Configuration
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: configuration
  namespace: default
spec:
  metric:
    enabled: true
    exporterAddress: "9.xxx.xxx.x8:4318"
  tracing:
    enabled: true
    samplingRate: 1
    exporterAddress: "9.xxx.xxx.x8:4317"
  log:
    file_enabled: false
    log_level: 0
    trace_enabled: true
    encode_logs_as_json: true
  nameresolution:
    nr_component_name: "nameresolution-polaris"
    service: "dapr.gbot.sqqgroup"
    namespace: "Production"
    metadata:
      token: "77001b56xxxxxeb69ab0431836e9be"
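The `metric` and `tracing` blocks above share the same two fields, `enabled` and `exporterAddress`. As a sketch of how the runtime might validate them before creating an exporter, the snippet below checks that an address is a well-formed `host:port`. The `ExporterSpec` type and its `Validate` method are assumptions made for illustration and do not exist in Dapr; an empty address is accepted because the init code in this proposal falls back to a default in that case.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// ExporterSpec is a hypothetical mirror of the metric/tracing blocks in the
// Configuration above.
type ExporterSpec struct {
	Enabled         bool
	ExporterAddress string
}

// Validate checks that an enabled exporter has a well-formed host:port
// address. Disabled exporters and empty addresses are accepted.
func (e ExporterSpec) Validate() error {
	if !e.Enabled || e.ExporterAddress == "" {
		return nil
	}
	host, port, err := net.SplitHostPort(e.ExporterAddress)
	if err != nil {
		return fmt.Errorf("exporterAddress %q: %w", e.ExporterAddress, err)
	}
	if host == "" {
		return fmt.Errorf("exporterAddress %q: empty host", e.ExporterAddress)
	}
	p, err := strconv.Atoi(port)
	if err != nil || p < 1 || p > 65535 {
		return fmt.Errorf("exporterAddress %q: invalid port %q", e.ExporterAddress, port)
	}
	return nil
}

func main() {
	specs := []ExporterSpec{
		{Enabled: true, ExporterAddress: "localhost:4318"},
		{Enabled: true, ExporterAddress: "localhost"},
		{Enabled: false, ExporterAddress: "ignored"},
	}
	for _, s := range specs {
		fmt.Printf("%+v -> %v\n", s, s.Validate())
	}
}
```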
Metric Init
package diagnostics

import (
	"context"
	"time"

	"github.com/pkg/errors"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/metric/global"
	"go.opentelemetry.io/otel/sdk/metric/aggregator/histogram"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.9.0"
)

var (
	// DefaultReportingPeriod is the default view reporting period.
	DefaultReportingPeriod = 60 * time.Second
	// DefaultMonitoring holds service monitoring metrics definitions.
	DefaultMonitoring *serviceMetrics
	// DefaultGRPCMonitoring holds default gRPC monitoring handlers and middlewares.
	DefaultGRPCMonitoring *grpcMetrics
	// DefaultHTTPMonitoring holds default HTTP monitoring handlers and middlewares.
	DefaultHTTPMonitoring *httpMetrics
	// DefaultComponentMonitoring holds component specific metrics.
	DefaultComponentMonitoring *componentMetrics
	// DefaultTRPCMonitoring holds default tRPC monitoring handlers and middlewares.
	DefaultTRPCMonitoring *trpcMetrics
)

// MetricClient is a metric client.
type MetricClient struct {
	AppID     string
	Namespace string
	// Address is the collector receiver address.
	Address string

	meter    metric.Meter
	pusher   *controller.Controller
	exporter *otlpmetric.Exporter
}

// InitMetrics initializes metrics.
func InitMetrics(address, appID, namespace string) (*MetricClient, error) {
	if address == "" {
		address = defaultMetricExporterAddr
	}
	client := &MetricClient{
		AppID:     appID,
		Namespace: namespace,
		Address:   address,
	}
	if err := client.init(); err != nil {
		return nil, err
	}

	DefaultMonitoring = client.newServiceMetrics()
	DefaultGRPCMonitoring = client.newGRPCMetrics()
	DefaultHTTPMonitoring = client.newHTTPMetrics()
	DefaultTRPCMonitoring = client.newTRPCMetrics()
	DefaultComponentMonitoring = client.newComponentMetrics()
	return client, nil
}

// ::TODO https://github.com/open-telemetry/opentelemetry-collector/issues/5238
func (m *MetricClient) init() error {
	var err error
	ctx := context.Background()

	// Export metrics over OTLP/HTTP without TLS.
	client := otlpmetrichttp.NewClient(
		otlpmetrichttp.WithInsecure(),
		otlpmetrichttp.WithEndpoint(m.Address))
	m.exporter, err = otlpmetric.New(ctx, client)
	if err != nil {
		return errors.Errorf("failed to create the collector exporter: %v", err)
	}

	// Attach service.name and service.namespace to every exported metric.
	res, err := resource.New(ctx,
		resource.WithAttributes(
			semconv.ServiceNameKey.String(m.AppID),
			semconv.ServiceNamespaceKey.String(m.Namespace),
		),
	)
	if err != nil {
		return errors.Errorf("failed to create the metric resource: %v", err)
	}

	// ::TODO https://github.com/open-telemetry/opentelemetry-go/issues/2678
	// fake boundary
	bounds := []float64{5, 10, 50, 100, 150, 200, 250, 300, 350, 400, 500, 600, 700,
		800, 900, 1000}
	m.pusher = controller.New(
		processor.NewFactory(
			simple.NewWithHistogramDistribution(
				histogram.WithExplicitBoundaries(bounds)),
			m.exporter,
		),
		controller.WithExporter(m.exporter),
		controller.WithCollectPeriod(DefaultReportingPeriod),
		controller.WithCollectTimeout(30*time.Second),
		controller.WithPushTimeout(30*time.Second),
		controller.WithResource(res))
	global.SetMeterProvider(m.pusher)

	// Use a single global meter rather than multiple meters.
	m.meter = global.Meter("mecha",
		metric.WithInstrumentationVersion("v0.27.0"),
		metric.WithSchemaURL("go.opentelemetry.io/otel/metric"))

	if err := m.pusher.Start(ctx); err != nil {
		return errors.Errorf("could not start metric controller: %v", err)
	}
	return nil
}

// Close stops the metric controller and shuts down the exporter.
// Shutdown errors are reported through the global OpenTelemetry error handler.
func (m *MetricClient) Close() error {
	if m == nil {
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := m.pusher.Stop(ctx); err != nil {
		otel.Handle(err)
	}
	if err := m.exporter.Shutdown(ctx); err != nil {
		otel.Handle(err)
	}
	return nil
}
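The `bounds` slice in `init` fixes the histogram buckets used for every histogram instrument. As an illustration of what explicit boundaries mean, the sketch below assigns a measurement to a bucket using only the standard library. It assumes the upper-inclusive convention of the OTLP data model, where bucket `i` covers `(bounds[i-1], bounds[i]]` and the last bucket is the overflow; the SDK version used above may treat boundary values differently. `BucketIndex` is a hypothetical helper, and treating the values as milliseconds is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// bounds copies the explicit boundaries from the metric init code above.
var bounds = []float64{5, 10, 50, 100, 150, 200, 250, 300, 350, 400, 500, 600, 700,
	800, 900, 1000}

// BucketIndex (hypothetical helper) returns the bucket a value falls into.
// Bucket i covers (bounds[i-1], bounds[i]]; index len(bounds) is the
// overflow bucket for values greater than the last boundary.
func BucketIndex(bounds []float64, v float64) int {
	// SearchFloat64s returns the smallest i with bounds[i] >= v.
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	counts := make([]uint64, len(bounds)+1)
	for _, latency := range []float64{3, 5, 7, 120, 1000, 2500} {
		counts[BucketIndex(bounds, latency)]++
	}
	fmt.Println(counts)
}
```

Because the same boundaries apply to latency and byte-size histograms alike, values above 1000 all land in the overflow bucket, which is presumably why the source comment calls the boundaries fake.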
Trace Init
package diagnostics

import (
	"context"
	"time"

	"github.com/pkg/errors"
	// We currently don't depend on the Otel SDK since it has not GAed.
	// This package, however, only contains the conventions from the Otel Spec,
	// which we do depend on.
	itrace "github.com/dapr/dapr/pkg/diagnostics/sdk/trace"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	//"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.9.0"
	apitrace "go.opentelemetry.io/otel/trace"
)

const (
	// daprInternalSpanAttrPrefix is the internal span attribution prefix.
	// Middleware will not populate it if the span key starts with this prefix.
	daprInternalSpanAttrPrefix = "__dapr."
	// daprAPISpanNameInternalKey is the internal attribution, but not populated
	// to span attribution.
	daprAPISpanNameInternalKey = attribute.Key(daprInternalSpanAttrPrefix + "spanname")

	daprRPCServiceInvocationService = "ServiceInvocation"
	daprRPCDaprService              = "Dapr"

	// Traces are exported over OTLP/gRPC (default port 4317);
	// metrics are exported over OTLP/HTTP (default port 4318).
	defaultTracingExporterAddr = "localhost:4317"
	defaultMetricExporterAddr  = "localhost:4318"
)

var defaultTracer apitrace.Tracer = apitrace.NewNoopTracerProvider().Tracer("")

// StartInternalCallbackSpan starts trace span for internal callback such as input bindings and pubsub subscription.
func StartInternalCallbackSpan(ctx context.Context, spanName string, parent apitrace.SpanContext) (context.Context, apitrace.Span) {
	ctx = apitrace.ContextWithRemoteSpanContext(ctx, parent)
	return defaultTracer.Start(ctx, spanName,
		apitrace.WithSpanKind(apitrace.SpanKindServer))
}

// TracingClient is a tracing client.
type TracingClient struct {
	AppID     string
	Namespace string
	// Address is the collector receiver address.
	Address string
	Token   string

	sampleRatio float64
	exporter    *otlptrace.Exporter
	provider    *sdktrace.TracerProvider
}

// InitTracing initializes the tracing client.
func InitTracing(address, token, appID, namespace string, sampleRatio float64) (*TracingClient, error) {
	if address == "" {
		address = defaultTracingExporterAddr
	}
	client := &TracingClient{
		Address:     address,
		Token:       token,
		AppID:       appID,
		Namespace:   namespace,
		sampleRatio: sampleRatio,
	}
	if err := client.init(); err != nil {
		return nil, err
	}
	return client, nil
}

func (t *TracingClient) init() error {
	var err error

	// Export spans over OTLP/gRPC without TLS.
	client := otlptracegrpc.NewClient(
		otlptracegrpc.WithEndpoint(t.Address),
		otlptracegrpc.WithInsecure(),
	)
	t.exporter, err = otlptrace.New(context.Background(), client)
	if err != nil {
		return errors.Errorf("creating OTLP trace exporter: %v", err)
	}

	ssp := sdktrace.NewBatchSpanProcessor(t.exporter,
		sdktrace.WithBatchTimeout(5*time.Second))
	t.provider = sdktrace.NewTracerProvider(
		sdktrace.WithSpanProcessor(ssp),
		sdktrace.WithSampler(itrace.TraceIDBasedParentAndRatio(t.sampleRatio)),
		sdktrace.WithResource(
			resource.NewWithAttributes(
				semconv.SchemaURL,
				semconv.ServiceNameKey.String(t.AppID),
				semconv.ServiceNamespaceKey.String(t.Namespace),
				attribute.String("token", t.Token),
			)),
	)
	otel.SetTracerProvider(t.provider)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{}))

	defaultTracer = otel.GetTracerProvider().Tracer(
		"mecha",
		apitrace.WithInstrumentationVersion("v0.27.0"),
		apitrace.WithSchemaURL("go.opentelemetry.io/otel/tracing"),
	)
	return nil
}

// Close shuts down the tracer provider and then the exporter.
func (t *TracingClient) Close() error {
	if t == nil {
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := t.provider.Shutdown(ctx); err != nil {
		return err
	}
	if err := t.exporter.Shutdown(ctx); err != nil {
		return err
	}
	return nil
}
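The `samplingRate` setting reaches the tracer provider as `sampleRatio`. The source does not show how `itrace.TraceIDBasedParentAndRatio` decides, so the following is only a sketch of the general trace-ID ratio technique, modeled on how the OpenTelemetry Go SDK's ratio sampler works as far as I understand it. The parent-based part is left out, and `ShouldSample` is a hypothetical name.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ShouldSample (hypothetical helper) reports whether a trace ID falls inside
// the sampled fraction. The decision depends only on the trace ID, so every
// service that sees the same trace with the same ratio decides identically.
func ShouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Map the ratio onto the 63-bit range of the value derived below.
	upperBound := uint64(ratio * (1 << 63))
	// Use the low 8 bytes of the trace ID, dropping one bit so the value
	// fits in 63 bits.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	var low, high [16]byte
	for i := range high {
		high[i] = 0xFF
	}
	fmt.Println(ShouldSample(low, 0.5), ShouldSample(high, 0.5))
}
```

With `samplingRate: 1` as in the configuration above, a sampler of this kind would keep every trace.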
Looks good overall, cc @yaron2 and @artursouza for more feedback.
This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.