opentelemetry-collector-contrib

Prometheus receiver misses some metrics

Open peachisai opened this issue 1 year ago • 11 comments

Component(s)

cmd/otelcontribcol

What happened?

Description

When I use the prometheus receiver to scrape metrics, I found that it misses some of them, even though it scrapes other metrics with a similar structure.

Steps to Reproduce

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nacos-monitoring"
          scrape_interval: 30s
          metrics_path: "/nacos/actuator/prometheus"
          static_configs:
            - targets: ['127.0.0.1:8848']
          relabel_configs:
            - source_labels: [ ]
              target_label: cluster
              replacement: nacos-cluster
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: node
              replacement: $$1

processors:
  batch:

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers:
        - prometheus
      processors:
        - batch
      exporters:
        - debug
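For comparison, the raw exposition can be fetched straight from the target (host and path taken from the config above) and diffed against what the debug exporter prints:

curl -s http://127.0.0.1:8848/nacos/actuator/prometheus | grep '^nacos_monitor'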

Expected Result

Original data:

nacos_monitor{module="naming",name="serviceCount",} 0.0
nacos_monitor{module="naming",name="ipCount",} 0.0

Actual Result

Only ipCount is received:

NumberDataPoints #7
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(ipCount)
     -> node: Str(43.139.166.178:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-19 02:48:13.416 +0000 UTC
Value: 0.000000

Collector version

v0.107.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

peachisai avatar Aug 19 '24 03:08 peachisai

Pinging code owners for receiver/prometheus: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Aug 19 '24 19:08 github-actions[bot]

Do you see anything in the logs?

Can you enable debug logging, and let us know if there are any scrape failures, etc?

can you share the full scrape response for that metric?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?
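For reference, debug-level logging for the collector itself can be enabled through its telemetry settings, something like this (a minimal sketch):

service:
  telemetry:
    logs:
      level: debug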

dashpole avatar Aug 19 '24 20:08 dashpole

Do you see anything in the logs?

Can you enable debug logging, and let us know if there are any scrape failures, etc?

can you share the full scrape response for that metric?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

Hi, thank you for the reply. I use

exporters:
  debug:
    verbosity: detailed

These are some parts of my log. I didn't find any errors or failures, and I can't find the names of the missing metrics.

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> action: Str(end of minor GC)
     -> cause: Str(Allocation Failure)
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #37
Descriptor:
     -> Name: executor_pool_max_threads
     -> Description: The maximum allowed number of threads in the pool
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(applicationTaskExecutor)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 2147483647.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(taskScheduler)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 2147483647.000000
Metric #38
Descriptor:
     -> Name: nacos_naming_subscriber
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v2)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #39
Descriptor:
     -> Name: jvm_classes_loaded_classes
     -> Description: The number of classes that are currently loaded in the Java virtual machine
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 14983.000000
Metric #40
Descriptor:
     -> Name: tomcat_sessions_created_sessions_total
     -> Description:
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #41
Descriptor:
     -> Name: tomcat_sessions_alive_max_seconds
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #42
Descriptor:
     -> Name: nacos_naming_publisher
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> version: Str(v2)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #43
Descriptor:
     -> Name: jvm_gc_memory_allocated_bytes_total
     -> Description: Incremented for an increase in the size of the (young) heap memory pool after one GC to before the next
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 31471073024.000000
Metric #44
Descriptor:
     -> Name: executor_completed_tasks_total
     -> Description: The approximate total number of tasks that have completed execution
     -> Unit:
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(applicationTaskExecutor)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(taskScheduler)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 181528.000000
Metric #45
Descriptor:
     -> Name: nacos_timer_seconds
     -> Description:
     -> Unit:
     -> DataType: Summary
SummaryDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(writeConfigRpcRt)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 2024-08-20 06:42:13.391 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Count: 2
Sum: 0.114000
Metric #46
Descriptor:
     -> Name: jdbc_connections_min
     -> Description: Minimum number of idle connections in the pool.
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(dataSource)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: -1.000000
Metric #47
Descriptor:
     -> Name: http_server_requests_seconds_max
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/v2/core/cluster/node/list)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/actuator/prometheus)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.003789
NumberDataPoints #2
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SUCCESS)
     -> status: Str(200)
     -> uri: Str(/v1/console/namespaces)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> exception: Str(None)
     -> method: Str(GET)
     -> node: Str(127.0.0.1:8848)
     -> outcome: Str(SERVER_ERROR)
     -> status: Str(501)
     -> uri: Str(root)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: 0.000000
Metric #48
Descriptor:
     -> Name: jdbc_connections_max
     -> Description: Maximum number of active connections that can be allocated at the same time.
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> name: Str(dataSource)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-20 06:42:13.391 +0000 UTC
Value: -1.000000
Metric #49
Descriptor:
     -> Name: executor_queued_tasks
     -> Description: The approximate number of tasks that are queued for execution
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

peachisai avatar Aug 20 '24 06:08 peachisai

@dashpole Hi, I found this issue was assigned. If any detail should I provide, please ping me.

peachisai avatar Aug 23 '24 02:08 peachisai

Were you able to check this?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

dashpole avatar Aug 23 '24 13:08 dashpole

Were you able to check this?

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

Hi, I did not find any errors. Did you mean configuring the receiver to get the scrape log? Sorry, I don't know how to do that; could you give me some advice? This is my receiver config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "nacos-monitoring"
          scrape_interval: 30s
          metrics_path: "/nacos/actuator/prometheus"
          static_configs:
            - targets: ['127.0.0.1:8848']
          relabel_configs:
            - source_labels: [ ]
              target_label: cluster
              replacement: nacos-cluster
            - source_labels: [ __address__ ]
              regex: (.+)
              target_label: node
              replacement: $$1

peachisai avatar Aug 23 '24 16:08 peachisai

You should get additional metrics named "up" and "scrape_series_added", plus a few other scrape_.* metrics. The scrape_.* metrics let you know if any metrics were dropped or rejected by Prometheus.
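If it helps, the debug output can be narrowed to just those series with the filter processor, roughly like this (a sketch using the include-match syntax; wire it into the metrics pipeline):

processors:
  filter/scrape-health:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - up
          - scrape_.*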

dashpole avatar Aug 23 '24 18:08 dashpole

Can you look at the up, and scrape_* metrics to see if any targets are failing to be scraped, or any metrics are being dropped by the receiver?

Hi, I filtered the metrics up and scrape_*, and still found nothing:

Descriptor:
     -> Name: up
     -> Description: The scraping was successful
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Metric #4
Descriptor:
     -> Name: scrape_series_added
     -> Description: The approximate number of new series in this scrape
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Descriptor:
     -> Name: scrape_samples_post_metric_relabeling
     -> Description: The number of samples remaining after metric relabeling was applied
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0

Metric #1
Descriptor:
     -> Name: scrape_duration_seconds
     -> Description: Duration of the scrape
     -> Unit: s
     -> DataType: Gauge
NumberDataPoints #0

peachisai avatar Aug 23 '24 20:08 peachisai

Right, you will need to look at the values of those metrics to see if any are being dropped, or if the target is down. Otherwise, if you can provide the full output of the prometheus endpoint (e.g. using curl), we can try to reproduce.

dashpole avatar Aug 26 '24 14:08 dashpole

Right, you will need to look at the values of those metrics to see if any are being dropped, or if the target is down. Otherwise, if you can provide the full output of the prometheus endpoint (e.g. using curl), we can try to reproduce.

I browsed the log in detail but found nothing containing an error or a drop. May I send you an email with my remote peer endpoint?

peachisai avatar Aug 27 '24 03:08 peachisai

I browsed the log in detail but found nothing containing an error or a drop. May I send you an email with my remote peer endpoint?

No, sorry. Please don't email me links. I also don't actually need your logs--I need the metrics scrape response.

dashpole avatar Aug 27 '24 15:08 dashpole

I browsed the log in detail but found nothing containing an error or a drop. May I send you an email with my remote peer endpoint?

No, sorry. Please don't email me links. I also don't actually need your logs--I need the metrics scrape response.

Hi, I found no drops or errors in the metrics scrape response, but it skipped certain segments:

nacos_monitor{module="naming",name="mysqlHealthCheck",} 0.0
nacos_monitor{module="naming",name="emptyPush",} 0.0
nacos_monitor{module="config",name="configCount",} 2.0
nacos_monitor_count{module="core",name="raft_read_from_leader",} 0.0
nacos_monitor_sum{module="core",name="raft_read_from_leader",} 0.0
nacos_monitor{module="naming",name="tcpHealthCheck",} 0.0
nacos_monitor{module="naming",name="serviceChangedEventQueueSize",} 0.0
nacos_monitor{module="core",name="longConnection",} 0.0
nacos_monitor{module="naming",name="totalPush",} 0.0
nacos_monitor{module="naming",name="serviceSubscribedEventQueueSize",} 0.0
nacos_monitor{module="naming",name="serviceCount",} 0.0
nacos_monitor{module="naming",name="httpHealthCheck",} 0.0
nacos_monitor{module="naming",name="maxPushCost",} -1.0
nacos_monitor{module="config",name="longPolling",} 0.0
nacos_monitor{module="naming",name="failedPush",} 0.0
nacos_monitor{module="naming",name="leaderStatus",} 0.0
nacos_monitor{module="config",name="publish",} 0.0
nacos_monitor{module="config",name="dumpTask",} 0.0
nacos_monitor_count{module="core",name="raft_read_index_failed",} 0.0
nacos_monitor_sum{module="core",name="raft_read_index_failed",} 0.0
nacos_monitor{module="config",name="notifyTask",} 0.0
nacos_monitor{module="config",name="fuzzySearch",} 0.0
nacos_monitor{module="naming",name="avgPushCost",} -1.0
nacos_monitor{module="config",name="getConfig",} 0.0
nacos_monitor{module="naming",name="totalPushCountForAvg",} 0.0
nacos_monitor{module="naming",name="subscriberCount",} 0.0
nacos_monitor{module="naming",name="ipCount",} 0.0
nacos_monitor{module="config",name="notifyClientTask",} 0.0
nacos_monitor{module="naming",name="totalPushCostForAvg",} 0.0
nacos_monitor{module="naming",name="pushPendingTaskCount",} 0.0
# HELP nacos_monitor_max  

The metrics above the line nacos_monitor_sum{module="core",name="raft_read_index_failed",} 0.0 cannot be scraped; the rest of the metrics below it can be scraped.
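As a sanity check, the exposition itself can be linted with promtool (which ships with Prometheus and reads the scrape body from stdin):

curl -s http://127.0.0.1:8848/nacos/actuator/prometheus | promtool check metrics

One detail that may matter: the response interleaves plain nacos_monitor gauge samples with nacos_monitor_count/nacos_monitor_sum summary series under the same family name, so suffix-based metric-family grouping in the parser could be involved.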

Here is the scrape log:

Descriptor:
     -> Name: disk_total_bytes
     -> Description: Total space for path
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> node: Str(127.0.0.1:8848)
     -> path: Str(D:\ideaprojects\github\nacos\.)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 296022437888.000000
Metric #69
Descriptor:
     -> Name: nacos_monitor
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(core)
     -> name: Str(raft_read_index_failed)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(notifyTask)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(fuzzySearch)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(avgPushCost)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: -1.000000
NumberDataPoints #4
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(config)
     -> name: Str(getConfig)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #5
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(totalPushCountForAvg)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #6
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(subscriberCount)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000
NumberDataPoints #7
Data point attributes:
     -> cluster: Str(nacos-cluster)
     -> module: Str(naming)
     -> name: Str(ipCount)
     -> node: Str(127.0.0.1:8848)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-08-31 14:08:35.452 +0000 UTC
Value: 0.000000

peachisai avatar Aug 31 '24 14:08 peachisai

I will try to debug the code.

peachisai avatar Sep 07 '24 06:09 peachisai

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • receiver/prometheus: @Aneurysm9 @dashpole

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Nov 07 '24 03:11 github-actions[bot]

The issue still exists.

peachisai avatar Nov 28 '24 10:11 peachisai

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

  • receiver/prometheus: @Aneurysm9 @dashpole

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] avatar Jan 29 '25 03:01 github-actions[bot]

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions[bot] avatar Mar 30 '25 05:03 github-actions[bot]

The issue still exists.

I have also encountered the same problem. Have you solved it?

seal90 avatar Aug 07 '25 08:08 seal90