beats icon indicating copy to clipboard operation
beats copied to clipboard

Clarify documentation of beats output errors

Open jsoriano opened this issue 4 years ago • 13 comments

There are several metrics reporting output errors from the beat module, clarify their meaning in the docs, focusing on how worrisome these errors are for users.

For example it is not clear if a non-zero beat.stats.libbeat.output.write.errors implied some data loss, though it probably didn't if beat.stats.libbeat.output.events.dropped is zero.

For confirmed bugs, please report:

  • Discuss Forum URL: https://discuss.elastic.co/t/what-does-output-errors-mean-in-beats-monitoring-in-kibana/283365

jsoriano avatar Sep 06 '21 11:09 jsoriano

Pinging @elastic/integrations (Team:Integrations)

elasticmachine avatar Sep 06 '21 11:09 elasticmachine

Pinging @elastic/stack-monitoring (Stack monitoring)

elasticmachine avatar Sep 06 '21 11:09 elasticmachine

The user facing part of this (i.e. the kibana interface) could also benefit from a bit more clarification. For instance, in the image below, what should I be worried about? Do I have dropped events that got lost forever without being logged? How does "Output Errors" relate to "Fail Rates"?

Screen Shot 2021-09-08 at 14 03 15

nkakouros avatar Sep 08 '21 12:09 nkakouros

After reading the discussion and the docs I still don't know about the exact meaning of those metrics in the stack monitoring. I think we all agree that one of the most important questions a user has is: "Do I lose data or not?" I still can't answer this question. For example: I'm using packetbeat to gather DNS traffic and from time to time the "Failed in Pipeline" counter is jumping from 0 to 1000 and as stated above I'm asking myself: Do I lose those DNS queries or not?

Also I don't really get the difference between the two graphs "Fail Rates" and "Output Errors". To me it sounds like they are both doing the same, but they probably don't?

tom-x1 avatar Oct 10 '21 17:10 tom-x1

I would also like to know the meaning of these values, especially beat.stats.libbeat.output.events.dropped, if that is larger than 0 am I losing data?

sachinfrayne avatar May 10 '22 14:05 sachinfrayne

Inside the code for Libbeat:

	//
	// Output event stats
	//
	batches *monitoring.Uint // total number of batches processed by output
	events  *monitoring.Uint // total number of events processed by output

	acked      *monitoring.Uint // total number of events ACKed by output
	failed     *monitoring.Uint // total number of events failed in output
	active     *monitoring.Uint // events sent and waiting for ACK/fail from output
	duplicates *monitoring.Uint // events sent and waiting for ACK/fail from output
	dropped    *monitoring.Uint // total number of invalid events dropped by the output
	tooMany    *monitoring.Uint // total number of too many requests replies from output

	//
	// Output network connection stats
	//
	writeBytes  *monitoring.Uint // total amount of bytes written by output
	writeErrors *monitoring.Uint // total number of errors on write

	readBytes  *monitoring.Uint // total amount of bytes read
	readErrors *monitoring.Uint // total number of errors while waiting for response on output
}

When you're querying in Elastic for results of Libbeat (see below), the Output Errors is derived from the measured delta between the initial timestamp's readErrors + writeErrors and the latest timestamp's readErrors + writeErrors. According to the code commentary then, Output Errors is the number of network packets experiencing errors.

The below example is utilizing apm-server as the beat type, but you can replace it to suit your needs.

GET _search

{
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "should": [
                {
                  "term": {
                    "data_stream.dataset": "beats.stats"
                  }
                },
                {
                  "term": {
                    "metricset.name": "stats"
                  }
                },
                {
                  "term": {
                    "type": "beats_stats"
                  }
                }
              ]
            }
          },
          {
            "term": {
              "cluster_uuid": "CLUSTER_UUID"
            }
          },
          {
            "range": {
              "beats_stats.timestamp": {
                "format": "epoch_millis",
                "gte": 1665053615330,
                "lte": 1665054515330
              }
            }
          },
          {
            "bool": {
              "must": {
                "term": {
                  "beats_stats.beat.type": "apm-server"
                }
              }
            }
          }
        ]
      }
    },
    "collapse": {
      "field": "beats_stats.metrics.beat.info.ephemeral_id",
      "inner_hits": {
        "name": "earliest",
        "size": 1,
        "sort": [
          {
            "beats_stats.timestamp": {
              "order": "asc",
              "unmapped_type": "long"
            }
          },
          {
            "@timestamp": {
              "order": "asc",
              "unmapped_type": "long"
            }
          }
        ]
      }
    },
    "sort": [
      {
        "beats_stats.beat.uuid": {
          "order": "asc",
          "unmapped_type": "long"
        }
      },
      {
        "timestamp": {
          "order": "desc",
          "unmapped_type": "long"
        }
      }
    ]
  }

6fears7 avatar Oct 06 '22 13:10 6fears7

According to the code commentary then, Output Errors is the number of network packets experiencing errors.

@6fears7 So, we have event loss or not? I don't have any other errors, like 429

zez3 avatar Oct 06 '22 20:10 zez3

I've added this beat.stats.libbeat.output.events.dropped to my Agent dashboard

image

  • I get an daily alarm telling me that,

eg.

monitor filebeat drops - * is in a state of ALERT

Reason:
beat.stats.libbeat.output.events.dropped is 3,313,220.83422 in the last 1 day for all hosts. Alert when > 0.

That 3 mil./day events are kind of a lot from my customer point of view

zez3 avatar Oct 06 '22 20:10 zez3

@6fears7 Well that still does not answer if I have lost events or not. An "error" doesn't necessarily mean that I have lost events, I don't know if after those errors occur some kind of mechanism tries to process those faulty events again or not. Telling the users just "Error" without clearly letting them know the consequences of it does not help. A graph which says "Events lost" would be clear and helpful in my opinion.

tom-x1 avatar Oct 07 '22 05:10 tom-x1

I've added this beat.stats.libbeat.output.events.dropped to my Agent dashboard

image

* I get an daily alarm telling me that,

eg.

monitor filebeat drops - * is in a state of ALERT

Reason:
beat.stats.libbeat.output.events.dropped is 3,313,220.83422 in the last 1 day for all hosts. Alert when > 0.

That 3 mil./day events are kind of a lot from my customer point of view

In client.go, we are given an initial struct of:

type bulkResultStats struct {
	acked        int // number of events ACKed by Elasticsearch
	duplicates   int // number of events failed with `create` due to ID already being indexed
	fails        int // number of failed events (can be retried)
	nonIndexable int // number of failed events (not indexable)
	tooMany      int // number of events receiving HTTP 429 Too Many Requests
}

Later, we see what constitutes a drop:

failed := len(failedEvents)
	span.Context.SetLabel("events_failed", failed)
	if st := client.observer; st != nil {
		dropped := stats.nonIndexable 
		duplicates := stats.duplicates
		acked := len(data) - failed - dropped - duplicates

		st.Acked(acked)
		st.Failed(failed)
		st.Dropped(dropped)
		st.Duplicate(duplicates)
		st.ErrTooMany(stats.tooMany)
	}

So dropped events would be those of the nonIndexable type.

In order to determine what a "nonIndexable" type is, the code iterates through the Bulk results:

... 
		if status < 500 { 
			if status == http.StatusTooManyRequests {
				stats.tooMany++
			} else {
				// hard failure, apply policy action
				result, _ := data[i].Content.Meta.HasKey(dead_letter_marker_field)
				if result {
					stats.nonIndexable++
					client.log.Errorf("Can't deliver to dead letter index event %#v (status=%v): %s", data[i], status, msg)
					// poison pill - this will clog the pipeline if the underlying failure is non transient.
				} else if client.NonIndexableAction == dead_letter_index {
					client.log.Warnf("Cannot index event %#v (status=%v): %s, trying dead letter index", data[i], status, msg)
					if data[i].Content.Meta == nil {
						data[i].Content.Meta = common.MapStr{
							dead_letter_marker_field: true,
						}
					} else {
						data[i].Content.Meta.Put(dead_letter_marker_field, true)
					}
					data[i].Content.Fields = common.MapStr{
						"message":       data[i].Content.Fields.String(),
						"error.type":    status,
						"error.message": string(msg),
					}
				} else { // drop
					stats.nonIndexable++
					client.log.Warnf("Cannot index event %#v (status=%v): %s, dropping event!", data[i], status, msg)
					continue
				}

So any event that cannot be indexed (as an example here) will become tagged under the dead_letter_marker_field and dropped from the pipeline.

To answer your question, I do believe that those events are worth looking into. My first guess would be to check how the pipeline is being parsed and if there's some Grok that needs to be done.

6fears7 avatar Oct 07 '22 12:10 6fears7

Thank you for clarifying @6fears7

I use official pipelines from Fleet integrations but also custom ones. I also use dissect and the rest of filebeat processor.(dns is still missing from elasticsearch processor) @sachin-frayne Can you help now to further identify those events and/or check the pipelines? as sugested above.

zez3 avatar Oct 09 '22 05:10 zez3

Here is what I see in my Agent filbeat metrics

{ "_index": ".ds-metrics-elastic_agent.filebeat-ece-2022.09.28-000017", "_id": "H4Apu4MBF-Kn07JyC-C-", "_version": 1, "_score": 0, "_source": { "agent": { "name": "my_host.my.dom", "id": "4ba5a4aa-848d-4435-a13e-ace9584cddaa", "ephemeral_id": "2a66e283-d27a-433b-8b39-ae7105878348", "type": "metricbeat", "version": "8.4.2" }, "@timestamp": "2022-10-09T05:12:51.923Z", "ecs": { "version": "8.0.0" }, "service": { "address": "http://unix/stats", "name": "beat", "type": "beat" }, "data_stream": { "namespace": "ece", "type": "metrics", "dataset": "elastic_agent.filebeat" }, "beat": { "elasticsearch": { "cluster": { "id": "tOaF05W8QCG1Q3z25K4WZQ" } }, "stats": { "handles": { "limit": { "hard": 524288, "soft": 1024 }, "open": 148 }, "system": { "load": { "1": 7.16, "5": 8.6, "15": 8.53, "norm": { "1": 0.0559, "5": 0.0672, "15": 0.0666 } }, "cpu": { "cores": 128 } }, "beat": { "host": "my_host.my.dom", "name": "my_host.my.dom", "type": "filebeat", "uuid": "a97378a4-c48f-45c2-901b-7b8b5f6b6502", "version": "8.4.2" }, "runtime": { "goroutines": 289 }, "cpu": { "total": { "ticks": 13560390, "time": { "ms": 13560390 }, "value": 13560390 }, "system": { "ticks": 1185830, "time": { "ms": 1185830 } }, "user": { "ticks": 12374560, "time": { "ms": 12374560 } } }, "memstats": { "memory": { "total": 1581247076272, "alloc": 286642512 }, "rss": 546992128, "gc_next": 322126080 }, "libbeat": { "output": { "read": { "bytes": 515468121, "errors": 0 }, "type": "elasticsearch", "write": { "bytes": 33985770829, "errors": 0 }, "events": { "batches": 479155, "duplicates": 71111, "total": 23936376, "dropped": 23671, "toomany": 0, "active": 50, "failed": 0, "acked": 23841544 } }, "pipeline": { "clients": 20, "queue": { "acked": 23936326 }, "events": { "total": 23940457, "filtered": 12, "dropped": 0, "active": 4119, "failed": 0, "published": 23940442, "retry": 2048 } }, "config": { "running": 0, "stops": 0, "starts": 20, "reloads": 0 } }, "cgroup": { "memory": { "mem": { "usage": { "bytes": 2462949376 }, "limit": { "bytes": 53687091200 } }, "id": "elastic-agent.service" }, "cpu": { "cfs": { "period": { "us": 100000 }, "quota": { "us": 0 } }, "stats": { "periods": 0, "throttled": { "ns": 0, "periods": 0 } }, "id": "elastic-agent.service" }, "cpuacct": { "total": { "ns": 4050343270691067 }, "id": "elastic-agent.service" } }, "uptime": { "ms": 27149727 }, "info": { "name": "filebeat", "ephemeral_id": "09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42", "version": "8.4.2", "uptime": { "ms": 27149727 } } }, "id": "a97378a4-c48f-45c2-901b-7b8b5f6b6502", "type": "filebeat" }, "host": { "hostname": "my_host.my.dom", "os": { "kernel": "5.4.0-104-generic", "codename": "focal", "name": "Ubuntu", "type": "linux", "family": "debian", "version": "20.04.5 LTS (Focal Fossa)", "platform": "ubuntu" }, "containerized": false, "ip": [ "fe80::1ac0:4dff:fe7a:5796", "1.1.1.1", "fe80::ba59:9fff:fed2:222", "1.1.1.1", "192.168.43.1", "fe80::42:73ff:fe00:5c0a", "10.0.3.1", "fe80::c5e:4ff:feff:61bb", "fe80::b4c7:89ff:fe39:ef4d", "fe80::8014:f6ff:fe19:bd2a", "fe80::f49b:71ff:fef1:1993", "fe80::9454:a4ff:fee7:cdbf", "fe80::dcd6:6ff:fee9:c14d", "fe80::34fa:29ff:fe12:af8a", "fe80::866:4ff:fe83:8873", "fe80::7059:d7ff:fe3d:76a4", "fe80::446d:53ff:febc:ede3", "fe80::841e:e9ff:feda:42e", "fe80::6086:33ff:feee:6f81", "fe80::d45a:b8ff:fe91:c903", "fe80::14a4:11ff:fed1:a6a5", "fe80::dc56:88ff:fe6d:a3ba", "fe80::2c5d:f2ff:fec6:b0eb", "fe80::2803:d9ff:fe72:8edd", "fe80::8ce0:aff:fef4:f9df", "fe80::2c9d:77ff:fed3:b27a", "fe80::8875:6ff:fe86:64f9", "fe80::bc5e:83ff:feb7:348a", "fe80::d483:27ff:fecb:8415", "fe80::c0ba:99ff:fe90:cedf", "fe80::ac20:56ff:fe39:da07", "fe80::7c8b:dfff:fe76:c1ee", "fe80::7016:58ff:fef8:e362", "fe80::5482:78ff:fefc:2464", "fe80::586d:18ff:fe90:f306" ], "name": "my_host.my.dom", "id": "06d09b925a0246dcbd044c0036e01c9c", "mac": [ "00:16:3e:00:00:00", "02:42:73:00:5c:0a", "0a:66:04:83:88:73", "0e:5e:04:ff:61:bb", "16:a4:11:d1:a6:a5", "18:c0:4d:7a:57:96", "18:c0:4d:7a:57:97", "2a:03:d9:72:8e:dd", "2a:f8:b1:d1:37:35", "2e:5d:f2:c6:b0:eb", "2e:9d:77:d3:b2:7a", "36:fa:29:12:af:8a", "46:6d:53:bc:ed:e3", "56:82:78:fc:24:64", "5a:6d:18:90:f3:06", "62:86:33:ee:6f:81", "72:16:58:f8:e3:62", "72:59:d7:3d:76:a4", "7e:8b:df:76:c1:ee", "82:14:f6:19:bd:2a", "86:1e:e9:da:04:2e", "8a:75:06:86:64:f9", "8e:e0:0a:f4:f9:df", "96:54:a4:e7:cd:bf", "ae:20:56:39:da:07", "b6:c7:89:39:ef:4d", "b8:59:9f:d2:02:22", "be:5e:83:b7:34:8a", "c2:ba:99:90:ce:df", "d6:5a:b8:91:c9:03", "d6:83:27:cb:84:15", "de:56:88:6d:a3:ba", "de:d6:06:e9:c1:4d", "f6:9b:71:f1:19:93" ], "architecture": "x86_64" }, "elastic_agent": { "id": "4ba5a4aa-848d-4435-a13e-ace9584cddaa", "version": "8.4.2", "snapshot": false }, "metricset": { "period": 10000, "name": "stats" }, "event": { "duration": 28331585, "agent_id_status": "verified", "ingested": "2022-10-09T05:12:52Z", "module": "beat", "dataset": "elastic_agent.filebeat" } }, "fields": { "beat.stats.cgroup.cpu.id": [ "elastic-agent.service" ], "beat.stats.libbeat.output.write.bytes": [ 33985770829 ], "beat.stats.libbeat.pipeline.events.failed": [ 0 ], "elastic_agent.version": [ "8.4.2" ], "beat.stats.libbeat.pipeline.events.published": [ 23940442 ], "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "my_host.my.dom" ], "host.mac": [ "00:16:3e:00:00:00", "02:42:73:00:5c:0a", "0a:66:04:83:88:73", "0e:5e:04:ff:61:bb", "16:a4:11:d1:a6:a5", "18:c0:4d:7a:57:96", "18:c0:4d:7a:57:97", "2a:03:d9:72:8e:dd", "2a:f8:b1:d1:37:35", "2e:5d:f2:c6:b0:eb", "2e:9d:77:d3:b2:7a", "36:fa:29:12:af:8a", "46:6d:53:bc:ed:e3", "56:82:78:fc:24:64", "5a:6d:18:90:f3:06", "62:86:33:ee:6f:81", "72:16:58:f8:e3:62", "72:59:d7:3d:76:a4", "7e:8b:df:76:c1:ee", "82:14:f6:19:bd:2a", "86:1e:e9:da:04:2e", "8a:75:06:86:64:f9", "8e:e0:0a:f4:f9:df", "96:54:a4:e7:cd:bf", "ae:20:56:39:da:07", "b6:c7:89:39:ef:4d", "b8:59:9f:d2:02:22", "be:5e:83:b7:34:8a", "c2:ba:99:90:ce:df", "d6:5a:b8:91:c9:03", "d6:83:27:cb:84:15", "de:56:88:6d:a3:ba", "de:d6:06:e9:c1:4d", "f6:9b:71:f1:19:93" ], "beat.stats.cpu.user.time.ms": [ 12374560 ], "service.type": [ "beat" ], "beat.stats.beat.type": [ "filebeat" ], "beat.stats.libbeat.config.starts": [ 20 ], "host.os.version": [ "20.04.5 LTS (Focal Fossa)" ], "beat.stats.cgroup.cpu.cfs.quota.us": [ 0 ], "agent.name": [ "my_host.my.dom" ], "beat.stats.libbeat.config.stops": [ 0 ], "beat.stats.cgroup.memory.id": [ "elastic-agent.service" ], "event.agent_id_status": [ "verified" ], "beat.stats.cpu.total.value": [ 13560390 ], "beat.stats.libbeat.output.events.duplicates": [ 71111 ], "host.os.type": [ "linux" ], "beat.stats.libbeat.output.events.batches": [ 479155 ], "beat.stats.libbeat.pipeline.events.filtered": [ 12 ], "host.architecture": [ "x86_64" ], "agent.id": [ "4ba5a4aa-848d-4435-a13e-ace9584cddaa" ], "beat.stats.cpu.user.ticks": [ 12374560 ], "beat.stats.cgroup.cpu.stats.periods": [ 0 ], "host.containerized": [ false ], "service.address": [ "http://unix/stats" ], "beat.stats.libbeat.output.events.active": [ 50 ], "beat.stats.libbeat.pipeline.events.total": [ 23940457 ], "beat.stats.cpu.total.ticks": [ 13560390 ], "beat.stats.info.version": [ "8.4.2" ], "beat.stats.handles.limit.hard": [ 524288 ], "beat.stats.runtime.goroutines": [ 289 ], "beat.stats.beat.name": [ "my_host.my.dom" ], "beat.stats.beat.version": [ "8.4.2" ], "beat.stats.libbeat.pipeline.queue.acked": [ 23936326 ], "beat.stats.libbeat.config.running": [ 0 ], "host.ip": [ "fe80::1ac0:4dff:fe7a:5796", "1.1.1.1", "fe80::ba59:9fff:fed2:222", "1.1.1.1", "192.168.43.1", "fe80::42:73ff:fe00:5c0a", "10.0.3.1", "fe80::c5e:4ff:feff:61bb", "fe80::b4c7:89ff:fe39:ef4d", "fe80::8014:f6ff:fe19:bd2a", "fe80::f49b:71ff:fef1:1993", "fe80::9454:a4ff:fee7:cdbf", "fe80::dcd6:6ff:fee9:c14d", "fe80::34fa:29ff:fe12:af8a", "fe80::866:4ff:fe83:8873", "fe80::7059:d7ff:fe3d:76a4", "fe80::446d:53ff:febc:ede3", "fe80::841e:e9ff:feda:42e", "fe80::6086:33ff:feee:6f81", "fe80::d45a:b8ff:fe91:c903", "fe80::14a4:11ff:fed1:a6a5", "fe80::dc56:88ff:fe6d:a3ba", "fe80::2c5d:f2ff:fec6:b0eb", "fe80::2803:d9ff:fe72:8edd", "fe80::8ce0:aff:fef4:f9df", "fe80::2c9d:77ff:fed3:b27a", "fe80::8875:6ff:fe86:64f9", "fe80::bc5e:83ff:feb7:348a", "fe80::d483:27ff:fecb:8415", "fe80::c0ba:99ff:fe90:cedf", "fe80::ac20:56ff:fe39:da07", "fe80::7c8b:dfff:fe76:c1ee", "fe80::7016:58ff:fef8:e362", "fe80::5482:78ff:fefc:2464", "fe80::586d:18ff:fe90:f306" ], "agent.type": [ "metricbeat" ], "beat.stats.libbeat.output.read.errors": [ 0 ], "beat.stats.memstats.gc_next": [ 322126080 ], "elastic_agent.snapshot": [ false ], "beat.id": [ "a97378a4-c48f-45c2-901b-7b8b5f6b6502" ], "host.id": [ "06d09b925a0246dcbd044c0036e01c9c" ], "beat.stats.system.load.15": [ 8.53 ], "beat.stats.libbeat.pipeline.clients": [ 20 ], "elastic_agent.id": [ "4ba5a4aa-848d-4435-a13e-ace9584cddaa" ], "beat.stats.info.uptime.ms": [ 27149727 ], "beat.stats.cgroup.cpuacct.total.ns": [ 4050343270691067 ], "metricset.period": [ 10000 ], "host.os.codename": [ "focal" ], "beat.stats.libbeat.output.events.dropped": [ 23671 ], "event.duration": [ 28331585 ], "beat.stats.system.cpu.cores": [ 128 ], "event.ingested": [ "2022-10-09T05:12:52.000Z" ], "@timestamp": [ "2022-10-09T05:12:51.923Z" ], "host.os.platform": [ "ubuntu" ], "data_stream.dataset": [ "elastic_agent.filebeat" ], "beat.stats.system.load.norm.15": [ 0.0666 ], "agent.ephemeral_id": [ "2a66e283-d27a-433b-8b39-ae7105878348" ], "beat.stats.handles.limit.soft": [ 1024 ], "beat.stats.info.name": [ "filebeat" ], "beat.type": [ "filebeat" ], "beat.stats.libbeat.config.reloads": [ 0 ], "beat.stats.libbeat.output.events.acked": [ 23841544 ], "beat.stats.beat.uuid": [ "a97378a4-c48f-45c2-901b-7b8b5f6b6502" ], "host.os.name": [ "Ubuntu" ], "beat.stats.uptime.ms": [ 27149727 ], "beat.stats.libbeat.output.type": [ "elasticsearch" ], "host.name": [ "my_host.my.dom" ], "beat.stats.libbeat.output.events.toomany": [ 0 ], "beat.stats.libbeat.output.events.failed": [ 0 ], "beat.stats.libbeat.pipeline.events.dropped": [ 0 ], "beat.stats.beat.host": [ "idsece02.log.unibe.ch" ], "data_stream.type": [ "metrics" ], "ecs.version": [ "8.0.0" ], "beat.stats.cgroup.cpu.cfs.period.us": [ 100000 ], "agent.version": [ "8.4.2" ], "host.os.family": [ "debian" ], "beat.stats.cpu.total.time.ms": [ 13560390 ], "beat.stats.libbeat.output.write.errors": [ 0 ], "beat.stats.memstats.memory.total": [ 1581247076272 ], "beat.stats.handles.open": [ 148 ], "beat.stats.libbeat.output.events.total": [ 23936376 ], "beat.stats.memstats.memory.alloc": [ 286642512 ], "beat.stats.system.load.norm.5": [ 0.0672 ], "beat.stats.system.load.norm.1": [ 0.0559 ], "beat.stats.libbeat.pipeline.events.active": [ 4119 ], "event.module": [ "beat" ], "host.os.kernel": [ "5.4.0-104-generic" ], "beat.stats.cgroup.cpu.stats.throttled.ns": [ 0 ], "beat.stats.info.ephemeral_id": [ "09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42" ], "beat.stats.cgroup.cpu.stats.throttled.periods": [ 0 ], "beat.stats.cpu.system.time.ms": [ 1185830 ], "service.name": [ "beat" ], "beat.elasticsearch.cluster.id": [ "tOaF05W8QCG1Q3z25K4WZQ" ], "data_stream.namespace": [ "ece" ], "beat.stats.system.load.1": [ 7.16 ], "beat.stats.cgroup.memory.mem.usage.bytes": [ 2462949376 ], "beat.stats.cpu.system.ticks": [ 1185830 ], "beat.stats.cgroup.memory.mem.limit.bytes": [ 53687091200 ], "beat.stats.libbeat.output.read.bytes": [ 515468121 ], "beat.stats.system.load.5": [ 8.6 ], "beat.stats.cgroup.cpuacct.id": [ "elastic-agent.service" ], "metricset.name": [ "stats" ], "beat.stats.libbeat.pipeline.events.retry": [ 2048 ], "beat.stats.memstats.rss": [ 546992128 ], "event.dataset": [ "elastic_agent.filebeat" ] } }

my issue is with:

"beat.stats.libbeat.output.events.dropped": [ 23671 ],

and not with

"beat.stats.libbeat.pipeline.events.dropped": [ 0 ],

@6fears7 Does this mean that my events are not dropped in the pipelines?

zez3 avatar Oct 09 '22 05:10 zez3

So, I get a lot of such messages:

Click to expand!

{ "_index": ".ds-logs-elastic_agent.filebeat-ece-2022.09.28-000029", "_id": "u0iIu4MBozNqveEQYO8z", "_version": 1, "_score": 0, "_source": { "container": { "id": "elastic-agent-d3eb3e" }, "agent": { "name": "myhost.my.dom", "id": "7b48d297-3b7b-4993-9525-8d398719e062", "type": "filebeat", "ephemeral_id": "b841ef51-a943-47b0-b224-2aa86de88024", "version": "8.4.2" }, "service.name": "filebeat", "log": { "file": { "path": "/opt/Elastic/Agent/data/elastic-agent-d3eb3e/logs/default/filebeat-20221009-8.ndjson" }, "offset": 378916652 }, "elastic_agent": { "id": "7b48d297-3b7b-4993-9525-8d398719e062", "version": "8.4.2", "snapshot": false }, "message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 8, 56, 53, 910440446, time.Local), Meta:{"input_id":"udp-fortinet_fortigate-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","raw_index":"logs-fortinet_fortigate.log-ece_forti","stream_id":"udp-fortinet_fortigate.log-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","truncated":false}, Fields:{"agent":{"ephemeral_id":"f754f48a-8030-41f9-ae3f-dd4e7a8edbd9","id":"7b48d297-3b7b-4993-9525-8d398719e062","name":"myhost.my.dom","type":"filebeat","version":"8.4.2"},"data_stream":{"dataset":"fortinet_fortigate.log","namespace":"ece_forti","type":"logs"},"ecs":{"version":"8.0.0"},"elastic_agent":{"id":"7b48d297-3b7b-4993-9525-8d398719e062","snapshot":false,"version":"8.4.2"},"event":{"dataset":"fortinet_fortigate.log"},"input":{"type":"udp"},"log":{"source":{"address":"1.1.1.1:24676"}},"message":"\u003c190\u003edate=2022-10-09 time=08:56:53 devname=\"UniLabFW\" devid=\"FG100FTK20025470\" eventtime=1665298613860494320 tz=\"+0200\" logid=\"0112053203\" type=\"event\" subtype=\"connector\" level=\"information\" vd=\"Border\" logdesc=\"Dynamic address updated.\" addr=\"FCTEMS_ALL_FORTICLOUD_SERVERS\" msg=\"Updated tag FCTEMS_ALL_FORTICLOUD_SERVERS\"","tags":["fortinet-fortigate","fortinet-firewall","forwarded"]}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [fortinet.firewall.addr] of type [ip] in document with id 'VkiIu4MBozNqveEQSZDG'. Preview of field's value: 'FCTEMS_ALL_FORTICLOUD_SERVERS'","caused_by":{"type":"illegal_argument_exception","reason":"'FCTEMS_ALL_FORTICLOUD_SERVERS' is not an IP string literal."}}, dropping event!", "log.logger": "elasticsearch", "log.origin": { "file.line": 429, "file.name": "elasticsearch/client.go" }, "input": { "type": "filestream" }, "@timestamp": "2022-10-09T06:56:54.742Z", "ecs": { "version": "8.0.0" }, "data_stream": { "namespace": "ece", "type": "logs", "dataset": "elastic_agent.filebeat" }, "host": { "hostname": "myhost.my.dom", "os": { "kernel": "5.4.0-125-generic", "codename": "focal", "name": "Ubuntu", "type": "linux", "family": "debian", "version": "20.04.5 LTS (Focal Fossa)", "platform": "ubuntu" }, "containerized": false, "ip": [ "fe80::1ac0:4dff:fe7a:560a", "1.1.1.1", "fe80::ba59:9fff:fed2:4e6", "10.0.3.1", "192.168.43.1", "fe80::42:d3ff:fe9e:1acf", "fe80::602f:cbff:fee8:aba1", "fe80::b09f:c2ff:feda:9a3f", "fe80::5c79:49ff:febe:1ace", "fe80::d850:fbff:fe99:98fe", "fe80::f872:d1ff:fece:32cc", "fe80::ccd5:2dff:fe5b:32f8", "fe80::a6:65ff:fe4d:3a83", "fe80::80c8:c2ff:fe88:3e9d", "fe80::88a3:97ff:feb0:d411", "fe80::6c0e:2ff:fe33:8663", "fe80::e8cd:80ff:fe0b:94e8", "fe80::d429:7eff:fe77:efdf", "fe80::b882:f4ff:fefa:32dc", "fe80::cc7d:73ff:fe22:422f", "fe80::f0cc:c5ff:fe65:d8ec", "fe80::82:47ff:fe79:a2f9", "fe80::680e:d3ff:fe45:9e6", "fe80::2490:2aff:fe9f:1157", "fe80::944b:eaff:fef8:1ce6", "fe80::b4c8:fdff:fe59:6a6c", "fe80::45b:edff:fe5d:305e", "fe80::301b:16ff:fe60:7713", "fe80::a42d:eff:fe18:442a", "fe80::64cb:b9ff:fe97:6305", "fe80::600b:ebff:fedb:2e4f", "fe80::dc91:3bff:fe29:b75b", "fe80::5042:7eff:fe93:f008", "fe80::605f:84ff:fea7:9128" ], "name": "myhost.my.dom", "id": "aa3d81edad7148488d332274f0ec28e1", "mac": [ "00:16:3e:00:00:00", "02:42:d3:9e:1a:cf", "02:82:47:79:a2:f9", "02:a6:65:4d:3a:83", "06:5b:ed:5d:30:5e", "18:c0:4d:7a:56:0a", "18:c0:4d:7a:56:0b", "26:90:2a:9f:11:57", "32:1b:16:60:77:13", "52:42:7e:93:f0:08", "5e:79:49:be:1a:ce", "62:0b:eb:db:2e:4f", "62:2f:cb:e8:ab:a1", "62:5f:84:a7:91:28", "66:cb:b9:97:63:05", "68:05:ca:13:70:e2", "68:05:ca:13:70:e3", "6a:0e:d3:45:09:e6", "6e:0e:02:33:86:63", "82:c8:c2:88:3e:9d", "8a:a3:97:b0:d4:11", "96:4b:ea:f8:1c:e6", "a6:2d:0e:18:44:2a", "b2:9f:c2:da:9a:3f", "b6:c8:fd:59:6a:6c", "b8:59:9f:d2:04:e6", "ba:82:f4:fa:32:dc", "ce:7d:73:22:42:2f", "ce:d5:2d:5b:32:f8", "d6:29:7e:77:ef:df", "da:50:fb:99:98:fe", "de:91:3b:29:b7:5b", "ea:cd:80:0b:94:e8", "f2:cc:c5:65:d8:ec", "fa:72:d1:ce:32:cc" ], "architecture": "x86_64" }, "log.level": "warn", "event": { "agent_id_status": "verified", "ingested": "2022-10-09T06:57:00Z", "dataset": "elastic_agent.filebeat" } }, "fields": { "elastic_agent.version": [ "8.4.2" ], "host.os.name.text": [ "Ubuntu" ], "host.hostname": [ "myhost.my.dom" ], "host.mac": [ "00:16:3e:00:00:00", "02:42:d3:9e:1a:cf", "02:82:47:79:a2:f9", "02:a6:65:4d:3a:83", "06:5b:ed:5d:30:5e", "18:c0:4d:7a:56:0a", "18:c0:4d:7a:56:0b", "26:90:2a:9f:11:57", "32:1b:16:60:77:13", "52:42:7e:93:f0:08", "5e:79:49:be:1a:ce", "62:0b:eb:db:2e:4f", "62:2f:cb:e8:ab:a1", "62:5f:84:a7:91:28", "66:cb:b9:97:63:05", "68:05:ca:13:70:e2", "68:05:ca:13:70:e3", "6a:0e:d3:45:09:e6", "6e:0e:02:33:86:63", "82:c8:c2:88:3e:9d", "8a:a3:97:b0:d4:11", "96:4b:ea:f8:1c:e6", "a6:2d:0e:18:44:2a", "b2:9f:c2:da:9a:3f", "b6:c8:fd:59:6a:6c", "b8:59:9f:d2:04:e6", "ba:82:f4:fa:32:dc", "ce:7d:73:22:42:2f", "ce:d5:2d:5b:32:f8", "d6:29:7e:77:ef:df", "da:50:fb:99:98:fe", "de:91:3b:29:b7:5b", "ea:cd:80:0b:94:e8", "f2:cc:c5:65:d8:ec", "fa:72:d1:ce:32:cc" ], "log.logger": [ "elasticsearch" ], "container.id": [ "elastic-agent-d3eb3e" ], "host.ip": [ "fe80::1ac0:4dff:fe7a:560a", "1.1.1.1", "fe80::ba59:9fff:fed2:4e6", "10.0.3.1", "192.168.43.1", "fe80::42:d3ff:fe9e:1acf", "fe80::602f:cbff:fee8:aba1", "fe80::b09f:c2ff:feda:9a3f", "fe80::5c79:49ff:febe:1ace", "fe80::d850:fbff:fe99:98fe", "fe80::f872:d1ff:fece:32cc", "fe80::ccd5:2dff:fe5b:32f8", "fe80::a6:65ff:fe4d:3a83", "fe80::80c8:c2ff:fe88:3e9d", "fe80::88a3:97ff:feb0:d411", "fe80::6c0e:2ff:fe33:8663", "fe80::e8cd:80ff:fe0b:94e8", "fe80::d429:7eff:fe77:efdf", "fe80::b882:f4ff:fefa:32dc", "fe80::cc7d:73ff:fe22:422f", "fe80::f0cc:c5ff:fe65:d8ec", "fe80::82:47ff:fe79:a2f9", "fe80::680e:d3ff:fe45:9e6", "fe80::2490:2aff:fe9f:1157", "fe80::944b:eaff:fef8:1ce6", "fe80::b4c8:fdff:fe59:6a6c", "fe80::45b:edff:fe5d:305e", "fe80::301b:16ff:fe60:7713", "fe80::a42d:eff:fe18:442a", "fe80::64cb:b9ff:fe97:6305", "fe80::600b:ebff:fedb:2e4f", "fe80::dc91:3bff:fe29:b75b", "fe80::5042:7eff:fe93:f008", "fe80::605f:84ff:fea7:9128" ], "agent.type": [ "filebeat" ], "host.os.version": [ "20.04.5 LTS (Focal Fossa)" ], "host.os.kernel": [ "5.4.0-125-generic" ], "host.os.name": [ "Ubuntu" ], "log.level": [ "warn" ], "agent.name": [ "myhost.my.dom" ], "elastic_agent.snapshot": [ false ], "host.name": [ "myhost.my.dom" ], "event.agent_id_status": [ "verified" ], "host.id": [ "aa3d81edad7148488d332274f0ec28e1" ], "log.origin.file.line": [ 429 ], "service.name": [ "filebeat" ], "host.os.type": [ "linux" ], "elastic_agent.id": [ "7b48d297-3b7b-4993-9525-8d398719e062" ], "data_stream.namespace": [ "ece" ], "host.os.codename": [ "focal" ], "input.type": [ "filestream" ], "log.offset": [ 378916652 ], "message": [ "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 8, 56, 53, 910440446, time.Local), Meta:{"input_id":"udp-fortinet_fortigate-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","raw_index":"logs-fortinet_fortigate.log-ece_forti","stream_id":"udp-fortinet_fortigate.log-4cd23f99-4dfb-401f-817f-5ef6f9d4ca05","truncated":false}, Fields:{"agent":{"ephemeral_id":"f754f48a-8030-41f9-ae3f-dd4e7a8edbd9","id":"7b48d297-3b7b-4993-9525-8d398719e062","name":"myhost.my.dom","type":"filebeat","version":"8.4.2"},"data_stream":{"dataset":"fortinet_fortigate.log","namespace":"ece_forti","type":"logs"},"ecs":{"version":"8.0.0"},"elastic_agent":{"id":"7b48d297-3b7b-4993-9525-8d398719e062","snapshot":false,"version":"8.4.2"},"event":{"dataset":"fortinet_fortigate.log"},"input":{"type":"udp"},"log":{"source":{"address":"1.1.1.1:24676"}},"message":"\u003c190\u003edate=2022-10-09 time=08:56:53 devname=\"UniLabFW\" devid=\"FG100FTK20025470\" eventtime=1665298613860494320 tz=\"+0200\" logid=\"0112053203\" type=\"event\" subtype=\"connector\" level=\"information\" vd=\"Border\" logdesc=\"Dynamic address updated.\" addr=\"FCTEMS_ALL_FORTICLOUD_SERVERS\" msg=\"Updated tag FCTEMS_ALL_FORTICLOUD_SERVERS\"","tags":["fortinet-fortigate","fortinet-firewall","forwarded"]}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [fortinet.firewall.addr] of type [ip] in document with id 'VkiIu4MBozNqveEQSZDG'. Preview of field's value: 'FCTEMS_ALL_FORTICLOUD_SERVERS'","caused_by":{"type":"illegal_argument_exception","reason":"'FCTEMS_ALL_FORTICLOUD_SERVERS' is not an IP string literal."}}, dropping event!" ], "data_stream.type": [ "logs" ], "host.architecture": [ "x86_64" ], "event.ingested": [ "2022-10-09T06:57:00.000Z" ], "@timestamp": [ "2022-10-09T06:56:54.742Z" ], "log.origin.file.name": [ "elasticsearch/client.go" ], "agent.id": [ "7b48d297-3b7b-4993-9525-8d398719e062" ], "host.containerized": [ false ], "ecs.version": [ "8.0.0" ], "host.os.platform": [ "ubuntu" ], "data_stream.dataset": [ "elastic_agent.filebeat" ], "log.file.path": [ "/opt/Elastic/Agent/data/elastic-agent-d3eb3e/logs/default/filebeat-20221009-8.ndjson" ], "agent.ephemeral_id": [ "b841ef51-a943-47b0-b224-2aa86de88024" ], "agent.version": [ "8.4.2" ], "host.os.family": [ "debian" ], "event.dataset": [ "elastic_agent.filebeat" ] } }

In the above case they come from an official Fortinet integration

but I also see other from my custom parsing:


"message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.October, 9, 7, 17, 38, 79000000, time.UTC), Meta:{\"input_id\":\"udp-udp-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"raw_index\":\"logs-udp.generic-ece_wlc_logs\",\"stream_id\":\"udp-udp.generic-e455bb9c-05f0-4d42-bd13-60593992bb55\",\"truncated\":false}, Fields:{\"agent\":{\"ephemeral_id\":\"09bd12a2-0e46-4f6a-9c34-cf67b4a8ae42\",\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"name\":\"myhost.my.dom\",\"type\":\"filebeat\",\"version\":\"8.4.2\"},\"client\":{\"mac\":\"22:7c:71:90:17:a5\"},\"data_stream\":{\"dataset\":\"udp.generic\",\"namespace\":\"ece_wlc_logs\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id\":\"4ba5a4aa-848d-4435-a13e-ace9584cddaa\",\"snapshot\":false,\"version\":\"8.4.2\"},\"event\":{\"action\":\"Client Authenticated\",\"dataset\":\"udp.generic\",\"provider\":\"APF-3-AUTHENTICATION_TRAP\",\"timezone\":\"CEST\"},\"input\":{\"type\":\"udp\"},\"log\":{\"logger\":\"haSSOServiceTask5\",\"source\":{\"address\":\"1.1.1.1:32837\"},\"syslog\":{\"hostname\":\"cisco-wlc-mgmt\"}},\"message\":\"\\u003c139\\u003ecisco-wlc-mgmt: *haSSOServiceTask5: Oct 09 09:17:38.079: %APF-3-AUTHENTICATION_TRAP: [PS]apf_80211.c:19558 Client Authenticated: MACAddress:22:7c:71:90:17:a5 Base Radio MAC:00:f2:8b:4d:bf:00 Slot:1 User Name:unknown Ip Address:unknown SSID:public-unibe\",\"oldwlc\":{\"orig\":{}},\"orig\":{\"timestamp\":\"Oct 09 09:17:38.079\"},\"source\":{\"ip\":\"unknown\",\"mac\":\"22:7c:71:90:17:a5\"},\"tags\":[\"wlc\",\"forwarded\",\"_dns_reverse_lookup_failed\"],\"user\":{\"name\":\"unknown\"},\"wlc\":{\"baseradiomac\":\"00:f2:8b:4d:bf:00\",\"ssid\":\"public-unibe\"}}, Private:interface {}(nil), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [source.ip] of type [ip] in document with id 'YWObu4MBF-Kn07JyVca4'. Preview of field's value: 'unknown'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"'unknown' is not an IP string literal.\"}}, dropping event!",

zez3 avatar Oct 09 '22 07:10 zez3

I think it is important, whatever the explanation given here, to update the Kibana UI to explain the metrics in layman's terms and to give optical feedback of when data loss is occurring, together with links to docs on how to deal with this.

nkakouros avatar Oct 11 '22 22:10 nkakouros

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

botelastic[bot] avatar Oct 11 '23 23:10 botelastic[bot]

+1, We are using Datadog to collect these metrics:

  • libbeat.output.events.dropped
  • libbeat.pipeline.events.dropped
  • libbeat.output.events.failed
  • libbeat.pipeline.events.failed

So it would be nice to have an explanation about their meaning in the documentation.

ajax-shmyrko-o avatar Jan 05 '24 12:01 ajax-shmyrko-o