beats icon indicating copy to clipboard operation
beats copied to clipboard

Add concrete types to logstash

Open fearful-symmetry opened this issue 3 years ago • 3 comments

What does this PR do?

This fixes rather odd bug that would happen the logstash module we receive a large integer value, handle it as a float, and then re-encode the json event as an integer or a float depending on the number of digits in the resulting number. This resulted in some mapping errors as fields would sometimes be floats, and sometimes be integers when received by logstash on the other end of metricbeat.

However, this ended up being a much more complex change than I anticipated, and I'd really like someone with some amount of logstash experience to go over this and make sure what I'm doing is correct. I'm particularly worried that some of these fields have changed across logstash versions in a way I don't know about. @mashhurs has done some work to document these fields, hence the review tag.

Why is it important?

This would result in mapping errors in cases where the logstash module events was being sent to logstash, and then finally to elasticsearch.

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

fearful-symmetry avatar Aug 03 '22 20:08 fearful-symmetry

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Aug 03 '22 20:08 elasticmachine

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-08-04T21:17:00.883+0000

  • Duration: 60 min 20 sec

Test stats :test_tube:

Test Results
Failed 0
Passed 3584
Skipped 887
Total 4471

:green_heart: Flaky test report

Tests succeeded.

:robot: GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine avatar Aug 03 '22 20:08 elasticmachine

I've tested the change against a dummy pipeline[1] and it appears to break the Pipeline viewer[2] in the Stack Monitoring app when we look for a type property. When comparing the vertices generated by this change with the one generated by main branch a few properties are lost in translation[3]. I'm wondering how tests are successfully passing

[1]

input {
  java_generator {
    eps => 10
  }
}

output {
  elasticsearch {
    hosts => "http://localhost:9200"
    user => "elastic"
    password => "changeme"
  }
}

[2] Screenshot 2022-08-10 at 11 30 07

[3]

  ----- vertices from main branch -----
  "vertices": [
    {
      "plugin_type": "input",
      "type": "plugin",
      "config_name": "java_generator",
      "explicit_id": false,
      "id": "695d0e6c03a6147224cc8f2725e69bdab8d4973d986d9f3c0ba67144ff8b0112",
      "meta": {
        "source": {
          "id": "/Users/lacabane/Downloads/logstash-8.2.2/logstash.dev-1.conf",
          "line": 2,
          "protocol": "file",
          "column": 3
        }
      }
    },
    {
      "explicit_id": false,
      "id": "__QUEUE__",
      "meta": null,
      "type": "queue"
    },
    {
      "type": "plugin",
      "config_name": "elasticsearch",
      "explicit_id": false,
      "id": "59a9dfe22c583dac4e348cf98fbd4148f4f1a7b31d89c31f2f40b273aa693fb6",
      "meta": {
        "source": {
          "column": 3,
          "id": "/Users/lacabane/Downloads/logstash-8.2.2/logstash.dev-1.conf",
          "line": 8,
          "protocol": "file"
        }
      },
      "plugin_type": "output"
    }
  ]

  ----- vertices from this branch -----
  "vertices": [
    {
      "events_out": 0,
      "events_in": 0,
      "duration_in_millis": 0,
      "queue_push_duration_in_millis": 0,
      "id": "695d0e6c03a6147224cc8f2725e69bdab8d4973d986d9f3c0ba67144ff8b0112",
      "pipeline_ephemeral_id": "",
      "cluster_uuid": ""
    },
    {
      "queue_push_duration_in_millis": 0,
      "id": "__QUEUE__",
      "pipeline_ephemeral_id": "",
      "cluster_uuid": "",
      "events_out": 0,
      "events_in": 0,
      "duration_in_millis": 0
    },
    {
      "queue_push_duration_in_millis": 0,
      "id": "59a9dfe22c583dac4e348cf98fbd4148f4f1a7b31d89c31f2f40b273aa693fb6",
      "pipeline_ephemeral_id": "",
      "cluster_uuid": "",
      "events_out": 0,
      "events_in": 0,
      "duration_in_millis": 0
    }
  ]

klacabane avatar Aug 10 '22 09:08 klacabane

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change) To fixup this pull request, you need to add the backport labels for the needed branches, such as:
  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

mergify[bot] avatar Aug 16 '22 15:08 mergify[bot]

@matschaffer would you have any thoughts on this ? My assumption is that the metricbeat mapping would coerce any numeric type to long, and that this change is not necessary since the mapping would do that work for us. Maybe this issue could happen if a custom template with dynamic mapping takes precedence ?

klacabane avatar Aug 17 '22 15:08 klacabane

I have some questions. Quoting from the thread above but posting here since they don't have much to do with the type Vertex struct code directly.

It seems fairly hard to come up with an authoritative list of what can be in any one of these fields?

@robbavey or @karenzone might be able to help on this front.

this is only happening on fields that have enough trialing digits that when metricbeat reads it as a float

Is there an example of such a field in the test data? I was looking through the diff but maybe I just missed it.

I'm wondering about maybe simulating such a value (could maybe tweak LS source to always send a known bad value). Then we could explore the problem space a bit more.

The issue seems to be when events are sent to ES via logstash

Does it happen without logstash in the middle too?

If metricbeat is reading the long as a float I'd expect the issue to occur with or without logstash between metricbeat & ES.

matschaffer avatar Aug 18 '22 00:08 matschaffer

@matschaffer

Is there an example of such a field in the test data? I was looking through the diff but maybe I just missed it.

There's some examples in the associated customer case, but the issue is that beat's own JSON encoder has some logic such that if typeof(value) == float && number_of_digits(value) > x; then format_as_scientific_notation() so "floats" that are sufficiently large like 12345678.0 will become "1.2345678e7" or whatever, which is breaking something (ES? Logstash?) when it tries to read it as a long type, which I assume wouldn't happen if the float was rendered as "12345678.0"

This seems to be why the bug is so sporadic; we'll only hit a mapping failure if a JSON value is over a certain number of digits.

fearful-symmetry avatar Aug 18 '22 17:08 fearful-symmetry

Does it happen without logstash in the middle too?

To be clear, this case involves: Logstash <== [Metrics read by] == Metricbeat Logstash Module == [writes to] ==> Elasticseach

and not Metricbeat == [writes to] ==> Logstash == [writes to] ==> Elasticsearch

Logstash is not in the middle of the interaction, it is the subject being monitored.

bplies-ATX avatar Aug 18 '22 18:08 bplies-ATX

Logstash <== [Metrics read by] == Metricbeat Logstash Module == [writes to] ==> Elasticseach

Ah, must have misinterpreted the original issue, I thought metricbeat was going to logstash.

fearful-symmetry avatar Aug 18 '22 18:08 fearful-symmetry

Thanks!

@fearful-symmetry Is this logic in beats something we can override for given fields?

if typeof(value) == float && number_of_digits(value) > x; then format_as_scientific_notation()

If not we should probably set the mapping for logstash_stats.pipelines.queue.capacity.queue_size_in_bytes float in ES.

matschaffer avatar Aug 22 '22 02:08 matschaffer

Looks like one of the values from the case was 1073741824 which is a lot smaller than Java's Long.MAX_VALUE - could we see this on potentially any beats-integested long?

matschaffer avatar Aug 22 '22 02:08 matschaffer

Is this logic in beats something we can override for given fields?

@matschaffer We can, because we're using our own custom marshaller (for some reason), and that x value is actually a bit lower than what the built-in golang JSON marshaller has set. This was actually something I investigated, but I decided against it because it seemed like a bit of a hack; after all, it technically should be a long.

could we see this on potentially any beats-integested long

I suspect so, and the only reason we haven't run into this in other places is due to the combination of us only hitting it when we convert JSON long values into golang float64 and then back out to a field that should be a long.

Considering how...touchy this set of fields is compared to what I initially thought, I'm suspecting that the best way to do this is to set an override for the JSON encoder that just forces it to read everything as a float.

fearful-symmetry avatar Aug 22 '22 15:08 fearful-symmetry

Considering how...touchy this set of fields is compared to what I initially thought, I'm suspecting that the best way to do this is to set an override for the JSON encoder that just forces it to read everything as a float.

Well changes to this field set could definitely cause Stack Monitoring UI trouble so we'll want to verify like @klacabane did.

Ultimately whatever helps preserve the original value that logstash gives us sounds good to me.

If for some reason a value exceeds ES's (Java) long capacity, then we might have another issue on our hands but if beats can pass it through basically as-is I think that'd be best.

matschaffer avatar Aug 22 '22 21:08 matschaffer

Thinking we should close this in favor of https://github.com/elastic/beats/pull/32787 - feel free to reopen if I'm mistaken.

matschaffer avatar Aug 24 '22 02:08 matschaffer

@matschaffer , yep, the plan was to close this once I got CI to pass on that other PR, thanks ;)

fearful-symmetry avatar Aug 24 '22 16:08 fearful-symmetry