
Daemon mode - shorten the time for a test cycle (modify/execute tests)

Open breml opened this issue 3 years ago • 22 comments

The current behavior of Logstash Filter Verifier (LFV) is to start a new instance of Logstash for every test run. With the --sockets flag (only available on systems with Unix domain sockets), this can be reduced to one new instance of Logstash per test suite run. Because of the rather slow start time of Logstash itself, the test cycle (modify the configuration, execute the tests) and with it the feedback loop is pretty slow. Most of the time is lost in the slow startup of Logstash (including the JVM and JRuby). Since Logstash 2.3 there is support for detecting changes to the configuration and automatically reloading it. So one idea to shorten the test cycle is to keep Logstash running, change its configuration at run time, and then perform the necessary tests. For this, the idea is to split LFV into two separate parts: one is a long-running process (daemon), which controls the equally long-running Logstash instance; the other is the tool that communicates with the daemon to start new test runs on demand and return the results. Depending on the final solution, it would also improve the situation for users on Windows significantly, because these users cannot use the --sockets feature today.

The purpose of this issue is to discuss this idea and how it could be implemented in LFV.

breml avatar Nov 27 '20 08:11 breml

Over the weekend, I did some thinking about the implementation of the daemon mode and how it could work. Additionally, I created a minimal prototype with some bash scripts to prove some of the ideas. So far, I still think it is doable. The good news is that launching a new set of test cases is a matter of seconds, which really speeds up the whole thing a lot.

One of the major problems will be the coordination between the LFV daemon and Logstash, in the sense that the LFV daemon always has to have correct information about the state Logstash is in. For example, it needs to be able to detect errors while loading the pipelines, e.g. if the Logstash configuration is not valid. One way of solving this is to process the Logstash log. I think this works best with the JSON log format.
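
As a rough sketch of that approach, the daemon could tail the JSON log and react to error-level entries. The following minimal Go example assumes the log4j2 JSON layout (field names level, loggerName, message) and a made-up log path:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// logEntry models the subset of Logstash's JSON log format we care about;
// the field names are assumptions based on the log4j2 JSON layout.
type logEntry struct {
	Level      string `json:"level"`
	LoggerName string `json:"loggerName"`
	Message    string `json:"message"`
}

func main() {
	f, err := os.Open("logs/logstash-json.log") // hypothetical log location
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		var e logEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue // skip lines that are not valid JSON
		}
		if e.Level == "ERROR" {
			// In the daemon, this would update the Logstash state machine.
			fmt.Printf("pipeline problem (%s): %s\n", e.LoggerName, e.Message)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}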

Another thing I am not yet sure about is which direction to take for the communication between the daemon and the client. Should it be possible to run the daemon (and Logstash) on a different machine than the CLI client, or should this only work on localhost? In the first case, the Logstash config under test as well as the test case definitions need to be passed over the network; in the second case, it would be enough to pass the path where these files are located from the CLI client to the daemon (because both are running on the same host).

So these are some of the findings and ideas so far. I will continue to work on this in the next days.

breml avatar Dec 07 '20 19:12 breml

Given the security implications of accessing (and executing) things over the network I think we should start with just supporting a local LFV daemon. How would the communication between the client and the daemon work? It needs to be secure, support multiple daemons on the same host, and require minimal configuration. Cross-platform would be great but maybe not feasible. Is there a better option than a Unix socket? HTTP across the wire would be an obvious choice, but since it's not a public API we could just as well use gRPC if that's less cumbersome to use.

magnusbaeck avatar Dec 12 '20 19:12 magnusbaeck

I agree, as a start we should limit the usage to the same host (and exclude anything over the network). I assume that if we really get the daemon mode running, it will become the new default due to the performance improvements it brings. Therefore, one of the fundamental questions is whether Windows should be supported by LFV v2.0 or not, because Unix sockets do not work on Windows. Additionally, I can clearly see running LFV over the network becoming interesting, not only on other hosts but also e.g. with Logstash (and the LFV daemon) running in a Docker image. In that case, I feel communicating with the daemon over the network would be a good choice for the future. If we go down this route (communication over the network), I think we would need to implement the "file transfer" part (sending the test cases and the Logstash config from the client to the daemon) from the beginning as well. I imagine something like a test case archive with a predefined file/folder structure, sent as a .zip file from the client to the daemon. If I recall correctly, I read an article about why the Go team chose .zip as the format for Go modules. The argument was that, due to the structure of a .zip archive, it is possible to efficiently read single files from the archive without processing the whole thing. They have also implemented some nice packages for working with .zip files (also in memory).
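
As an illustration of that property, reading individual files from such an archive is straightforward with Go's standard library; a minimal sketch (the archive name is made up):

package main

import (
	"archive/zip"
	"fmt"
	"io"
	"log"
)

func main() {
	// OpenReader parses the central directory at the end of the archive,
	// so individual files are accessible without scanning the whole body.
	r, err := zip.OpenReader("testcases.zip") // hypothetical archive name
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	for _, f := range r.File {
		rc, err := f.Open()
		if err != nil {
			log.Fatal(err)
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d bytes\n", f.Name, len(data))
	}
}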

I agree that the API is not (and should not be) public. Therefore we can start with whatever we are most comfortable with and improve later. From a performance point of view I do not see a lot of difference. I do not yet have experience with gRPC (but this could also be an interesting topic to learn). Clearly the nice thing about gRPC would be the nice API one gets to work with in the code (a client with the respective methods).

One other thing I am thinking about is the granularity of the API between the client and the daemon. With HTTP I would assume a rather "thick" API, that is, the client sends a request containing the test cases and the Logstash config in one bulk, waits for the server to process everything, and receives the collected results. With gRPC, one could think about a much more fine-grained API, where the client first initializes a new test run, then provides the Logstash config, then provides the content of the first test case, and so on. In the first case, more of the logic is in the daemon and the client is relatively dumb. In the second case, more logic is in the client. As said, I am not yet sure which way is better. What do you think?

breml avatar Dec 14 '20 21:12 breml

I agree, as a start we should limit the usage to the same host (and exclude anything over the network). I assume that if we really get the daemon mode running, it will become the new default due to the performance improvements it brings. Therefore, one of the fundamental questions is whether Windows should be supported by LFV v2.0 or not, because Unix sockets do not work on Windows.

If we limit ourselves to two execution modes, daemon mode and "classic" (one Logstash process per testcase file, which is the only execution mode supported on Windows today), we could continue to support Windows without much extra effort. The classic mode would also serve as an extra fool-proof fallback for other platforms.

Additionally, I can clearly see running LFV over the network becoming interesting, not only on other hosts but also e.g. with Logstash (and the LFV daemon) running in a Docker image. In that case, I feel communicating with the daemon over the network would be a good choice for the future.

Yes, definitely, but as long as the file system can be shared between the client and the daemon, Unix sockets can be used even if the daemon is running along with Logstash in a separate Docker container.

If we go down this route (communication over the network), I think we would need to implement the "file transfer" part (sending the test cases and the Logstash config from the client to the daemon) from the beginning as well. I imagine something like a test case archive with a predefined file/folder structure, sent as a .zip file from the client to the daemon.

Yes. This is easy enough that we might as well build this from the start instead of relying on shared file systems for the config files and testcases.

If I recall correctly, I read an article about why the Go team chose .zip as the format for Go modules. The argument was that, due to the structure of a .zip archive, it is possible to efficiently read single files from the archive without processing the whole thing. They have also implemented some nice packages for working with .zip files (also in memory).

Yeah, unlike tar files, zip files do have a central directory with all files and metadata. But the directory is naturally at the end of the file, so it assumes random access. A counterexample would be the build context in Docker image builds, which is passed to the Docker daemon as a tar file.

I agree that the API is not (and should not be) public. Therefore we can start with whatever we are most comfortable with and improve later. From a performance point of view I do not see a lot of difference. I do not yet have experience with gRPC (but this could also be an interesting topic to learn). Clearly the nice thing about gRPC would be the nice API one gets to work with in the code (a client with the respective methods).

Yeah. I've written a lot of HTTP interaction code lately and it gets tedious with network errors, HTTP errors, deserialization errors, and application errors to deal with, and retry loops everywhere. I haven't used gRPC but I do have some experience with Stubby (Google's internal gRPC predecessor). If you're up for it I think this would be a good excuse for learning gRPC.

One other thing I am thinking about is the granularity of the API between the client and the daemon. With HTTP I would assume a rather "thick" API, that is, the client sends a request containing the test cases and the Logstash config in one bulk, waits for the server to process everything, and receives the collected results. With gRPC, one could think about a much more fine-grained API, where the client first initializes a new test run, then provides the Logstash config, then provides the content of the first test case, and so on. In the first case, more of the logic is in the daemon and the client is relatively dumb. In the second case, more logic is in the client. As said, I am not yet sure which way is better. What do you think?

It would be possible to build a fine-grained API with HTTP too, but given the overhead in terms of code I'd probably opt for a thicker API if we choose HTTP. I think we'll want to have the thickness in the client to have more control, allow for more fine-grained error handling, and enable continuous user feedback.

magnusbaeck avatar Dec 20 '20 19:12 magnusbaeck

@magnusbaeck I did some work on a daemon mode proof of concept (currently in a private repo; I sent you an invite). This shows how the daemon can control the Logstash instance and how we can replace the Logstash pipeline configuration for the test executions.

I also created two graphs, which show how everything plays together:

[Diagram: cli-daemon-logstash-sequence.mmd]

[Diagram: logstash-control-states.mmd]

So far, I have taken the following learnings from the PoC:

  • From what I can tell so far, I expect that the daemon mode will bring quite a few changes to the "API" (e.g. flags) of the CLI tool, so I expect some "breaking" changes there:
    • Unix sockets mode will be removed.
    • Instead of providing the path to a Logstash filter configuration, one will provide the location of the pipelines.yml, and the CLI tool will then find and collect the respective Logstash configuration.
  • The daemon mode needs to inspect the provided Logstash pipeline and filter configuration in more detail, because it needs to be aware of each pipeline as well as of each input to the pipelines (excluding pipeline-to-pipeline communication). For this I updated https://github.com/breml/logstash-config, because this package did not yet export some of the attributes necessary for modifying existing Logstash configurations.
  • The format for the test cases will need some extensions as well, because for each test case one needs to define to which pipeline/input the respective message should be sent.
  • gRPC for the communication between the CLI and the daemon works fine and is pretty straightforward to use. Communication over Unix domain sockets works as well (see the dialing sketch after this list).
  • Sending the Logstash pipeline configuration as a zip archive from the CLI to the daemon works without problems.
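
To illustrate the last two points, here is a minimal Go sketch of dialing a gRPC server over a Unix domain socket; the socket path is a made-up example, and the generated service client is omitted:

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Hypothetical socket path; the daemon would create this on startup.
	socket := "/tmp/logstash-filter-verifier.sock"

	// Dial the socket directly instead of a TCP address. The dialer
	// ignores the resolved address and always connects to the socket.
	dialer := func(ctx context.Context, _ string) (net.Conn, error) {
		return (&net.Dialer{}).DialContext(ctx, "unix", socket)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithInsecure(), // no TLS needed for a local socket
		grpc.WithContextDialer(dialer),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// conn would now be passed to the generated gRPC client constructor.
}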

breml avatar Dec 31 '20 12:12 breml

The graphs look good! I'm glad that your PoC turned out well.

It would probably be worthwhile to try to optimize the pipeline reload operations. In the sequence diagram it looks like a new configuration is injected for each testcase (or testsuite?) but I imagine that's going to be slow when the configuration is large.

Instead of providing the path to a Logstash filter configuration, one will provide the location of the pipelines.yml, and the CLI tool will then find and collect the respective Logstash configuration.

For more complex cases we definitely want to be able to provide a pipelines.yml file as input, but perhaps we could generate the file in the simple cases? That should make the migration much easier for existing users.
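
For the simple single-config case, the generated file could be as small as this sketch (the pipeline id and config path are placeholders):

- pipeline.id: main
  path.config: "path/to/filter.conf"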

The format for the test cases will need some extensions as well, because for each test case one needs to define, to which pipeline/input the respective message should be sent.

Yes, sounds reasonable. If we want to ease migration by generating a pipelines.yml as noted above perhaps we could have a sane default for the pipeline id for the testcases?

I think I can work on LFV Tuesday night. Is there anything in particular you want me to work on? The package structure maybe?

magnusbaeck avatar Jan 04 '21 22:01 magnusbaeck

Sorry for the delay; Tuesday night has already passed, I guess. It would be a good starting point to refactor the package structure before I start to combine my PoC with the existing code base. So yes, this would be a good place to start.

In my tests, the pipeline reload has been a matter of seconds (e.g. 1-2), which is way better than the current situation. That said, it is true that I have not yet tested with large configurations. Logstash only reloads the parts (the pipelines) that have changed, so it does not need to reload everything, which will speed up the process for large configurations as well.

The reason why I currently plan to perform a reload before every testcase/testsuite is that I plan to change the way the test data is fed into the pipeline. I plan to create small "input" pipelines that use the generator input to send the events. In these input pipelines I have the chance to create the event exactly the way I need it for the test case. E.g. I am able to set @timestamp to a defined value (to get reproducible tests) and to populate @metadata or other nested fields. This comes in handy if one wants to simulate events that come from e.g. one of the Beats (Filebeat, Heartbeat). It should also be possible to set a unique ID for every message, which could then be used to order or correlate the processed events. As another special case, we can omit e.g. the message field if necessary (this is the case for events from Heartbeat).

I guess it should be possible to automatically generate the pipelines.yml for simple cases. I will keep this in mind, but it is not my main focus right now.

breml avatar Jan 08 '21 15:01 breml

I updated my PoC and I am now able to run tests with multiple test cases in a single file as well as test sets consisting of multiple test case files, and to evaluate the test results. For this I included the existing packages github.com/magnusbaeck/logstash-filter-verifier/logstash, github.com/magnusbaeck/logstash-filter-verifier/observer and github.com/magnusbaeck/logstash-filter-verifier/testcase. This turned out to work pretty smoothly.

From the provided test case files, the message as well as the fields (currently with some limitations) are also passed to the daemon and thus taken into account in the execution of the test cases.

Some numbers on my machine:

Startup of the daemon until it is ready for the first test case (including the startup of Logstash): ~12 seconds

All the tests below are run with a simple Logstash config. The time is the execution time of the CLI tool, which submits the tests to the daemon, waits for the results, and evaluates them against the expected values.

  • Execute 1 test suite file with 1 test case: ~3 seconds
  • Execute 1 test suite file with 100 test cases: ~3 seconds (does not really differ from a single test case)
  • Execute 10 test suite files with 1 test case each: ~13 seconds
  • Execute 10 test suite files with 100 test cases each: ~14 seconds
  • Execute 20 test suite files with 100 test cases each: ~26 seconds

So this is 2000 tests in ~26 seconds. To me, these numbers look pretty promising and this is without optimization so far.

Summary: ☑ All tests: 2000/2000
         ☑ testcase1 copy.json: 100/100
         ☑ testcase1.json: 100/100
         ☑ testcase10 copy.json: 100/100
         ☑ testcase10.json: 100/100
         ☑ testcase2 copy.json: 100/100
         ☑ testcase2.json: 100/100
         ☑ testcase3 copy.json: 100/100
         ☑ testcase3.json: 100/100
         ☑ testcase4 copy.json: 100/100
         ☑ testcase4.json: 100/100
         ☑ testcase5 copy.json: 100/100
         ☑ testcase5.json: 100/100
         ☑ testcase6 copy.json: 100/100
         ☑ testcase6.json: 100/100
         ☑ testcase7 copy.json: 100/100
         ☑ testcase7.json: 100/100
         ☑ testcase8 copy.json: 100/100
         ☑ testcase8.json: 100/100
         ☑ testcase9 copy.json: 100/100
         ☑ testcase9.json: 100/100

breml avatar Jan 25 '21 11:01 breml

So based on the above results, I feel like we should start to think about the integration into LFV. The main topic for me is how much you are willing to change in LFV with regard to the "API". There are the following topics:

  • To start LFV in different modes, we need to define the respective CLI API (flags, commands). Currently my PoC supports the following commands: start (start as daemon), shutdown (shut down a running daemon) and test (execute a test suite). In the future I can envision more commands like status (get the status of the daemon), check (only check the syntax of a Logstash config), lint (lint a Logstash config), fmt (convert a Logstash config into a well-defined format). The question is, do you want to keep the existing CLI API and handle all these cases with flags, or are you open to a change of the CLI API to support commands as well? (For convenience I used the cobra and viper packages for the PoC, which also allowed me to use a config file, reducing the need to type all the flags.)
  • My current PoC no longer takes the location of the Logstash config files as a parameter but instead the location of the pipelines.yml. The Logstash config files are then located based on the paths in the pipelines.yml.
  • Format of the test case files: To further extend the potential use cases that are testable with LFV, I see the following changes to the test case file format:
    • Allow events to be tested that do not have a message field, e.g. if one wants to test the processing of events that come from Heartbeat or Metricbeat.
    • Add pre- and post-processing steps, like adding mock time or cleaning up internal fields like __lfv_* (see https://github.com/magnusbaeck/logstash-filter-verifier/issues/5#issuecomment-480791567)
    • Add mocks for Logstash filters that perform call-outs (fix for https://github.com/magnusbaeck/logstash-filter-verifier/issues/44)

breml avatar Jan 25 '21 11:01 breml

An additional feature I just prototyped is support for different values in the fields for every test case. As a side effect, every test case gets a unique id. This allows grouping test cases with different values in the fields section in the same test case set. Additionally, this allows for events where no message field is present.

breml avatar Jan 25 '21 13:01 breml

The question is, do you want to keep the existing CLI API and handle all these cases with flags, or are you open to a change of the CLI API to support commands as well?

I think the earlier conclusion was that it'll be hard to keep the existing flag interface so I suggest we seize the opportunity and redesign it.

My current PoC no longer takes the location of the Logstash config files as a parameter but instead the location of the pipelines.yml. The Logstash config files are then located based on the paths in the pipelines.yml.

Ack.

Allow events to be tested that do not have a message field, e.g. if one wants to test the processing of events that come from Heartbeat or Metricbeat.

Is there currently a dependency on the existence of a message field? How so?

Add pre- and post-processing steps, like adding mock time or cleaning up internal fields like __lfv_*

Yeah. I suggest using jq for this. I've found https://github.com/itchyny/gojq to work quite well.

Add mocks for Logstash filters that perform call-outs (fix for #44)

Ack.

magnusbaeck avatar Jan 25 '21 21:01 magnusbaeck

The question is, do you want to keep the existing CLI API and handle all these cases with flags, or are you open to a change of the CLI API to support commands as well?

I think the earlier conclusion was that it'll be hard to keep the existing flag interface so I suggest we seize the opportunity and redesign it.

Good. Are you ok with https://github.com/spf13/cobra and https://github.com/spf13/viper for the command/flag and config file processing?

Allow events to be tested that do not have a message field, e.g. if one wants to test the processing of events that come from Heartbeat or Metricbeat.

Is there currently a dependency on the existence of a message field? How so?

In the current test case file we have the following structure:

{
  "fields": {
    ...
  },
  "testcases": [
    {
      "input": [
        "Oct  6 20:55:29 myhost myprogram[31993]: This is a test message"
      ],
      "expected": [
        ...
      ]
    }
  ]
}

The items in the testcases.input array are what triggers the events sent through the Logstash pipeline. Without an item in this array, no events are emitted. The value in testcases.input is then put into the message field. Even if the item in testcases.input is the empty string (""), the message field is present on the event.

Add pre- and post-processing steps, like adding mock time or cleaning up internal fields like __lfv_*

Yeah. I suggest using jq for this. I've found https://github.com/itchyny/gojq to work quite well.

I agree, gojq could be an interesting option. I also like https://github.com/tidwall/gjson, but I think it is less suited here. That being said, for the pre- and post-processing I am thinking more about Logstash filters, like the example in https://github.com/magnusbaeck/logstash-filter-verifier/issues/5#issuecomment-480791567, because without this snippet, the content of @metadata is not available in the output. jq or gojq can then be used to further post-process the result returned from Logstash (e.g. to process ignored fields). I have another Logstash filter (+ Ruby) snippet that I use to control the time of events (especially in cases where the time is not parsed from the message, e.g. when working with Heartbeat), so this snippet would be in the pre-processing stage.
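
As a sketch of what such a gojq-based post-processing step could look like, the following strips a hypothetical internal field from a result event before diffing (the query and field name are made up for illustration):

package main

import (
	"fmt"
	"log"

	"github.com/itchyny/gojq"
)

func main() {
	// Remove an LFV-internal bookkeeping field from a processed event.
	query, err := gojq.Parse("del(.__lfv_id)")
	if err != nil {
		log.Fatal(err)
	}

	event := map[string]interface{}{
		"message":  "This is a test message",
		"__lfv_id": "testcase1-0",
	}

	iter := query.Run(event)
	for {
		v, ok := iter.Next()
		if !ok {
			break
		}
		if err, isErr := v.(error); isErr {
			log.Fatal(err)
		}
		fmt.Println(v) // map[message:This is a test message]
	}
}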

After some more thinking, I see the possibility of adding the new features to the test case file structure in a backwards-compatible way. In this case, the only breaking changes with regard to the "API" would be related to the CLI/flags.

I will try to put together a proposal for the CLI/commands/flags as well as for the extensions I have in mind for the test case files.

breml avatar Jan 25 '21 22:01 breml

Good. Are you ok with https://github.com/spf13/cobra and https://github.com/spf13/viper for the command/flag and config file processing?

I haven't used either of them but they look fine. Looking at https://github.com/spf13/cobra#getting-started, it seems they recommend having the main package in the root directory (like LFV has today), while the proposal I made in a comment on #93 suggested putting the main package(s) in subpackages of /cmd (which seems to be a common pattern). For the package restructuring I'll leave the main package alone for now to avoid unnecessary churn. I have no particular opinion on where the main package(s) end up.

The items in the testcases.input array are what triggers the events sent through the Logstash pipeline. Without an item in this array, no events are emitted. The value in testcases.input is then put into the message field. Even if the item in testcases.input is the empty string (""), the message field is present on the event.

Oh, I see what you mean. If I have events with no message field it's typically because they're JSON strings, and then I just set the codec to json_lines. What did you have in mind to address this?

After some more thinking, I see the possibility of adding the new features to the test case file structure in a backwards-compatible way.

Yes, I'd like to avoid backwards incompatible changes to the testcase files.

magnusbaeck avatar Jan 26 '21 19:01 magnusbaeck

Good. Are you ok with https://github.com/spf13/cobra and https://github.com/spf13/viper for the command/flag and config file processing?

I haven't used either of them but they look fine. Looking at https://github.com/spf13/cobra#getting-started, it seems they recommend having the main package in the root directory (like LFV has today), while the proposal I made in a comment on #93 suggested putting the main package(s) in subpackages of /cmd (which seems to be a common pattern). For the package restructuring I'll leave the main package alone for now to avoid unnecessary churn. I have no particular opinion on where the main package(s) end up.

The decision where we put the main package (root directory or a subdirectory of /cmd) is in my opinion not influenced by the package we use for flag, command, and config file processing. In my experience, it is more important to think about the main purpose of a Go module (or repository). For example, if the main purpose is to provide a library that is expected to be imported by others (e.g. https://github.com/breml/logstash-config), I would put the library code into the root directory and the commands in /cmd. The same is true if I envision that a project will emit multiple binaries (multiple main packages) in the future. On the other hand, if the main purpose is to provide a (single) tool (and this is how I currently feel about LFV), then I would put the main package into the root directory. Put differently, in my opinion the code for the main purpose of a Go repo belongs in the root of the repo.

In regards to cobra and viper: I consider these packages best of breed for CLI tools (they are used e.g. by kubectl and other well-known CLI tools written in Go).

The items in the testcases.input array are what triggers the events sent through the Logstash pipeline. Without an item in this array, no events are emitted. The value in testcases.input is then put into the message field. Even if the item in testcases.input is the empty string (""), the message field is present on the event.

Oh, I see what you mean. If I have events with no message field it's typically because they're JSON strings, and then I just set the codec to json_lines. What did you have in mind to address this?

In my current PoC I have implemented the input/event-emitting part with the following Logstash snippet:

input {
  generator {
    lines => [
      "0", "1"
    ]
    count => 1
    codec => plain
    threads => 1
  }
}

filter {
  mutate {
    # Remove the fields "sequence" and "host", which are automatically created by the generator input.
    remove_field => [ "host", "sequence" ]
    # We use the message as the LFV event ID, so move this to the right field.
    replace => {
      "[@metadata][__lfv_id]" => "%{[message]}"
    }
  }

  translate {
    dictionary_path => "fields.json"
    field => "[@metadata][__lfv_id]"
    destination => "[@metadata][__lfv_fields]"
    exact => true
    override => true
  }
  ruby {
    code => 'fields = event.get("[@metadata][__lfv_fields]")
             fields.each { |key, value| event.set(key, value) }'
  }
}

output {
  pipeline {
    send_to => [lfv_sut_in]
  }
}

fields.json

{
  "0": {
    "message": "test case message",
    "type": "syslog"
  },
  "1": {
    "message": "test case message 2",
    "type": "syslog"
  }
}

So what this snippet does is the following:

  1. Use the generator input to generate the necessary number of events (based on the number of test cases in the test case file).
  2. Remove every field automatically added by the generator input and move the event ID from message to [@metadata][__lfv_id].
  3. Enrich the event with all the necessary fields by looking them up in fields.json with the translate filter, based on the event ID ([@metadata][__lfv_id]), and moving them from [@metadata][__lfv_fields] to the root of the event (via the ruby filter snippet).
  4. Send the event via pipeline-to-pipeline communication to the "system under test", that is, the filter pipeline that should be tested.

breml avatar Jan 27 '21 20:01 breml

Put differently, in my opinion the code for the main purpose of a Go repo belongs in the root of the repo.

That makes total sense.

So what this snippet does is the following:

Interesting approach. I guess I don't understand why you can't generate a JSON file with

{"@metadata": {"__lfv_id": 0}, "message": "test case message", "type": "syslog"}}

and pass that to Logstash via e.g. some network input. (The @metadata field is special so that might screw things up, but there are ways around that.)

magnusbaeck avatar Jan 28 '21 21:01 magnusbaeck

Interesting approach. I guess I don't understand why you can't generate a JSON file with

{"@metadata": {"__lfv_id": 0}, "message": "test case message", "type": "syslog"}}

and pass that to Logstash via e.g. some network input. (The @metadata field is special so that might screw things up, but there are ways around that.)

I don't say that the same is not possible with a network input. As you mention, one problem could be the @metadata stuff. But for me, another property of the generator input is way more important: Logstash stops the above pipeline after the generator input has emitted all events. Because of the count => 1 setting and the fact that this pipeline does not contain other inputs, Logstash is able to determine when a pipeline has finished its work. This shutdown is then logged by Logstash, and I can read this information from the logs. This allows me to update the state machine. With a network input this will not happen, because there is no way for Logstash to tell when all events have been received; in fact, Logstash will just continue to listen. In that regard, the generator input works pretty similarly to the stdin input when stdin receives EOF (is closed). But the stdin plugin prevents a reload of the pipeline that includes it, and the stdin input plugin can (obviously) only be present once in a config.

breml avatar Jan 29 '21 07:01 breml

With regard to the "API" of the ./logstash-filter-verifier command, for version 2 I have the following basic structure in mind (a minimal cobra sketch follows the list):

  • The existing "standalone" mode for LFV: ./logstash-filter-verifier standalone. This sub-command could accept all the currently present command line arguments the same way it works today (with the exception of the command line flags that we remove, e.g. everything socket-related, if we stick to the plan to remove the socket-based functionality).
  • Run the daemon: ./logstash-filter-verifier daemon.
  • Execute test cases in the daemon: ./logstash-filter-verifier test
  • Get the current status of the daemon: ./logstash-filter-verifier status
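
A minimal cobra sketch of this command structure might look as follows (command wiring only; the run functions are placeholders):

package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	root := &cobra.Command{Use: "logstash-filter-verifier"}

	root.AddCommand(
		&cobra.Command{
			Use:   "daemon",
			Short: "Start the long-running LFV daemon controlling Logstash",
			RunE: func(cmd *cobra.Command, args []string) error {
				fmt.Println("daemon: placeholder")
				return nil
			},
		},
		&cobra.Command{
			Use:   "test [testcase files]",
			Short: "Submit test cases to a running daemon",
			RunE: func(cmd *cobra.Command, args []string) error {
				fmt.Println("test: placeholder")
				return nil
			},
		},
		&cobra.Command{
			Use:   "status",
			Short: "Get the current status of the daemon",
			RunE: func(cmd *cobra.Command, args []string) error {
				fmt.Println("status: placeholder")
				return nil
			},
		},
	)

	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}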

One open question is, what the "default" command (./logstash-filter-verifier, without sub-command) should do. I see the following possibilities:

  • Print usage/help information (the same as ./logstash-filter-verifier help and ./logstash-filter-verifier --help)
    • Make the change from v1 to v2 obvious to every user and print helpful information on how to use the tool.
  • Default to the version 1 behavior to keep backwards compatibility (the same as ./logstash-filter-verifier standalone)
    • Arguments: Keep backwards compatibility.
  • Run the tests in the daemon (the same as ./logstash-filter-verifier test)
    • Arguments: I expect this to be the most used command in the future.
  • Allow the default command to be configured in the config file. (Pretty advanced, but we would still need to define a default for the case where no config file is found.)
  • Something else

Maybe we should also consider renaming the command to something shorter (e.g. ./lfv). The existing command is quite long to type.

breml avatar Jan 29 '21 10:01 breml

But for me, another property of the generator input is way more important: Logstash stops the above pipeline after the generator input has emitted all events. Because of the count => 1 setting and the fact that this pipeline does not contain other inputs, Logstash is able to determine when a pipeline has finished its work. This shutdown is then logged by Logstash, and I can read this information from the logs.

Ah, I get it. Very clever! Too bad we have to jump through hoops like this though.

Regarding the command API, I don't have particularly strong opinions but I'm leaning towards making the command verb mandatory. Explicit is better than implicit. Yes, it's slightly more to type but I'd expect most invocations to be hardcoded into a CI job, a tox.ini file, a makefile or similar (or just the shell history). Especially if we can't keep flag compatibility, i.e. if people will have to rework their commands after a v2 upgrade anyway. If compatibility can be maintained there could be a point in making a verb-less command work like in v1.

For the same reasons I think we can keep the name of the binary.

There should probably be a daemon shutdown command too, right?

magnusbaeck avatar Jan 30 '21 18:01 magnusbaeck

Here is a proposal for the updated test case file format. There are no breaking changes, but there are two fields I propose to remove in the future (deprecated). All the other fields are additions that support the new test pipeline design as well as pipeline-to-pipeline communication.

# The unique ID of the input plugin in the tested configuration, where the test
# input is coming from. This is necessary if a multi-stage pipeline configuration
# with pipeline-to-pipeline communication and multiple inputs is tested.
# https://www.elastic.co/guide/en/logstash/7.10/plugins-inputs-file.html#plugins-inputs-file-id
input: "file_input_1"

# If metadata is set, the value is used as the name of the key where the content
# of the [@metadata] field is exported. The respective fields are then compared
# with the expected result of the testcase as well. (optional)
metadata: "__metadata"

# Filter mocks allow replacing an existing filter, identified by its ID, in the
# config with a mock implementation.
# This comes in handy if a filter performs a call out to an external system,
# e.g. a lookup in Elasticsearch.
# Because the existing filter is replaced with whatever is present in `filter`,
# it is also possible to remove a filter by simply keeping `filter` empty (or
# not present).
filter_mocks:
  - id: "elasticsearch-lookup-removed"
  - id: "elasticsearch-lookup-constant"
    filter: |
      # Constant lookup, does return the same result for each event / test case
      mutate {
        replace => {
          "[constant_field_from_elasticsearch]" => "constant mocked value from elasticsearch"
        }
      }
  - id: "elasticsearch-lookup-dynamic"
    filter: |
      # Dynamic lookup, does return different result for every event / test case
      # The dictionary maps based on the test case id
      translate {
        dictionary => {
          "syslog-test-message" => "dynamicly mocked value from elasticsearch"
        }
        field => "[@metadata][__lfv_id]"
        destination => "[dynamic_field_from_elasticsearch]"
        exact => true
        override => true
      }

# I am not yet sure about pre_process and post_process, so this will not be
# part of an initial implementation and maybe added later.
# Potential use cases are:
# * the above mentioned `metadata` is a case which could be handled with
#   post_process.
# * fix dates if they are expected in a certain year but the input does not
#   contain the year, as is the case with some syslog or Kubernetes date
#   formats.
#pre_process:
#post_process:

# Codec names the Logstash codec that should be used when events are read.
# This is normally "line" or "json_lines". (optional, default: line)
#
# This is no longer used with daemon mode.
codec: line

# Fields to be ignored when the result is compared with the expected value.
# (optional)
ignore:
  - ignored_field

# Global fields, added to every event in this test case set; useful for fields
# that are set by the input plugin or that are sent as constant values from the
# sending instance (e.g. Filebeat, Heartbeat). (optional)
fields:
  host: "localhost"
  type: "syslog"

# Global input contains the lines of input that should be fed to the Logstash
# process.
#
# (DEPRECATED, use input in testcases instead, we should remove this for
# version 2)
input:
  - "Oct  6 20:55:29 myhost myprogram[31993]: This is a test message"

# Global expected contains a slice of expected events to be compared to the
# actual events produced by the Logstash process.
#
# (DEPRECATED, use expected in testcases instead, we should remove this for
# version 2)
expected:
  - "@timestamp": "2015-10-06T20:55:29.000Z"
    host: "myhost"
    message: "This is a test message"
    pid: 31993
    program: "myprogram"
    type: "syslog"

testcases:
  - # Unique id of the test case / test event (optional)
    # If set, LFV will use this id in the output (which tests have failed)
    # If not set, the id of the test case is its filename and index
    id: "syslog-test-message"

    # Description contains an optional description of the test case
    # which will be printed while the tests are executed. (optional)
    description: "Description of the current test case."

    # Local fields, only added to the events of this test case
    event:
      - # Overwrite the timestamp for an event.
        # The timestamp field is treated in a special way by Logstash, which
        # prevents this field from being set with the normal filters.
        # Therefore some additional logic is necessary to allow setting @timestamp.
        "@timestamp": "2021-01-28T14:51:15.000Z"
        # An event, consisting of the following fields.
        # Globally defined field values with the same key are overwritten.
        host: "otherhost"
        # The input message can also be set directly in the fields; in this case,
        # the "input" key can be omitted. But a present value in input will
        # overwrite this field value.
        message: "Oct  6 20:55:29 myhost myprogram[31993]: This is a test message"
        type: "syslog"
        fields:
          nested_field: value
        tags: [ "some", "tags" ]

    # input contains the lines of input that should be fed to the Logstash process.
    # input will be written to the field "message". An already present value
    # (defined globally or in event) is overwritten.
    input:
      - "Oct  6 20:55:29 myhost myprogram[31993]: This is a test message"

    # expected contains a slice of expected events to be compared to the
    # actual events produced by the Logstash process.
    expected:
      - "@timestamp": "2015-10-06T20:55:29.000Z"
        host: "myhost"
        message: "This is a test message"
        pid: 31993
        program: "myprogram"
        type: "syslog"

    # The unique IDs of the output plugins in the tested configuration, where
    # the event leaves Logstash. (optional)
    # If no value is present, this criterion is not verified.
    # If a value is present, the event is expected to be processed by
    # the exact list of expected outputs.
    # By listing multiple output plugins it is possible to test Logstash
    # configurations with multiple (conditional) outputs:
    # e.g. save the events to elasticsearch and, if the threshold is above x,
    # additionally send an email.
    expected_outputs:
      - "elasticsearch_output_1"
      - "email_output_1"

Let me know what you think.

breml avatar Feb 01 '21 14:02 breml

Looks good! The filter mock feature is really cool. I'll have use for it in one or two cases where I've currently used environment variable references in the config to redirect translate filters to use a testdata YAML file. Just two comments:

  • Do we really need the metadata option? Can't we just rename @metadata to __metadata (or whatever we choose) as the last thing in the pipeline, and then rename it back to @metadata in LFV before handing it to the diff, so that things just work transparently for the users? Or is that actually what you're proposing, just that the naming of the temporary workaround field should be configurable?
  • I've started work on a PR to get rid of the input and expected options.

magnusbaeck avatar Feb 06 '21 21:02 magnusbaeck

Looks good! The filter mock feature is really cool. I'll have use for it in one or two cases where I've currently used environment variable references in the config to redirect translate filters to use a testdata YAML file. Just two comments:

  • Do we really need the metadata option? Can't we just rename @metadata to __metadata (or whatever we choose) as the last thing in the pipeline, and then rename it back to @metadata in LFV before handing it to the diff, so that things just work transparently for the users? Or is that actually what you're proposing, just that the naming of the temporary workaround field should be configurable?

From a backwards-compatibility point of view I think we need this option to be "opt-in". Otherwise, the test cases will fail for users who use @metadata but did not check for its values in their test cases so far (because it was not a straightforward thing to do in the past). So the option I am proposing does both: it gives the user a way to opt in to the @metadata values by setting the metadata key, and it alters the target field where the values of @metadata are stored, by setting the metadata key to e.g. __metadata.

I personally have a lot of test cases where I use my workaround (see https://github.com/magnusbaeck/logstash-filter-verifier/issues/5#issuecomment-480791567) and therefore I would prefer to keep __metadata as the resulting key, or at least have the possibility to rename it to this value.

That being said, we could also "untangle" the two cases and have two different options, e.g.:

exportMetadata: true
metadataKey: __metadata
  • I've started work on a PR to get rid of the input and expected options.

Great, I will wait with the integration of the daemon mode into LFV until you have finished this work. Until then, I will continue to work on my PoC. After I made it work successfully I am now refactoring it to make the code cleaner and more maintainable for the future.

breml avatar Feb 08 '21 07:02 breml

Yeah, let's use a bool option for enabling @metadata support and, if needed, a separate option for picking the field to use.

Getting rid of input and expected turned out to be quite trivial, although they'll continue to exist internally but won't be exposed to the testcase files. Getting rid of them entirely would be a lot more work since several places will suddenly have to start looping over testcases in a different way. We might want to perform that refactoring anyway. Let's see how things turn out.

magnusbaeck avatar Feb 10 '21 21:02 magnusbaeck