vector icon indicating copy to clipboard operation
vector copied to clipboard

enhancement(kubernetes_logs source): Wire up acknowledgement support via existing file ack support

Open ganelo opened this issue 10 months ago • 13 comments

Summary

Mirror changes made to file source as part of implementing acknowledgements support there over to kubernetes_logs source.

Change Type

  • [ ] Bug fix
  • [x] New feature
  • [ ] Non-functional (chore, refactoring, docs)
  • [ ] Performance

Is this a breaking change?

  • [ ] Yes
  • [x] No

How did you test this PR?

Given that the underlying feature work for implementing acknowledgements in the file source is well-tested, my plan is to deploy a custom build of vector with this change to a dev environment and simply smoke test to confirm that, when acknowledgements are enabled via config, we see evidence that log events are not considered processed until they are fully delivered at the end of the sink chain (eg by counting unique log event ids at the end of the test after bouncing some of the intermediate hops in chain during the test).

Does this PR include user facing changes?

  • [x] Yes. Please add a changelog fragment based on our guidelines.
  • [ ] No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • [x] Please read our Vector contributor resources.
    • make check-all is a good command to run locally. This check is defined here. Some of these checks might not be relevant to your PR. For Rust changes, at the very least you should run:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • [ ] If this PR introduces changes Vector dependencies (modifies Cargo.lock), please run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

  • Closes: https://github.com/vectordotdev/vector/issues/7462
  • Ref https://github.com/vectordotdev/vector/issues/7336 and https://github.com/vectordotdev/vector/issues/9856 for global tracking of ACK support across sources
  • Ref https://github.com/vectordotdev/vector/issues/7459 for ACK support in file source, which this PR depends on (implemented in https://github.com/vectordotdev/vector/pull/7816)
  • Ref https://github.com/vectordotdev/vector/issues/9856 (which was closed as a dupe of 7462)

ganelo avatar Feb 04 '25 20:02 ganelo

Hi @ganelo and thank you for this PR. The make check-component-docs check failed. Can you take a look?

pront avatar Feb 06 '25 22:02 pront

Hi @ganelo and thank you for this PR. The make check-component-docs check failed. Can you take a look?

Hi @pront - this seems to be broken on master, as far as I can tell. In any case, nothing I've touched should have broken the ruby script. Not sure how best to proceed.

ganelo avatar Feb 06 '25 22:02 ganelo

Hi @ganelo and thank you for this PR. The make check-component-docs check failed. Can you take a look?

Hi @pront - this seems to be broken on master, as far as I can tell. In any case, nothing I've touched should have broken the ruby script. Not sure how best to proceed.

Can you share the failure? The script requires Ruby 3, which often trips people up.

jszwedko avatar Feb 06 '25 22:02 jszwedko

Hi @pront - this seems to be broken on master, as far as I can tell.

I verified that it is not broken master.

In any case, nothing I've touched should have broken the ruby script. Not sure how best to proceed.

This PR adds configuration fields and we rely on source code to generate docs. Hence, docs need to regenerated. As @jszwedko pointed out we have a dependency on Ruby 3. My version locally:

❯ ruby -v
ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]

pront avatar Feb 06 '25 23:02 pront

Ah okay, thanks for clarifying! Will try again tomorrow with the correct version of Ruby to get to the bottom of the failure.

On Thu, Feb 6, 2025, 18:37 Pavlos Rontidis @.***> wrote:

Hi @pront https://github.com/pront - this seems to be broken on master, as far as I can tell.

I verified that it is not broken master.

In any case, nothing I've touched should have broken the ruby script. Not sure how best to proceed.

This PR adds configuration fields and we rely on source code to generate docs. Hence, docs need to regenerated. As @jszwedko https://github.com/jszwedko pointed out we have a dependency on Ruby 3. My version locally:

❯ ruby -v ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]

— Reply to this email directly, view it on GitHub https://github.com/vectordotdev/vector/pull/22366#issuecomment-2641355696, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB4OKWXU3ELUEEVUJQVLFDD2OPW3BAVCNFSM6AAAAABWPOJTRWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNBRGM2TKNRZGY . You are receiving this because you were mentioned.Message ID: @.***>

ganelo avatar Feb 07 '25 00:02 ganelo

Looks better now 👍 Is it possible to add a test to cover the acks feature? You can draw inspiration from other sources e.g. the file source: https://github.com/vectordotdev/vector/blob/b90715bd452e03f4ca151114f5475daca3d99832/src/sources/file.rs#L2375-L2376

pront avatar Feb 07 '25 15:02 pront

Looks better now 👍 Is it possible to add a test to cover the acks feature? You can draw inspiration from other sources e.g. the file source:

https://github.com/vectordotdev/vector/blob/b90715bd452e03f4ca151114f5475daca3d99832/src/sources/file.rs#L2375-L2376

To confirm - there aren't any existing tests that actually exercise the kubernetes_logs source, right? I poked around and didn't find much that did anything beyond some niche unit tests. Certainly nothing to mirror the run_file_source fn of file.rs.

I'm not utterly opposed to writing one from more or less whole cloth, but it will be somewhat slow going as I'm a relative newcomer to Rust.

ganelo avatar Feb 07 '25 17:02 ganelo

@pront after a lot of off-and-on effort, finally managed to get some working tests added - note that the current tests require permissions to write to /var/log since that's where the kube crate expects kubernetes logs to be. if that's a problem, I'm not sure how to get around that requirement

ganelo avatar Apr 18 '25 16:04 ganelo

@pront after a lot of off-and-on effort, finally managed to get some working tests added - note that the current tests require permissions to write to /var/log since that's where the kube crate expects kubernetes logs to be. if that's a problem, I'm not sure how to get around that requirement

Maybe I don't know the implementation well enough, but could we use let data_dir = tempdir().unwrap(); and set include_paths_glob_patterns to that?

pront avatar Apr 28 '25 19:04 pront

@pront after a lot of off-and-on effort, finally managed to get some working tests added - note that the current tests require permissions to write to /var/log since that's where the kube crate expects kubernetes logs to be. if that's a problem, I'm not sure how to get around that requirement

Maybe I don't know the implementation well enough, but could we use let data_dir = tempdir().unwrap(); and set include_paths_glob_patterns to that?

I'm afraid not - data_dir only controls where vector writes its checkpoint files, the location that vector will look at to find the actual input kubernetes log files for the source is hardcoded here:

https://github.com/vectordotdev/vector/blob/689a65d51a2f8034f220afcf5e65e1b924019687/src/sources/kubernetes_logs/path_helpers.rs#L10

ganelo avatar May 14 '25 20:05 ganelo

@pront after a lot of off-and-on effort, finally managed to get some working tests added - note that the current tests require permissions to write to /var/log since that's where the kube crate expects kubernetes logs to be. if that's a problem, I'm not sure how to get around that requirement

Maybe I don't know the implementation well enough, but could we use let data_dir = tempdir().unwrap(); and set include_paths_glob_patterns to that?

I'm afraid not - data_dir only controls where vector writes its checkpoint files, the location that vector will look at to find the actual input kubernetes log files for the source is hardcoded here:

https://github.com/vectordotdev/vector/blob/689a65d51a2f8034f220afcf5e65e1b924019687/src/sources/kubernetes_logs/path_helpers.rs#L10

I see, is there anything preventing us from refactoring this? Tests requiring permissions to /var/log isn't great.

A hackier alternative is to provide different values based on #[cfg(test)] and #[cfg(not(test))].

pront avatar May 15 '25 13:05 pront

@pront after a lot of off-and-on effort, finally managed to get some working tests added - note that the current tests require permissions to write to /var/log since that's where the kube crate expects kubernetes logs to be. if that's a problem, I'm not sure how to get around that requirement

Maybe I don't know the implementation well enough, but could we use let data_dir = tempdir().unwrap(); and set include_paths_glob_patterns to that?

I'm afraid not - data_dir only controls where vector writes its checkpoint files, the location that vector will look at to find the actual input kubernetes log files for the source is hardcoded here: https://github.com/vectordotdev/vector/blob/689a65d51a2f8034f220afcf5e65e1b924019687/src/sources/kubernetes_logs/path_helpers.rs#L10

I see, is there anything preventing us from refactoring this? Tests requiring permissions to /var/log isn't great.

A hackier alternative is to provide different values based on #[cfg(test)] and #[cfg(not(test))].

That's fair - pushed up a change which plumbs through a temp dir for testing.

ganelo avatar May 15 '25 20:05 ganelo

Hey @pront - want to make sure you saw the latest changes here :)

ganelo avatar May 27 '25 18:05 ganelo

@pront - I hesitate to close this out after so much iteration, but I do want to be up front that we've struggled to enable this in our environment. The behavior that we've been observing is bursts of full throughput interspersed with long periods of absolutely no events coming through. From some reading of the original PR that added support for file-based acknowledgements (https://github.com/vectordotdev/vector/pull/7816), I attribute this to perf concerns with the underlying implementation of acknowledgements for the file source, which obviously this PR just extends to the kubernetes_logs source.

So, bottom line is that we will probably not use this feature, but since presumably folks were comfortable merging https://github.com/vectordotdev/vector/pull/7816 despite the perf concerns, I think this PR is in a similar boat. I leave it to you and/or other Vector devs as to whether to proceed with this PR as written or not.

ganelo avatar Jun 16 '25 15:06 ganelo

@pront - I hesitate to close this out after so much iteration, but I do want to be up front that we've struggled to enable this in our environment. The behavior that we've been observing is bursts of full throughput interspersed with long periods of absolutely no events coming through. From some reading of the original PR that added support for file-based acknowledgements (#7816), I attribute this to perf concerns with the underlying implementation of acknowledgements for the file source, which obviously this PR just extends to the kubernetes_logs source.

So, bottom line is that we will probably not use this feature, but since presumably folks were comfortable merging #7816 despite the perf concerns, I think this PR is in a similar boat. I leave it to you and/or other Vector devs as to whether to proceed with this PR as written or not.

Hi @ganelo, thank you for your efforts on this PR. The Vector team appreciates it. If nothing else, there's good test enhancement and code refactoring in this. Ideally we want those. Also, we still go ahead with it but we should add a note in the changelog and in the K8s logs source docs that this is in beta and mention the performance degradation you observed.

pront avatar Jun 20 '25 17:06 pront