crowdsec icon indicating copy to clipboard operation
crowdsec copied to clipboard

Add retry to VictoriaLogs data source

Open thebondo opened this issue 7 months ago • 8 comments

The current VictoriaLogs data source will exit if the data source goes away once a tail based connection has been established. This causes the whole crowdsec service to exit as well.

This update fixes the Tail method on the VictoriaLogs client so that it tries to reconnect with a backoff if the connection is lost.

Resolves #3653

thebondo avatar May 30 '25 13:05 thebondo

@thebondo: There are no 'kind' label on this PR. You need a 'kind' label to generate the release automatically.

  • /kind feature
  • /kind enhancement
  • /kind refactoring
  • /kind fix
  • /kind chore
  • /kind dependencies
Details

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

github-actions[bot] avatar May 30 '25 13:05 github-actions[bot]

@thebondo: There are no area labels on this PR. You can add as many areas as you see fit.

  • /area agent
  • /area local-api
  • /area cscli
  • /area appsec
  • /area security
  • /area configuration
Details

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

github-actions[bot] avatar May 30 '25 13:05 github-actions[bot]

/kind fix /area agent

thebondo avatar May 30 '25 13:05 thebondo

Hello @thebondo, do you think you will have some time to continue work on this PR? If not, I can take over and create a PR with your commit and get it ready for review (attributing you as an original author).

zekker6 avatar Jul 11 '25 04:07 zekker6

Hello @zekker6. Yes, I can still contribute on this. I have a question about the approach for the final fix.

As I understand it, the tail endpoint is expected to stay open and report log entries when received (after some possible delay, which helps keep entries sorted by time). However, the connection is sometimes lost. Ideally, the tail consumer would like to see all log entries exactly once. However, there does not seem to be a good way to start a new tail query that will both

  1. return all log entries received after that last query disconnected
  2. not return any log entries received before the last query disconnected

The solution I have so far is to keep track of the latest timestamp seen while the tail connection is open. If the connection is lost, then compute start_offset based on the time of the new query, the latest timestamp, and possible a small constant to account for potential delays in the request process.

This has the drawback that it can potentially provide a log entry more than once. From looking at the VictoriaLogs source, the start time for the tail interval appears to be calculated as time.Now() - offset - start_offset. That means that the consumer cannot specify the start time exactly because it involves the time the request is actually handled by VictoriaLogs (which is the reason for adding a small constant to start_offset).

Does this all sound correct to you? If so, I will update the code to use start_offset. I will also remove the use of the limit argument, which is not documented for the tail endpoint and does not appear to do anything either.

thebondo avatar Jul 17 '25 06:07 thebondo

@thebondo Thank you for an update!

Does this all sound correct to you? If so, I will update the code to use start_offset. I will also remove the use of the limit argument, which is not documented for the tail endpoint and does not appear to do anything either.

Yes, this sounds correct to me.

zekker6 avatar Jul 17 '25 10:07 zekker6

Hello @thebondo, I've recently noticed that my CrowdSec instance was affected by this and wanted to check if you're planning to finalize this PR any time soon? I would like to offer some help with updates discussed above if that will help.

zekker6 avatar Sep 12 '25 15:09 zekker6

For crowdsec 1.7.4+ we are in the process of converting datasources to a blocking Stream() API -- this means any error that is returned by the Stream function will automatically trigger a retry loop upstream. I'll see if anything needs to be handled differently for victorialogs, will keep this PR open but it's normally not the responsibility of the datasource module anymore.

mmetc avatar Nov 06 '25 12:11 mmetc