
CrowdSec exits if the VictoriaLogs data source goes down temporarily

[Open] thebondo opened this issue 7 months ago • 6 comments

What happened?

I have a setup that uses fluent-bit to collect logs and ship them to a central VictoriaLogs instance, with CrowdSec ingesting logs from VictoriaLogs. Everything works great when started in the right order; CrowdSec will only start if VictoriaLogs is already running.

However, if VictoriaLogs goes down temporarily, CrowdSec dies.

What did you expect to happen?

I expected CrowdSec to notice the lost connection and at least retry a few times before dying. Ideally it would just keep retrying with some kind of back-off strategy.

How can we reproduce it (as minimally and precisely as possible)?

Requires a working docker-compose setup.

  1. Create a working folder.
  2. Create a file vlogs.yaml with these contents:
source: victorialogs
mode: tail
log_level: info
url: http://vlogs:9428/
limit: 10
query: '*'
labels:
  type: other
  3. Create a file compose.yaml with these contents:
services:
  vlogs:
    image: victoriametrics/victoria-logs:v1.23.2-victorialogs
  fluentbit:
    image: fluent/fluent-bit:4.0.2
    depends_on:
      - "vlogs"
    command:
      - "-i dummy"
      - "-p dummy='{_msg=\"This is a test message\"}'"
      - "-o http"
      - "-p host=vlogs"
      - "-p port=9428"
      - "-p uri=/insert/jsonline?_stream_fields=host,service&_msg_field=log&_time_field=date"
      - "-p format=json_lines"
      - "-p json_date_format=iso8601"
  crowdsec:
    image: crowdsecurity/crowdsec:v1.6.8
    depends_on:
      - "vlogs"
    volumes:
      - ./vlogs.yaml:/etc/crowdsec/acquis.d/vlogs.yaml
  4. Run docker-compose up

This should start three containers, with fluent-bit generating test log messages, VictoriaLogs collecting them, and CrowdSec ingesting them. CrowdSec takes a moment to initialize, but eventually you should see a line from the fluent-bit container for each record.

  5. At that point, run docker stop in another terminal to stop the VictoriaLogs container.

You should see the VictoriaLogs container stop after a short delay (it tries to quit gracefully, but the CrowdSec request is still open, so it waits for a timeout). Just after VictoriaLogs exits, the CrowdSec container will also exit with an error. You will see fluent-bit chugging along, trying to send data but failing.

  6. Clean up by running docker-compose down

Anything else we need to know?

I have forked crowdsecurity/crowdsec and created a branch, victorialogs-retry-bug, with a fix that is working for me.

Crowdsec version

$ cscli version
version: v1.6.8-f209766e
Codename: alphaga
BuildDate: 2025-03-25_15:56:53
GoVersion: 1.24.1
Platform: docker
libre2: C++
User-Agent: crowdsec/v1.6.8-f209766e-docker
Constraint_parser: >= 1.0, <= 3.0
Constraint_scenario: >= 1.0, <= 3.0
Constraint_api: v1
Constraint_acquis: >= 1.0, < 2.0
Built-in optional components: cscli_setup, datasource_appsec, datasource_cloudwatch, datasource_docker, datasource_file, datasource_http, datasource_journalctl, datasource_k8s-audit, datasource_kafka, datasource_kinesis, datasource_loki, datasource_s3, datasource_syslog, datasource_victorialogs, datasource_wineventlog

OS version

These are from the host running Docker. My actual setup does not use Docker (all the programs run natively), and the same thing happens there as well.
# On Linux:
$ cat /etc/os-release
NAME="Pop!_OS"
VERSION="22.04 LTS"
ID=pop
ID_LIKE="ubuntu debian"
PRETTY_NAME="Pop!_OS 22.04 LTS"
VERSION_ID="22.04"
HOME_URL="https://pop.system76.com"
SUPPORT_URL="https://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=jammy
UBUNTU_CODENAME=jammy
LOGO=distributor-logo-pop-os
$ uname -a
Linux holmes 6.12.10-76061203-generic #202412060638~1743109366~22.04~1fce33b SMP PREEMPT_DYNAMIC Thu M x86_64 x86_64 x86_64 GNU/Linux

Enabled collections and parsers

Acquisition config

No response

Config show

No response

Prometheus metrics

No response

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

No response

thebondo · May 30 '25

@thebondo: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

github-actions[bot] · May 30 '25

I am willing to help by creating a pull request. I have a working solution running on my local machines (this is all just on personal machines).

@zekker6 You authored the original addition of the VictoriaLogs data source. Any thoughts? @blotus I asked about this on Discord, and you mentioned that the existing behavior was not the desired behavior.

Fork with Fixes

thebondo · May 30 '25

@thebondo Thank you for the detailed description of the issue and for suggesting a fix!

The overall idea of retrying tail requests looks good to me. Implementation-wise, lc.currentTickerInterval should not be used in tail mode. The ticker logic is used when running the streaming acquisition with mode: cat; in that case the client polls for results and uses the ticker to keep track of polling intervals.

With live tailing (streaming mode) this is not needed, since the response streams results as soon as they appear; if there are no results, the connection simply stays open until results become available. I would suggest replacing the ticker-based error-handling logic with the shouldRetry() function. Since this is not obvious, it would also be great to add a comment on the ticker field explaining that it is only meant for range queries. I would be happy to send a separate PR for this if you prefer, just let me know.
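
For illustration, a minimal sketch of that shape: a tail that blocks while the stream is healthy and reconnects on transient errors. Everything here is a hypothetical stand-in (tailLoop, connect, the one-second pause), and the real shouldRetry() in vl_client.go may have a different signature:

package vlretry

import (
	"context"
	"time"
)

// tailLoop keeps a streaming tail alive: connect blocks while the stream is
// healthy, and on a transient error we reconnect instead of exiting.
func tailLoop(ctx context.Context, connect func(context.Context) error, shouldRetry func(error) bool) error {
	for {
		err := connect(ctx) // blocks until the connection is lost
		if err == nil || !shouldRetry(err) {
			return err // clean shutdown or a permanent failure
		}
		// Transient failure (e.g. VictoriaLogs restarting): pause, then retry.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}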

zekker6 · May 30 '25

@zekker6 Thanks for the feedback.

I have updated the retry method to not use a ticker, that makes sense. I avoided using shouldRetry because I believe the goal is to have it retry "forever". Instead, I added a variable local to doTail (backoffInterval) that is used with time.After if the connection is lost (readResponse returns). To keep the logic a bit simpler, I added the backoff wait to the beginning of the loop so the various failure checks could still use continue to start the next cycle.
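
Here is a minimal, self-contained sketch of that shape. Everything in it (tailOnce, the one-second initial interval, the 30-second cap) is hypothetical and stands in for the actual PR code:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// tailOnce stands in for the real readResponse call: it blocks while the
// stream is healthy and returns once the connection is lost.
func tailOnce(ctx context.Context) error {
	return errors.New("connection lost") // simulate an outage
}

// tailWithBackoff retries forever. The backoff wait sits at the top of the
// loop so every failure branch below can simply `continue`.
func tailWithBackoff(ctx context.Context) error {
	const maxBackoff = 30 * time.Second
	backoffInterval := time.Duration(0) // no wait on the first pass
	for {
		if backoffInterval > 0 {
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoffInterval):
			}
		}
		if err := tailOnce(ctx); err != nil {
			fmt.Println("tail failed, retrying:", err)
			if backoffInterval == 0 {
				backoffInterval = time.Second
			} else if backoffInterval < maxBackoff {
				backoffInterval *= 2 // exponential backoff, capped
			}
			continue
		}
		backoffInterval = 0 // stream was healthy again: reset the backoff
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(tailWithBackoff(ctx))
}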

I have also created a pull request for this.

Also, sorry if the process I am using is weird. I have never contributed to any open source projects before.

thebondo · May 30 '25

@zekker6 My fix still needs work, and I have a question about VictoriaLogs that you may be able to answer.

The /select/logsql/tail endpoint does not appear to support the start parameter for setting the point from which logs are returned; I had mixed up the VictoriaMetrics and VictoriaLogs documentation. So it looks like there is no way to say "tail query starting at time T", only "tail query with a starting offset of duration D". Is that correct? If so, I will need to adjust the query to use start_offset with a duration instead of start with a time.

My bigger question is, does the tail endpoint always return logs in time order?

thebondo · Jun 05 '25

@thebondo

So it looks like there is no way to say "tail query starting at time T", but only "tail query with a starting offset of duration D". Is that correct?

Yes, this is correct. Given that, it seems like this line is obsolete and should be replaced with something like the diff below (the t assignment becomes unused once start is dropped, so it has to go as well, otherwise Go will refuse to compile):

diff --git c/pkg/acquisition/modules/victorialogs/internal/vlclient/vl_client.go i/pkg/acquisition/modules/victorialogs/internal/vlclient/vl_client.go
index 8e1f9a28..ae3342db 100644
--- c/pkg/acquisition/modules/victorialogs/internal/vlclient/vl_client.go
+++ i/pkg/acquisition/modules/victorialogs/internal/vlclient/vl_client.go
@@ -336,9 +336,8 @@ func (lc *VLClient) doTail(ctx context.Context, uri string, c chan *Log) error {
 func (lc *VLClient) Tail(ctx context.Context) (chan *Log, error) {
-       t := time.Now().Add(-1 * lc.config.Since)
        u := lc.getURLFor("select/logsql/tail", map[string]string{
-               "limit": strconv.Itoa(lc.config.Limit),
-               "start": t.Format(time.RFC3339Nano),
-               "query": lc.config.Query,
+               "limit":        strconv.Itoa(lc.config.Limit),
+               "start_offset": lc.config.Since.String(),
+               "query":        lc.config.Query,
        })
 
        c := make(chan *Log)

My bigger question is, does the tail endpoint always return logs in time order?

Logs are always sorted by time, but it is possible for items with older timestamps to appear out of order. This can happen when log collection and delivery experience delays. Tailing stores the timestamp of the last returned log item for each unique stream; if a tailing request matches multiple log streams and data for one of those streams is delayed, the older logs can appear in between logs with more recent timestamps.
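
To make that concrete, here is a small hypothetical illustration (the streams, timestamps, and dedup logic are invented for the example; this is not VictoriaLogs code). A consumer that deduplicates must keep a high-water mark per stream, not one global one, or a delayed stream's older entries would be wrongly dropped:

package main

import (
	"fmt"
	"time"
)

type logEntry struct {
	stream string
	ts     time.Time
	msg    string
}

func main() {
	base := time.Date(2025, 6, 6, 0, 0, 0, 0, time.UTC)
	// Stream "b" is delayed: its older entry arrives after a newer "a" entry.
	arrivals := []logEntry{
		{"a", base.Add(1 * time.Second), "a1"},
		{"a", base.Add(3 * time.Second), "a2"},
		{"b", base.Add(2 * time.Second), "b1 (late, but not a duplicate)"},
	}
	lastSeen := map[string]time.Time{} // per-stream high-water marks
	for _, e := range arrivals {
		if !e.ts.After(lastSeen[e.stream]) {
			fmt.Println("skip duplicate:", e.msg)
			continue
		}
		lastSeen[e.stream] = e.ts
		fmt.Printf("process %-2s %s %s\n", e.stream, e.ts.Format(time.RFC3339), e.msg)
	}
}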

zekker6 · Jun 06 '25