VictoriaMetrics icon indicating copy to clipboard operation
VictoriaMetrics copied to clipboard

ingestion: make stream processing for influx protocol optional

Open hagen1778 opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? Please describe

VictoriaMetrics processes data ingested in influxdb protocol in a streaming fashion. This allows to reduce memory usage and improve ingestion performance at the same time. However, there are cases when streaming could become harmful. For example, clients like telegraf could start retrying ingesting batches of data if ingestion request failed in the middle. This would mean that VM already processed and persisted 50% of the batch, and now it would need to do this again on the next retry attempt from telegraf. In some cases, this situation can develop in a snowballing of ingest requests from telegraf which just can't be processed in full.

Describe the solution you'd like

It would be great to have an option to disable streaming mode for HTTP (not UDP) influx protocol. This is especially important for VM installations that run with strict limit on resources.

hagen1778 avatar Sep 25 '24 07:09 hagen1778

@AndrewChubatiuk , the stream parsing mode must be disabled by default for HTTP-based data ingestion for Influx line protocol, since this is the major source of issues for Telegraf users. It should support stream_mode=1 http query arg, which could enable stream parsing mode.

When stream parsing mode is disabled, then the whole request body must be read in memory before starting to parse it. The success response must be sent to the client (e.g. Telegraf) immediately after reading the request body. After that the request body must be parsed and saved in the storage. Partial request body must be rejected (for example, if Telegraf closes the connection in the middle of sending of request body because of timeout).

Stream parsing mode should remain enabled for TCP-based data ingestion, since it is common practice to send unlimited amounts of lines over a single TCP connection, so they should be parsed and pushed to the storage as fast as possible.

valyala avatar Sep 30 '24 09:09 valyala

When stream parsing mode is disabled, then the whole request body must be read in memory before starting to parse it. The success response must be sent to the client (e.g. Telegraf) immediately after reading the request body.

@valyala This would mean that if there are parsing errors - the client won't ever know about this. If we're doing it like batch-processing, should we return ok only after everything was parsed and scheduled for ingestion? This would have performance impact, of course. But this is expected behavior for per-batch processing, and something many users asked before in tickets like this https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4446.

hagen1778 avatar Oct 10 '24 09:10 hagen1778

This was discussed internally and we agreed that with batch processing VM should respond to user only after parsing is finished. The whole batch should be rejected and client should get error in response if at least one row in the batch failed to parse. In this case, VM should also respond with non-retryable status code 422 (http.StatusUnprocessableEntity) to the client. Unfortunately, telegraf doesn't have default settings for non-retryable requests, it needs to be configured manually - see non_retryable_statuscodes in https://github.com/influxdata/telegraf/blob/master/plugins/outputs/http/README.md#configuration. We should mention this in docs. The error message returned to the client should also mention possibles ways to mitigate this, like enable stream mode to ignore bad lines. cc @AndrewChubatiuk

hagen1778 avatar Oct 11 '24 10:10 hagen1778

@hagen1778 should response contains information about all failed rows or just one is enough?

AndrewChubatiuk avatar Oct 11 '24 11:10 AndrewChubatiuk

@hagen1778 should response contains information about all failed rows or just one is enough?

I think, the first encountered error is enough.

hagen1778 avatar Oct 11 '24 11:10 hagen1778

thought telegraf users relying on influxdb output, not http. This output doesn't retry any 4xx error

AndrewChubatiuk avatar Oct 11 '24 14:10 AndrewChubatiuk

Pull request https://github.com/VictoriaMetrics/VictoriaMetrics/pull/7165 has been merged and will be available in the next release. Starting from new release, ingestion via influx protocol will be processed in batch mode by default. To enable stream mode Stream-Mode: "1" HTTP header needs to be passed with each request. See more details at https://docs.victoriametrics.com/#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf

the stream parsing mode must be disabled by default for HTTP-based data ingestion for Influx line protocol, since this is the major source of issues for Telegraf users. It should support stream_mode=1 http query arg, which could enable stream parsing mode.

@valyala we switched to using HTTP header instead of query param because telegraf's output outputs.influxdb doesn't allow setting custom GET params (or overrides them if they set in URL). But it supports HTTP headers.

Re-opening this ticket until it is released.

hagen1778 avatar Oct 15 '24 09:10 hagen1778

The feature was included into v1.105.0

f41gh7 avatar Oct 21 '24 17:10 f41gh7