prometheus_remote_write: Fix cutoff logic.
We noticed that fluent-bit’s prometheus remote_write output plugin was silently dropping some, but not all, process_exporter metrics after about one hour while the stdout output plugin was still showing metrics being collected. We were also able to reduce the time after which metrics were being dropped by modifying CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_THRESHOLD, which indicates the problem is the cutoff logic. This merge-request treats CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_ERROR as success and continues encoding other metrics, so they do not get dropped. It might be worth dropping this “error” code entirely, since it’s not really an error and leads to subtle bugs like this one.
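To illustrate the intended behavior, here is a minimal self-contained sketch. It is not the actual cmetrics code: every name in it (`sketch_result`, `sketch_sample`, `sketch_encode_one`, `sketch_encode_all`) is a hypothetical stand-in, and the cutoff check is a simplified assumption about how a stale sample is detected. The point is only that a cutoff result skips the stale sample and lets the loop continue, while genuine errors still abort.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the encoder's result codes. */
enum sketch_result {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR   = 1,
    SKETCH_CUTOFF  = 2   /* sample is too old and was skipped; not a real failure */
};

/* Hypothetical sample: only the fields needed for the illustration. */
struct sketch_sample {
    uint64_t timestamp;  /* time of the last update, in seconds */
    int      broken;     /* non-zero simulates a genuine encoding failure */
};

/* Encode one sample, or report why it was not encoded. */
static int sketch_encode_one(const struct sketch_sample *sample,
                             uint64_t now, uint64_t threshold,
                             size_t *encoded)
{
    if (sample->broken) {
        return SKETCH_ERROR;
    }

    /* Assumed cutoff rule: samples older than the threshold are skipped. */
    if (sample->timestamp + threshold < now) {
        return SKETCH_CUTOFF;
    }

    (*encoded)++;

    return SKETCH_SUCCESS;
}

/* Encode all samples; a cutoff skips one sample instead of ending the loop. */
static int sketch_encode_all(const struct sketch_sample *samples, size_t count,
                             uint64_t now, uint64_t threshold,
                             size_t *encoded)
{
    size_t index;
    int    result;

    *encoded = 0;

    for (index = 0; index < count; index++) {
        result = sketch_encode_one(&samples[index], now, threshold, encoded);

        if (result == SKETCH_CUTOFF) {
            /* Stale sample: keep going so later metrics are not dropped. */
            continue;
        }

        if (result != SKETCH_SUCCESS) {
            return result;
        }
    }

    return SKETCH_SUCCESS;
}
```

With three samples where only the middle one is stale, this loop still encodes the first and the last. If the cutoff result were treated like any other error, everything after the stale sample would be lost, which matches the symptom described above.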
After merging this fix the bundled copy of cmetrics inside fluent-bit should be updated.
How is this information (whether some values were not transmitted due to the cutoff) used by fluent-bit? As far as I see cmt_encode_prometheus_remote_write_create always handled CMT_ENCODE_PROMETHEUS_REMOTE_WRITE_CUTOFF_ERROR like a success and never reported anything to the upper layers (like fluent-bit). Is this something that needs to be changed?
Currently, we don't use this error for reporting to fluent-bit plugins, for the sake of code simplicity. Having thought about this PR again, I realized that this change should be enough to handle the remaining cutoff cases.