splunk-connect-for-syslog
grouping-by / aggregate feature log loss
Was the issue replicated by support? No
What is the SC4S version? 3.22.1
Is there a pcap available? No
Is the issue related to the customer's environment or a software issue? No
Is it related to data loss? Please explain (protocol? hardware specs?): Yes. I am using grouping-by, and the sum of all "repetition" fields is not the same when I change the timeout value in my rewriter.
Last chance index/Fallback index? Yes
Is the issue related to local customization? Yes
Do we have all the default indexes created? Yes
Describe the bug
When I change the timeout from 60 to 180, it divides both the _raw count and the repetition sum by 3. The repetition field is calculated each time a log matches.
To Reproduce
Steps to reproduce the behavior: change the timeout value.
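For readers skimming the thread, here is a minimal sketch of where the timeout in question lives in a grouping-by block. This is not the reporter's actual parser; the block name is made up, and the key and repetition value are assumptions based on details quoted later in this thread:

    # Hypothetical reconstruction of the kind of custom rewriter described above.
    block parser app-dest-rewrite-paloalto-grouping-sketch() {
        channel {
            parser {
                grouping-by(
                    # assumed flow key (quoted further down in this thread)
                    key("${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}")
                    aggregate(
                        # repetition = number of grouped messages minus the one that is emitted
                        value(".values.repetition" "$(- $(context-length) 1)")
                        inherit-mode(context)
                        tags("isGrouped")
                    )
                    # the value that was changed in this report:
                    # timeout(60)
                    timeout(180)
                );
            };
        };
    };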
Hi @olivierpas, if that's your custom parser and the problem is directly with the syslog-ng DSL, we can help on a best-effort basis. Please provide a minimal reproducible example.
Hello @mstopa-splunk, and thank you for your response. What do you need exactly? The logs use the pan:traffic sourcetype (Palo Alto).
@olivierpas I wondered if there was a way to recreate this in my lab, but let's start on your side.
So count(_raw) != sum(repetition). But both correctly dropped to ~30% after 10:00, when you changed the timeout. This looks correct, so why is changing the timeout important for this case?
If the problem is that count(_raw) != sum(repetition), please send the same graph, split by SC4S containers, to make sure that all your SC4S instances include app-dest-rewrite-paloalto_panos-d_fmt_hec_default.
If the problem is about the timeout, please explain.
@mstopa-splunk yes, the config is the same on all SC4S instances.
The only issue is the difference in volume after the change of the timeout value.
I was wrong in my first request.
The expected result is for sum(total) to be higher than the count(_raw) value when changing the timeout from 60 to 180.
The goal is to aggregate similar logs, not to lose them.
Take a look at what I might be missing here.
I tried to make a minimal reproducible example:
    block parser app-dest-test-grouping-by() {
        channel {
            rewrite {
                # route the test events to a dedicated sourcetype
                r_set_splunk_dest_default(sourcetype("test:grouping-by"));
                set("t_kv_values", value(".splunk.sc4s_template"));
            };
            parser {
                grouping-by(
                    key("${HOST}")
                    aggregate(
                        # repetition = number of messages in the context minus one
                        value(".values.repetition" "$(- $(context-length) 1)")
                        inherit-mode(context)
                        tags("isGrouped")
                    )
                    # timeout(2)
                    # timeout(6)
                    timeout(12)
                );
            };
        };
    };

    application app-dest-test-grouping-by[sc4s-lp-dest-format-d_hec_fmt] {
        parser {
            app-dest-test-grouping-by();
        };
    };
Then I sent one event per second, but every 10 seconds I took a 10-second break:
    #!/bin/bash
    # Send one event per second to the local SC4S UDP listener,
    # pausing for 10 seconds after every 10 events.
    for ((hour=0; hour<1; hour++)); do
        for ((minute=0; minute<60; minute++)); do
            for ((second=0; second<60; second++)); do
                if ((second > 0 && second % 10 == 0)); then
                    sleep 10
                else
                    sleep 1
                fi
                echo $second
                echo "hello world" > /dev/udp/0.0.0.0/514
            done
        done
    done
timeout=2 works well:
timeout=6 works well:
timeout=12 is never reached, because the maximum break is only 10 seconds, so the context keeps accumulating (https://axoflow.com/docs/axosyslog-core/chapter-correlating-log-messages/grouping-by-parser/grouping-by-parser-options/#grouping-by-parser-timeout)
Is it possible that some of your aggregations for "${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}" are for events with no breaks >= 180 seconds, so they are stuck and never reach Splunk because the timeout is reset with every new event?
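To illustrate the scenario described in that question, here is a sketch assuming a flow that logs roughly once per minute (the rate is illustrative, not taken from the reporter's data):

    parser {
        grouping-by(
            key("${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}")
            aggregate(
                value(".values.repetition" "$(- $(context-length) 1)")
                inherit-mode(context)
                tags("isGrouped")
            )
            # A context is flushed only after `timeout` seconds with no new event
            # on the same key. A flow that logs every 60-170 seconds is never
            # silent for 180 seconds, so with timeout(180) its context keeps
            # growing and nothing reaches Splunk while the flow stays active;
            # with timeout(60) every gap longer than a minute flushes it.
            timeout(180)
        );
    };

The practical consequence is that the timeout needs to stay shorter than the longest silence expected between events of the same flow; otherwise the aggregate is only released once the flow finally goes quiet.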
Hello, thank you for your work. So your suggestion is that syslog-ng never releases the aggregated log because the timeout is too high? OK, I will try to reduce it and let you know.
So your suggestion is that syslog-ng never releases the aggregated log because the timeout is too high?
Exactly. When you use my conf and script but never take a break, syslog-ng aggregates forever. When you pause the sending process, syslog-ng waits until the timeout expires and only then releases the aggregated message. Every new event resets the timeout, and this is confirmed in their docs.
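To make the timing in that test concrete (one event per second, a 10-second pause every 10 events, per the script earlier in this thread), the same parser annotated with what each timeout value does:

    parser {
        grouping-by(
            key("${HOST}")
            aggregate(
                value(".values.repetition" "$(- $(context-length) 1)")
                inherit-mode(context)
                tags("isGrouped")
            )
            # timeout(2)  -> the 1 s gaps never flush, but every 10 s pause does
            # timeout(6)  -> same: only the 10 s pauses flush the context
            timeout(12)   # no gap ever reaches 12 s, so the context is never flushed
        );
    };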
All right, for now I'm closing this issue as solved.