
grouping-by / aggregate feature log loss

Open · olivierpas opened this issue 11 months ago · 4 comments

Was the issue replicated by support? No
What is the sc4s version? 3.22.1
Is there a pcap available? No
Is the issue related to the customer's environment or is it a software issue? No
Is it related to data loss? Please explain (protocol? hardware specs?). Yes: I am using grouping-by, and the sum of the "repetition" field is not equivalent when I change the timeout value in my rewriter.

Last chance index/Fallback index? Yes
Is the issue related to local customization? Yes
Do we have all the default indexes created? Yes

Describe the bug

When I change the timeout from 60 to 180, it divides both the _raw count and the repetition sum by 3. The repetition field is calculated each time a log matches.

[screenshot]

To Reproduce
Steps to reproduce the behavior: change the timeout value.

app-dest-rewrite-pan_panos.conf.txt
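For readers without the attachment, here is a hypothetical sketch of the kind of grouping-by stanza under discussion. The block name follows the attached file name, the flow key is the one quoted by mstopa-splunk later in this thread, and the timeout is the original value of 60; the actual attached config may differ:

block parser app-dest-rewrite-pan_panos() {
    channel {
        parser {
            grouping-by (
                # one context per network flow
                key("${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}")
                aggregate(
                    # repetition = grouped events beyond the one being emitted
                    value(".values.repetition" "$(- $(context-length) 1)")
                    inherit-mode(context)
                    tags("isGrouped")
                )
                # inactivity timeout in seconds; changing 60 -> 180 is what
                # preceded the volume drop reported above
                timeout(60)
            );
        };
    };
};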

olivierpas · Mar 06 '24

Hi @olivierpas, if that's your custom parser and the problem is directly with the syslog-ng DSL, we can help on a best-effort basis. Please provide a minimal reproducible example.

mstopa-splunk · Mar 08 '24

Hello @mstopa-splunk, and thank you for your response. What do you need exactly? The logs use the pan:traffic sourcetype (Palo Alto).

olivierpas · Mar 08 '24

@olivierpas I wondered if there was a way to recreate this in my lab but let's start on your side.

So count(_raw) != sum(repetition). But both correctly dropped to ~30% after 10:00, when you changed the timeout. This looks correct, so why is changing the timeout important for this case?

If the problem is that count(_raw) != sum(repetition), please send the same graph split by SC4S container, to make sure that all your SC4S instances include app-dest-rewrite-paloalto_panos-d_fmt_hec_default.

If the problem is about the timeout, please explain.

mstopa-splunk · Mar 08 '24

@mstopa-splunk yes, the config is the same on all SC4S instances. The only issue is the difference in volume after the change of the timeout value. I was wrong in my first request: the expected result is that sum(total) becomes larger than the count(_raw) value when changing the timeout from 60 to 180. The goal is to aggregate similar logs, not to lose them.

[screenshot]

olivierpas · Mar 08 '24

Take a look at what I might be missing here:

  1. I tried to make a minimal reproducible example:
block parser app-dest-test-grouping-by() {
    channel {
        rewrite {
            r_set_splunk_dest_default(sourcetype("test:grouping-by"));
            set("t_kv_values", value(".splunk.sc4s_template"));
        };

        parser {
            grouping-by (
                key("${HOST}")
                aggregate(
                    value(".values.repetition" "$(- $(context-length) 1)")
                    inherit-mode(context)
                    tags("isGrouped")
                )
                # timeout(2)
                # timeout(6)
                timeout(12)
            );
        };
    };
};

application app-dest-test-grouping-by[sc4s-lp-dest-format-d_hec_fmt] {
    parser {
       app-dest-test-grouping-by();
    };
};

Then I sent one event per second, but every 10 seconds I took a 10-second break:

#!/bin/bash

# Send "hello world" over UDP to the local syslog listener roughly once per
# second, pausing for 10 seconds after every 10th event.
for ((hour=0; hour<1; hour++)); do
    for ((minute=0; minute<60; minute++)); do
        for ((second=0; second<60; second++)); do
            if ((second > 0 && second % 10 == 0)); then
                sleep 10    # simulated break in the event stream
            else
                sleep 1
            fi
            echo $second
            echo "hello world" > /dev/udp/0.0.0.0/514
        done
    done
done

timeout=2 works as expected: [screenshot]

timeout=6 works as expected: [screenshot]

timeout=12 is never reached because the maximum break between events is 10 seconds, so syslog-ng keeps accumulating: the timeout is an inactivity timer that is reset by every new matching event (https://axoflow.com/docs/axosyslog-core/chapter-correlating-log-messages/grouping-by-parser/grouping-by-parser-options/#grouping-by-parser-timeout).

Is it possible that some of your aggregations for "${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}" are for flows that never go quiet for >= 180 seconds? Those contexts would be stuck and never reach Splunk, because the timeout is reset with every new event.

mstopa-splunk · Mar 11 '24

Hello, thank you for your work. So your suggestion is that syslog-ng never releases the aggregated log because the timeout is too high? OK, I will try to reduce it and let you know.

olivierpas · Mar 11 '24

So your suggestion is that syslog-ng never releases the aggregated log because the timeout is too high?

Exactly. When you use my conf and script but never take a break, syslog-ng aggregates forever. When you pause the sending process, syslog-ng waits until the timeout expires and only then releases the aggregated message. Every new event resets the timeout, and this is confirmed in their docs.
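To make the fix concrete, here is a hedged sketch of the mitigation implied above, reusing the flow key and aggregate from this thread: keep the inactivity timeout below the longest quiet period you expect on a busy flow, so contexts actually expire and get flushed to Splunk. The value of 30 is a placeholder, not a recommendation:

parser {
    grouping-by (
        key("${.values.src_ip}/${.values.dest_ip}/${.values.src_port}/${.values.dest_port}")
        aggregate(
            value(".values.repetition" "$(- $(context-length) 1)")
            inherit-mode(context)
            tags("isGrouped")
        )
        # timeout() is an inactivity timer reset by every matching event;
        # a flow whose gaps are always shorter than the timeout never expires,
        # so pick a value below the typical inter-event gap (placeholder).
        timeout(30)
    );
};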

mstopa-splunk · Mar 11 '24

All right, for now I'm closing this issue as solved.

mstopa-splunk · Mar 12 '24