Suggest fields to remove from datamodel
So, because almost all of the data, including items which do not summarize well, is included in the data models, the data models that get created by this app are way too big. By "too big", I mean they are generally larger than the original data. This can cause some major retention issues.
This is similar to issue #7. In 5.0.1 we reduced the number of fields in the datamodel to address performance and retention concerns. The Splunk Datamodel team helped in that effort to optimize the datamodel in the App. However, since each customer is different, it's very difficult to know which fields are more or less important to have in the datamodel. We tried to strike a good balance, then allow administrators to remove unneeded fields or add missing ones.
Are there specific fields you think should be removed from the datamodel in the App?
I feel like the session_id and event_if fields are generally not used and contribute to the problem. On more than one occasion I have seen instances where the DM accelerations for the Palo Alto data exceed the size of the indexed data and cause space issues. I'm sure there are others that do not need to be in the data model for the app to function.
The session_id field, while not easily summarized, is the only way to directly correlate logs in the same session. It's used in a few dashboards in the App, so unfortunately it can't really be removed. Not sure which is the 'event_if' field, I don't see that in the datamodel or log format.
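To illustrate the correlation use case, cross-log-type joins on session_id look roughly like this (a hedged SPL sketch; the index and sourcetype names are assumptions based on typical PAN App setups, not taken from the App's actual dashboards):

```
index=pan_logs (sourcetype="pan:traffic" OR sourcetype="pan:threat")
| stats values(sourcetype) AS log_types values(action) AS actions BY session_id
```

Without session_id in the datamodel, this kind of search has no accelerated equivalent.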
I'm open to more recommendations for fields to remove. Happy to look at any and all suggestions.
Just bumping this thread. I'll leave this open for now as a forum for suggested fields to remove from the datamodel.
If you'll be at Splunk .conf 2016, come find me at the Palo Alto Networks booth to talk more in person!
Sorry, haven't had a good opportunity to look through it. Really, any of the extraneous fields that have high cardinality. I will be at conf. See you there!
Brian,
Did the suggestions Sanford Owings made to your data model after looking at my data ~2 years ago get made? Those should have significantly shrunk the Data Model.
...Chris
Chris Kurtz, Arizona State University (Personal Account)
Hey Chris, thanks for reaching out. Yes indeed, we took all the feedback and even vetted the datamodel with Splunk's Datamodel Product Management team to see if there were any other optimizations we could make. All those changes were released, so the latest versions of the App include them.
Thanks Brian, see you at conf!
No new comments for many months. Closing this issue. Feel free to comment even after closed. Thanks!
This remains an issue. The accelerated data model size for this data tends to be very large, sometimes larger than the source data. I'm afraid I don't have the time to dig into the data model and review all of the fields, but I will say that at most of my customers I have to reduce the retention period significantly (sometimes as low as 7 days).
This has been an issue for several years. I don't believe that any of the changes Sanford Owings and I suggested were ever included.
Personally, we have avoided using the data model because of this, and now run it with extremely short retention.
Hello. I'm happy to re-open and reexamine this. Unfortunately the suggestion from @automine to remove session_id isn't possible because session_id is the only correlation point between the different log types. @xoff00 I believe your suggestions were implemented in 5.0.1, but please let us know if we missed any.
Splunk has addressed large summary indexes by adding the TSIDX Reduction feature in Splunk 6.4, which improves retention of summary indexes (like datamodel acceleration). More information at this link; take a look and see if it will work for your environment: http://docs.splunk.com/Documentation/Splunk/6.4.0/Indexer/Reducetsidxdiskusage
What I'm looking for are concrete suggestions on fields to remove. We'll take any suggestion, so tell us which fields you never use and they will be considered for removal from the datamodel.
Thanks! -Brian
That feature doesn't do what you think it does. It drops the metadata from indexed data to reduce the space used by the indexes, but it isn't meant for use with accelerated data models, and I don't believe it will help reduce the DMA (data model acceleration) size. I'll see if I can get a look at the data model in the app to try to address the issues; however, my local deployment doesn't have live PAN data.
You're right, TSIDX Reduction does not apply to datamodel acceleration. Thanks for the correction.
Looking forward to your help in reducing the fields in the datamodel. Let me know what you come up with. If it helps, we have a docker image with demo data you can look at. If you have Docker installed, just type:

```
docker run -d --name splunk-demo -p 8000:8000 btorresgil/splunk-panw-demo
```

Then connect to http://localhost:8000

And to remove and clean up the container:

```
docker rm -fv splunk-demo
```

Hope that helps!
Just some anecdotal evidence: we checked one of our customers' deployments to grab some numbers. They have 30 days of PAN data, which takes up about 2 TB of space on disk. The acceleration for the pan_firewall DM, accelerated for only 7 days, takes 2.6 TB.
Hi guys, I've been investigating this and have some findings, ideas, and questions.
Findings:
- There are some fields that are functionally duplicate. We'll remove the extra data that can be derived from other data.
- There are several fields from lookup tables that are getting summarized in the datamodel. This likely explains why the summary index is larger than the original index: the lookup data is getting pulled into each event and summarized.
Ideas: I'm considering removing the following fields from the datamodel:
- cmd (duplicates command)
- serial (duplicates serial_number)
- vsys (duplicates virtual_system)
- rule_name (duplicates rule)
- tag
- bytes_in
- bytes_out
- packets_in
- packets_out
- src_class (from a lookup table)
- dest_class (from a lookup table)
- dvc
- misc
- major_content_type
- app: capable of file transfer (from a lookup table)
- app: evasive (from a lookup table)
- app: excessive bandwidth (from a lookup table)
- app: has known vulnerability (from a lookup table)
- app: pervasive use (from a lookup table)
- app: prone to misuse (from a lookup table)
- app: tunnels other apps (from a lookup table)
- app: used by malware (from a lookup table)
- protocol (duplicates transport)
- server_ip or src_ip (also location, etc)
- src/dest_interface
- src/dest_zone
- src/dest_translated_ip/port
- src/dest_ip (or src/dest?)
Questions:

1. Which of these fields would you like to keep in the datamodel? If you don't speak up, they will be removed. I'm especially curious about src/dest_interface, src/dest_zone, and src/dest_translated_ip/port. Do you use these?
2. For the bytes and packets fields, we could keep `bytes` and `packets`, or we could keep the more specific fields `bytes_in`, `bytes_out`, `packets_in`, and `packets_out`. Would you rather save space with the less specific fields, or would you rather know the in and out specifics at the expense of disk space? Keep in mind these are high-cardinality fields.
3. I'm planning to remove half of the `app:` fields and leave some that are more commonly used. I could remove them all, which would decrease the summary index size but have 2 consequences: you would have to use the `inputlookup` command to use this app metadata in a dashboard, and you would not be able to use the pivot feature to see any of this data.
4. Right now all firewall logs are in the `pan_firewall` datamodel. Would it be beneficial to break the SYSTEM and CONFIG logs into a separate `pan_firewall_operations` datamodel? This would allow your high rate logs (`pan_traffic` and `pan_threat`) to have a different acceleration retention setting than low rate operational logs (`pan_system` and `pan_config`). Would this be useful or helpful?
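For question 3, pulling the app metadata from a lookup instead of the datamodel would look roughly like this (a sketch; the lookup file name and field names here are placeholders, not necessarily the App's actual lookup):

```
| inputlookup app_metadata.csv
| search used_by_malware="yes"
| table app used_by_malware tunnels_other_apps
```

The trade-off is that this metadata would only be reachable from a raw search or dashboard, not from pivot or an accelerated tstats query.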
Let me know your thoughts on the above. Thanks! -Brian
I think that will help, but the overall problem is the high cardinality of the fields that are included in the data model. I checked out the docker container, but I don't think the unique ID fields are being randomized (I may be wrong; I had limited time to take a look), which means that the cardinality of the data isn't going to be sufficient to reflect what happens IRL.
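As a quick way to sanity-check that, here is a hedged sketch for measuring how unique an ID field actually is in a dataset (index and sourcetype names are assumptions for a typical deployment):

```
index=pan_logs sourcetype="pan:traffic"
| stats dc(session_id) AS unique_sessions count AS events
| eval uniqueness=round(unique_sessions/events, 3)
```

A uniqueness close to 1.0 means nearly every event carries a distinct value, which is the worst case for summary index size.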
These seem like good choices. I'm not sure we'd ever use the src subfields at all, especially in the Model.
I might swap the common bytes field for bytes_in and bytes_out (or just keep all 3), as those would give you a better indication of direction for netcat-style reverse tunnels, etc. I think packets_(in|out) is of more limited use and packets is fine.
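For example, keeping bytes_in/bytes_out makes a directionality check like this possible (a sketch against the CIM Network_Traffic datamodel; the ratio threshold is an arbitrary example):

```
| tstats summariesonly=t sum(All_Traffic.bytes_in) AS bytes_in sum(All_Traffic.bytes_out) AS bytes_out from datamodel=Network_Traffic by All_Traffic.src_ip All_Traffic.dest_ip
| eval out_in_ratio=round(bytes_out/(bytes_in+1), 2)
| where out_in_ratio > 10
```

With only the combined bytes field, this kind of asymmetry is invisible.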
...Chris
I am way late to this discussion and am not using the app, but I would point out some use cases from the Network_Traffic CIM:

1. session_id: as pointed out, this correlates log records for the same session. I have noticed that my PAN firewall can generate several start and end events for the same session, so session_id is critical to summarize each connection.
2. bytes_out, bytes_in: I have been using these values to benchmark typical traffic in my network. For example, I can use the following query to track normal sent traffic to a destination and use that as the basis for detecting traffic that deviates heavily from normal:

   ```
   | tstats summariesonly=t max(bytes_out) as bytes_out max(bytes_in) as bytes_in from datamodel=Network_Traffic.All_Traffic by session_id src_ip dest_ip src_port dest_port
   | stats avg(bytes_out) as avgBytesOut stdev(bytes_out) as stdevBytesOut by dest_port
   ```

3. dvc: I use this as a way to distinguish remote offices with PAN firewalls that we manage. I can then tell that traffic is for a remote office vs. the home office and easily create reports for each.
4. src/dest zone: critical for understanding the flow of traffic between network segments.
Like I said, I use the CIM Network_Traffic datamodel, so the PAN App datamodel change would not affect me directly (though it might affect my Network Infrastructure team, whom I am trying to get to use the PAN App). I definitely agree with removing the duplicates, though. We can learn to use what is available, and if Splunk does not support aliases on datamodel fields, it probably should.