
DISCUSS: Standards

Open coolacid opened this issue 10 years ago • 20 comments

Given that this project will work on "drop and go" filters for devices by type (i.e. the input sets type to "ApacheCombined" and our filter is everything that needs to happen in Logstash for that type), we need to come up with a set of standards.

I'd like to discuss those standards here, e.g.:

Traffic Sources:

  • IP Address: src_ip or src.ip or ??
  • Port: src_port or ???

Some things to get started:

  • Traffic Sources (The source of some kind of traffic -- think firewall logs)
  • Traffic Destinations
  • Event Sources (The device that generated the event)
  • Counts (Packet, Bytes etc)

The concept is that the filter should modify any fields to match the agreed "standard". For example, a KV-formatted firewall log should have a stanza that renames fields to the correct "standard".
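As a sketch of what that stanza could look like (the source field names here are made up; a real firewall's KV keys will differ):

```conf
filter {
  # Parse the key=value pairs out of the raw message
  kv { source => "message" }

  # Rename vendor-specific fields to the agreed "standard" names
  mutate {
    rename => {
      "srcip" => "src_ip"
      "dstip" => "dst_ip"
      "sport" => "src_port"
      "dport" => "dst_port"
    }
  }
}
```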

Thoughts and other things welcomed.

coolacid avatar Jul 21 '14 20:07 coolacid

My own internal convention has largely been underscore-based (ie source_ip, etc) - this is mostly because I can add dynamic mappings in Elasticsearch easily, without caring what the full path to the field is or otherwise worrying with configuring ES to find the field properly, so I can ensure that fields named *_ip are stored as IPs with a raw field that's an unanalyzed string, while *_count fields are longs without a second component, etc.
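Something along these lines in the index template is what makes that work (a sketch; the exact multi-field syntax depends on your ES version):

```json
{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        { "ip_fields": {
            "match": "*_ip",
            "mapping": {
              "type": "ip",
              "fields": { "raw": { "type": "string", "index": "not_analyzed" } }
            }
        }},
        { "count_fields": {
            "match": "*_count",
            "mapping": { "type": "long" }
        }}
      ]
    }
  }
}
```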

torrancew avatar Jul 21 '14 20:07 torrancew

Ok - so underscores are important (vs. other options?). What about abbreviations, i.e. src vs. source?

What things do we need to consider in the standards?

coolacid avatar Jul 21 '14 20:07 coolacid

I generally opt for verbose/explicit where possible, so I'd tend to write "source_ip" (though I contradict myself and use "dest" rather than "destination", so take that with a grain of salt).

torrancew avatar Jul 21 '14 21:07 torrancew

Verbose vs. screen real estate -- this is why I opted for src_* instead.

More comments welcomed -- and send to your feeds ;)

coolacid avatar Jul 21 '14 21:07 coolacid

src and dest are pretty standard abbreviations in UNIX. Basically, if it's common in UNIX or Linux, then we might as well follow the convention. I would just avoid new abbreviations: I would prefer checksum over chksm, or rubydebug over rbdbg.

If we're just talking about logstash conf files, then I would opt for:

  1. all lower case
  2. Underscores (_) over hyphens (-)
  3. 4 space tabs. (Of course, heh)

Regarding fields I don't have as much of a preference. I don't usually change field names.

shurane avatar Jul 23 '14 04:07 shurane

@shurane I want to change field names -- when I search for, say, a src_ip, I'd like results from all logs for that IP. I wouldn't want to have to build a query with multiple fields to say the same thing.

coolacid avatar Jul 28 '14 14:07 coolacid

From a programmer's perspective I love having sub-fields (src.ip, src.port), but like @torrancew mentioned it's easier to match against fields (*_ip, *_port). Although I would assume that with the mappings these days it should be possible to do *.ip, *.port?

electrical avatar Jul 28 '14 14:07 electrical

I like the idea of sub-fields too - especially with the possibility of Kibana supporting a tree of fields (not saying it will/does), but it would be cool -- need to test *.ip and *.port somehow ;)
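If it helps the testing along, dynamic templates do accept a path_match, so in principle something like this should handle the sub-field case (untested sketch):

```json
{
  "dynamic_templates": [
    { "ip_subfields": {
        "path_match": "*.ip",
        "mapping": { "type": "ip" }
    }},
    { "port_subfields": {
        "path_match": "*.port",
        "mapping": { "type": "integer" }
    }}
  ]
}
```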

coolacid avatar Jul 28 '14 14:07 coolacid

@electrical Agreed. I'm sure there's a way to template sub-fields, but I've found extending the template to be tedious and painful, at best, and haven't had the patience to nail it down for sub-fields.

torrancew avatar Jul 29 '14 21:07 torrancew

@coolacid If you like the idea of sub-fields, and your intention is to be able to parse/normalize a great deal of different log types, you could think about prepending network-related stuff with "net.*".

You could generate fields like "net.blocked", "net.src.ip", "net.dst.port", "net.l4proto", for instance. This is how I organize the normalized logs in MLSec Project.

This gives me some taxonomy flexibility when I am enriching stuff with passiveDNS data or other sources.

alexcpsec avatar Aug 01 '14 15:08 alexcpsec

Convo with @untergeek on IRC suggests subfields will be fine - so that's what we'll go with.

Comments on the net.* header -- I don't think it's a bad idea; it would allow us to break out other specific items, which I had to do with things like AV engines.

coolacid avatar Aug 01 '14 18:08 coolacid

If it helps as a reference here's the json standard we settled on in MozDef: http://mozdef.readthedocs.org/en/latest/usage.html#json-format very similar to this discussion (cept I hate underscores, but I'm getting over it).

The most helpful part was separating tiers of the event into standard and custom/Detail fields. Standard ones (category, severity, etc) are at the top level of the json doc along with a human readable 'summary' field (think syslog MSG). Details are the things you would parse out of the MSG or tack on if you have a custom event source (like cloudtrail, auditd, compliance data, vulnerability data, etc)
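From that description, an event would look roughly like this (values invented for illustration; the linked docs have the real schema):

```json
{
  "category": "authentication",
  "severity": "INFO",
  "summary": "jdoe logged in from 10.0.0.5",
  "details": {
    "username": "jdoe",
    "src_ip": "10.0.0.5"
  }
}
```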

jeffbryner avatar Aug 08 '14 15:08 jeffbryner

Starting to put some suggestions in here:

https://github.com/coolacid/GettingStartedWithELK/wiki/Field-Standards

Feel free to start adding other ideas.

coolacid avatar Aug 14 '14 21:08 coolacid

Going to propose the following (beyond the above mentioned wiki describing Field data).

The Type field should be the type of device sending the data - i.e. Apache, nginx, or whatnot - primarily for filters in the Logstash pipeline.

The DataType field should be the type of data - i.e. Firewall, AntiVirus, WebLog - so that Kibana can push a single view for like data.
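In the Logstash pipeline that split might look like this (a sketch using the field names proposed above):

```conf
filter {
  # Type routes the event to the right parser...
  if [type] == "apache" {
    grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
    # ...while DataType groups like data for Kibana views
    mutate { add_field => { "DataType" => "WebLog" } }
  }
}
```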

coolacid avatar Sep 24 '14 12:09 coolacid

@coolacid I like your idea of the DataType field. I'm currently using the tags field for this purpose, but I can definitely see the benefit of creating a separate field for this.

I tend to agree with @torrancew about disliking uppercase field names. Keeping everything lowercase makes it simple, because you don't have to guess which characters are capped.

I also like the idea of subfields, but the structure will need to be thought out very carefully. I feel that currently existing standards should be adhered to as much as possible.

As far as the net.* goes, you could do it like this:

  • net.proto = protocol
  • net.proto.flags = protocol flags
  • net.src.int = source interface
  • net.src.ip = source ip
  • net.src.port = source port
  • net.dst.int = destination interface
  • net.dst.ip = destination ip
  • net.dst.port = destination port

But remember, there are protocols that don't have source or destination ports, like ICMP. In ICMP, you have a type and a code. What do we do in a case like this? Also, in the case of ICMP on a Cisco ASA, you're given faddr (foreign address), laddr (local address), and gaddr (global NAT address). It doesn't specify the actual source and destination in some cases. What do we do in that case, where we can't necessarily extract all the information we need out of a log message? This is especially the case with Cisco, where some log messages will provide plenty of data, whereas other log messages will provide a very minimal amount.

NAT will also potentially pose issues to a standard like this. There's going to be a lot to take into account.
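For what it's worth, feeding that net.* tree from Logstash is just a rename into nested field references, e.g. (source field names illustrative):

```conf
mutate {
  rename => {
    "srcip"   => "[net][src][ip]"
    "srcport" => "[net][src][port]"
    "dstip"   => "[net][dst][ip]"
    "dstport" => "[net][dst][port]"
  }
}
```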

mepholic avatar Sep 26 '14 21:09 mepholic

Other potential questions: How do you differentiate between a packet and a flow? Where do we put things like bytes, packet count (in the case of a connection or flow), flow duration? How about NAT or a stateful firewall connection ID?

One case is in my ongoing project here: https://github.com/mepholic/cisco-asa-ls-patterns/ With some of the patterns I currently have defined, I can extract log data on the HTTP and FTP inspection that the firewall performs. Included in this data is source and destination IP and port information, but it also has some extra info like the FTP user and file, or the HTTP URL. This is still technically a network event, as it was extracted from inspection logs, but it also contains application-layer data. Does anyone have any suggestions for standardizing field names for data like this?

mepholic avatar Sep 26 '14 21:09 mepholic

Encoding things in field names is generally a bad idea for obvious reasons, but one convention I am finding useful is to Capitalize (upper-case) the field names for data that is directly pulled from a log entry, and lower-case field names for derived data or metadata. Kibana, for example, then neatly separates the two types.

I am in the middle of a config for Exim email logs that will work in many different ways, e.g. for analytics or diagnostics, and there are a large number of email addresses, some with subtle differences in meaning. Some of those addresses are directly mentioned in the logs, which is great for diags, and some are derived to make things like throughput easy to graph (analytics). Being able to tell the difference at a glance is useful.

gerdesj avatar Oct 21 '14 00:10 gerdesj

@mepholic "How do you differentiate between a packet and a flow?"

You don't: A single packet is the shortest example of a flow!

"This is still technically a network event, as it was extracted from inspection logs, but it also contains application layer data"

You could tag these events by (OSI/ARPA) layer. ARPA is probably best although you have what is generally known as a layer 7 filter 8)

gerdesj avatar Oct 21 '14 00:10 gerdesj

Units: Do you put the units in the field name as a suffix or rely on documentation?

Is it bytes or bits? 1024 or 1000? You can rarely tell from inspection.

My personal preference is generally documentation.

gerdesj avatar Oct 21 '14 00:10 gerdesj

I guess this was done? https://github.com/elastic/ecs

coolacid avatar Jan 21 '19 17:01 coolacid