= drain java

This is a pet project to explore log pattern extraction using DRAIN.
image:https://github.com/bric3/drain-java/actions/workflows/gradle.yml/badge.svg[Java CI with Gradle,link=https://github.com/bric3/drain-java/actions/workflows/gradle.yml]
== Introduction
drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java identifies similar tokens and updates the matching cluster with a new template, or simply creates a new token cluster. Each time a cluster is matched, its counter is incremented.

These clusters are stored in a prefix tree, which is somewhat similar to a trie, but here the tree has a fixed depth in order to avoid long tree traversals. Avoiding deep trees also helps to keep the tree balanced.
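For example, two messages that differ only by one variable token end up in the same cluster, where the varying position is replaced by the `<*>` wildcard. The following sketch uses the builder API shown further down in this README; the log lines are made up, and the printed form of a cluster depends on its `toString()` implementation.

[source, java]
----
var drain = Drain.drainBuilder()
                 .depth(4)
                 .build();
// two similar messages, only the IP address changes
drain.parseLogMessage("Connection closed by 10.0.0.1 [preauth]");
drain.parseLogMessage("Connection closed by 10.0.0.2 [preauth]");
// both messages are expected to fall into a single cluster whose template
// is roughly: Connection closed by <*> [preauth]   (matched 2 times)
drain.clusters().forEach(System.out::println);
----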
== Usage
First, https://foojay.io/almanac/jdk-11/[Java 11] is required to run drain-java.
=== As a dependency
You can consume drain-java as a dependency in your project via the `io.github.bric3.drain:drain-java-core` coordinates.
Currently only https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/bric3/drain/[snapshots] are available, by adding this repository:
[source, kotlin]
----
repositories {
    maven {
        url = uri("https://s01.oss.sonatype.org/content/repositories/snapshots/")
    }
}
----
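Then declare the dependency; the version below is only a placeholder, pick an actual published snapshot version.

[source, kotlin]
----
dependencies {
    // group and artifact as documented above; the version is a placeholder
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}
----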
=== From command line
Since this tool is not yet released, it needs to be built locally. Also, the built jar is not yet very user-friendly; since it's not a finished product, anything could change.
.Example usage
[source, shell]
----
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h
tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE ...
      FILE              log file
  -d, --drain           use DRAIN to extract log patterns
  -f, --follow          output appended data as the file grows
  -h, --help            Show this help message and exit.
  -n, --lines=NUM       output the last NUM lines, instead of the last 10; or
                          use -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                        when using DRAIN remove the left part of a log line up
                          to after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                        when using DRAIN remove the left part of a log line up
                          to COLUMN
  -V, --version         Print version information and exit.
      --verbose         Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64
----
By default, the tool acts similarly to `tail`, and it will output the file to stdout.
The tool can follow a file if the `--follow` option is passed.
However, when run with the `--drain` option, the tool classifies log lines using DRAIN and outputs the identified clusters.
Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).
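For example, to watch a growing log file like `tail -f` would (the log file path below is only illustrative):

[source, shell]
----
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --follow /var/log/app.log
----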
On the SSH log data set we can use it this way.
[source, shell]
----
$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \
    -d \ <1>
    -n 0 \ <2>
    --parse-after-str "]: " \ <3>
    build/resources/test/SSH.log <4>
----
<1> Identify patterns in the log
<2> Starts from the beginning of the file (otherwise it starts from the last 10 lines)
<3> Remove the left part of the log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), i.e. effectively ignoring some variable elements like the time.
<4> The log file
.Log pattern clusters and their occurrences
[source]
----
Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters <1>
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 <2>
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
----
<1> 51 types of logs were identified from 655147 lines in 1.588 s
<2> There were 140768 similar log messages with this pattern, with 3 positions where the token was identified as a parameter `<*>`.
On the same data set, this Java implementation performed roughly 10 times faster than Drain3. As this implementation does not yet have masking, the mask configuration was removed from the Drain3 implementation for the comparison.
=== From Java
This tool is not yet intended to be used as a library, but for the curious the DRAIN algorithm can be used this way:
.Minimal DRAIN example
[source, java]
----
var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();
Files.lines(Paths.get("build/resources/test/SSH.log"), StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with the clusters
drain.clusters();
----
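Since masking is not implemented yet, the prefix removal that the command line tool performs with `--parse-after-str` can be emulated in plain Java before handing each line to DRAIN. The sketch below reuses the `drain` instance from the previous example and the `"]: "` separator from the SSH example above; the mapping lambda is only an illustration, not an API of drain-java.

[source, java]
----
String separator = "]: ";
Files.lines(Paths.get("build/resources/test/SSH.log"), StandardCharsets.UTF_8)
     .map(line -> {
         // keep only the content after the first "]: " occurrence,
         // similarly to what --parse-after-str does in the tailer tool
         int idx = line.indexOf(separator);
         return idx >= 0 ? line.substring(idx + separator.length()) : line;
     })
     .forEach(drain::parseLogMessage);
----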
== Status
Pieces of the puzzle are coming in no particular order: I first bootstrapped the code from a simple Java file, then wrote a Java implementation of DRAIN. Here's what I would like to do next.
.Todo
- [ ] More unit tests
- [x] Wire things together
- [ ] More documentation
- [x] Implement tail follow mode (currently in drain mode the whole file is read and processing stops once finished)
- [ ] In follow drain mode, dump clusters on forced exit (e.g. when hitting `ctrl+c`)
- [x] Start reading from the last x lines (like `tail -n 30`)
- [ ] Implement log masking (e.g. a log may contain an email or an IP address, which may be considered private data)
.For later
- [ ] JSON message field extraction
- [ ] How to handle prefixes: dates, log level, etc.; possibly using masking
- [ ] Investigate markers with specific behavior, e.g. log level severity
- [ ] Investigate logs with stacktraces (likely multiline)
- [ ] Improve handling of very long lines
- [ ] Logback appender with micrometer counter
== Motivation
I was inspired by a https://sayr.us/log-pattern-recognition/logmine/[blog article from one of my colleagues on LogMine] (many thanks to him for doing the initial research and explaining the concepts). We were both impressed by the log pattern extraction of https://docs.datadoghq.com/logs/explorer/patterns/[Datadog's Log explorer], and his blog post sparked my interest.
After some discussion together, we saw that DRAIN was a bit superior to LogMine. Googling for DRAIN in Java didn't yield any results; although I certainly didn't search exhaustively, this triggered the idea to implement the algorithm in Java.
== References
drain-java is mostly a port of https://github.com/IBM/Drain3[Drain3], done by IBM folks (David Ohana, Moshik Hershcovitch). IBM's Drain3 is itself a fork of the https://github.com/logpai/logparser[original work] done by the LogPai team, based on the paper by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.
I didn't track down other contributors to these projects; reach out if you think you have been omitted.
For reference, here are the links I looked at:
- https://logparser.readthedocs.io/
- https://github.com/logpai/logparser
- https://github.com/IBM/Drain3
- https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf (a copy of this publication is accessible link:doc/pjhe_icws2017.pdf[here])