= drain java

This is a pet project to explore log pattern extraction using DRAIN.
image:https://github.com/bric3/drain-java/actions/workflows/gradle.yml/badge.svg[Java CI with Gradle,link=https://github.com/bric3/drain-java/actions/workflows/gradle.yml]
== Introduction
drain-java is a continuous log template miner: for each log message it extracts tokens and groups them into clusters of tokens. As new log messages are added, drain-java identifies similar tokens and updates the matching cluster with a new template, or simply creates a new token cluster. Each time a cluster is matched, its counter is incremented.

These clusters are stored in a prefix tree, which is somewhat similar to a trie, but here the tree has a fixed depth in order to avoid long tree traversals. Avoiding deep trees also helps to keep the tree balanced.
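For example, two messages that differ only by one variable token end up in the same cluster, where the varying position is replaced by the `<*>` wildcard. The following sketch uses the builder API shown further down in this README; the log lines are made up, and the printed form of a cluster depends on its `toString()` implementation.

[source, java]
----
var drain = Drain.drainBuilder()
                 .depth(4)
                 .build();
// two similar messages, only the IP address changes
drain.parseLogMessage("Connection closed by 10.0.0.1 [preauth]");
drain.parseLogMessage("Connection closed by 10.0.0.2 [preauth]");
// both messages are expected to fall into a single cluster whose template
// is roughly: Connection closed by <*> [preauth]   (matched 2 times)
drain.clusters().forEach(System.out::println);
----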
== Usage
First, https://foojay.io/almanac/jdk-11/[Java 11] is required to run drain-java.
=== As a dependency
You can consume drain-java as a dependency in your project via the `io.github.bric3.drain:drain-java-core` coordinates.
Currently only https://s01.oss.sonatype.org/content/repositories/snapshots/io/github/bric3/drain/[snapshots] are available, by adding this repository:
[source, kotlin]
----
repositories {
    maven {
        url = uri("https://s01.oss.sonatype.org/content/repositories/snapshots/")
    }
}
----
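Then declare the dependency; the version below is only a placeholder, pick an actual published snapshot version.

[source, kotlin]
----
dependencies {
    // group and artifact as documented above; the version is a placeholder
    implementation("io.github.bric3.drain:drain-java-core:0.1.0-SNAPSHOT")
}
----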
=== From command line
Since this tool is not yet released, it needs to be built locally. Also, the built jar is not yet very user-friendly; since it's not a finished product, anything could change.
.Example usage
[source, shell]
----
$ ./gradlew build
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar -h
tail - drain
Usage: tail [-dfhV] [--verbose] [-n=NUM]
            [--parse-after-str=FIXED_STRING_SEPARATOR]
            [--parser-after-col=COLUMN] FILE ...
      FILE              log file
  -d, --drain           use DRAIN to extract log patterns
  -f, --follow          output appended data as the file grows
  -h, --help            Show this help message and exit.
  -n, --lines=NUM       output the last NUM lines, instead of the last 10; or
                          use -n 0 to output starting from beginning
      --parse-after-str=FIXED_STRING_SEPARATOR
                        when using DRAIN remove the left part of a log line up
                          to after the FIXED_STRING_SEPARATOR
      --parser-after-col=COLUMN
                        when using DRAIN remove the left part of a log line up
                          to COLUMN
  -V, --version         Print version information and exit.
      --verbose         Verbose output, mostly for DRAIN or errors
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --version
Versioned Command 1.0
Picocli 4.6.3
JVM: 19 (Amazon.com Inc. OpenJDK 64-Bit Server VM 19+36-FR)
OS: Mac OS X 12.6 x86_64
----
By default, the tool acts similarly to `tail`, and it will output the file to stdout.
The tool can follow a file if the `--follow` option is passed.
However, when run with the `--drain` option, the tool classifies log lines using DRAIN and outputs the identified clusters.
Note that this tool doesn't handle multiline log messages (like logs that contain a stacktrace).
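For example, to watch a growing log file like `tail -f` would (the log file path below is only illustrative):

[source, shell]
----
$ java -jar tailer/build/libs/tailer-0.1.0-SNAPSHOT-all.jar --follow /var/log/app.log
----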
On the SSH log data set we can use it this way.
[source, shell]
----
$ java -jar build/libs/drain-java-1.0-SNAPSHOT-all.jar \
    -d \ <1>
    -n 0 \ <2>
    --parse-after-str "]: " \ <3>
    build/resources/test/SSH.log <4>
----
<1> Identify patterns in the log
<2> Starts from the beginning of the file (otherwise it starts from the last 10 lines)
<3> Remove the left part of the log line (`Dec 10 06:55:46 LabSZ sshd[24200]: `), i.e. effectively ignoring some variable elements like the time.
<4> The log file
.Log pattern clusters and their occurrences
[source]
----
Done processing file. Total of 655147 lines, done in 1.588 s, 51 clusters <1>
0010 (size 140768): Failed password for <*> from <*> port <*> ssh2 <2>
0009 (size 140701): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0007 (size 68958): Connection closed by <*> [preauth]
0008 (size 46642): Received disconnect from <*> 11: <*> <*> <*>
0014 (size 37963): PAM service(sshd) ignoring max retries; <*> > 3
0012 (size 37298): Disconnecting: Too many authentication failures for <*> [preauth]
0013 (size 37029): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*> <*>
0011 (size 36967): message repeated <*> times: [ Failed password for <*> from <*> port <*> ssh2]
0006 (size 20241): Failed <*> for invalid user <*> from <*> port <*> ssh2
0004 (size 19852): pam unix(sshd:auth): check pass; user unknown
0001 (size 18909): reverse mapping checking getaddrinfo for <*> <*> failed - POSSIBLE BREAK-IN ATTEMPT!
0002 (size 14551): Invalid user <*> from <*>
0003 (size 14551): input userauth request: invalid user <*> [preauth]
0005 (size 14356): pam unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= <*>
0018 (size 1289): PAM <*> more authentication <*> logname= uid=0 euid=0 tty=ssh ruser= <*>
0024 (size 952): fatal: Read from socket failed: Connection reset by peer [preauth]
...
----
<1> 51 types of logs were identified from 655147 lines in 1.588 s
<2> There were 140768 similar log messages with this pattern, with 3 positions where the token was identified as a parameter `<*>`.
On the same data set, this Java implementation performed roughly 10 times faster than Drain3. As this implementation does not yet have masking, the mask configuration was removed from the Drain3 implementation for the comparison.
=== From Java
This tool is not yet intended to be used as a library, but for the curious the DRAIN algorithm can be used this way:
.Minimal DRAIN example
[source, java]
----
var drain = Drain.drainBuilder()
                 .additionalDelimiters("_")
                 .depth(4)
                 .build();
Files.lines(Paths.get("build/resources/test/SSH.log"), StandardCharsets.UTF_8)
     .forEach(drain::parseLogMessage);

// do something with the clusters
drain.clusters();
----
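Since masking is not implemented yet, the prefix removal that the command line tool performs with `--parse-after-str` can be emulated in plain Java before handing each line to DRAIN. The sketch below reuses the `drain` instance from the previous example and the `"]: "` separator from the SSH example above; the mapping lambda is only an illustration, not an API of drain-java.

[source, java]
----
String separator = "]: ";
Files.lines(Paths.get("build/resources/test/SSH.log"), StandardCharsets.UTF_8)
     .map(line -> {
         // keep only the content after the first "]: " occurrence,
         // similarly to what --parse-after-str does in the tailer tool
         int idx = line.indexOf(separator);
         return idx >= 0 ? line.substring(idx + separator.length()) : line;
     })
     .forEach(drain::parseLogMessage);
----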
== Status
Pieces of the puzzle are coming in no particular order: I first bootstrapped the code from a simple Java file, then wrote a Java implementation of DRAIN. Here's what I would like to do next.
.Todo
- [ ] More unit tests
- [x] Wire things together
- [ ] More documentation
- [x] Implement tail follow mode (currently in drain mode the whole file is read and processing stops once finished)
- [ ] In follow drain mode, dump clusters on forced exit (e.g. when hitting `ctrl+c`)
- [x] Start reading from the last x lines (like `tail -n 30`)
- [ ] Implement log masking (e.g. a log may contain an email or an IP address, which may be considered private data)
.For later
- [ ] JSON message field extraction
- [ ] How to handle prefixes: dates, log level, etc.; possibly using masking
- [ ] Investigate markers with specific behavior, e.g. log level severity
- [ ] Investigate logs with stacktraces (likely multiline)
- [ ] Improve handling of very long lines
- [ ] Logback appender with micrometer counter
== Motivation
I was inspired by a https://sayr.us/log-pattern-recognition/logmine/[blog article from one of my colleagues on LogMine] (many thanks to him for doing the initial research and explaining the concepts). We were both impressed by the log pattern extraction of https://docs.datadoghq.com/logs/explorer/patterns/[Datadog's Log explorer], and his blog post sparked my interest.
After some discussion together, we saw that DRAIN was a bit superior to LogMine. Googling for DRAIN in Java didn't yield any results; although I certainly didn't search exhaustively, this triggered the idea to implement the algorithm in Java.
== References
drain-java is mostly a port of https://github.com/IBM/Drain3[Drain3], done by IBM folks (David Ohana, Moshik Hershcovitch). IBM's Drain3 is itself a fork of the https://github.com/logpai/logparser[original work] done by the LogPai team, based on the paper by Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu.
I didn't track down other contributors to these projects; reach out if you think you have been omitted.
For reference, here are the links I looked at:
- https://logparser.readthedocs.io/
- https://github.com/logpai/logparser
- https://github.com/IBM/Drain3
- https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf (a copy of this publication is accessible link:doc/pjhe_icws2017.pdf[here])