Only consider maximum N characters for matching per line

Open hpcorona opened this issue 8 years ago • 2 comments

Currently there is a maximum match_limit and a recursion limit set for PCRE:

https://github.com/tstack/lnav/blob/7c74ecf1e73190c6192b861dfd83b626e1af43d1/src/pcrepp.hh#L561

But this causes very long lines (for example, lines of more than 100,000 characters) to fail with:

pcrepp.hh:481 pcre err -8

It would be really nice if we could specify the maximum number of characters to try to match per line, and also tell the formatter that any extra bytes after the match should be treated as part of the "body" field.

For example consider this "simplified" regex:

"xLineFormat" : {
    "pattern": "^(?<timestamp>) (?<level>) ",
    "match_limit": 30,
    "field_for_extra_characters": "body",
    "extra_characters_include_multilines": true
}

Now consider the following sample line:

"sample": [
    {
        "line": "2016-09-28 12:00:00 DEBUG <THIS IS A REALLY LONG DEBUG LINE THAT MAY SPAN\nOVER SEVERAL LINES>"
    }
]

In the example above, we would only consider the first 30 characters for matching. If timestamp and level both match, then all the remaining characters would be used for the body field.

This would help a lot with performance, and with parsing files that currently cannot be handled in SQL.
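
The proposed behaviour can be sketched outside lnav. The following is a minimal illustration in Python, not lnav code: the function name `parse_with_limit` and its parameters are hypothetical, the timestamp and level sub-patterns are filled in only so the example runs, and Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`.

```python
import re

# Hypothetical prefix-only pattern: it covers the timestamp and level, never the body.
# The concrete sub-patterns are assumptions made so that the sketch is runnable.
PREFIX = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) "
)


def parse_with_limit(line, match_limit=30, extra_field="body"):
    """Match only the first match_limit characters of line.

    Everything after the end of the match is stored, unexamined by the
    regex engine, under extra_field. Returns None if the prefix does not match.
    """
    m = PREFIX.match(line[:match_limit])
    if m is None:
        return None
    fields = m.groupdict()
    # The remainder is sliced from the original line, so its length never
    # affects the cost of the regex match.
    fields[extra_field] = line[m.end():]
    return fields


line = (
    "2016-09-28 12:00:00 DEBUG <THIS IS A REALLY LONG DEBUG LINE THAT MAY SPAN\n"
    "OVER SEVERAL LINES>"
)
fields = parse_with_limit(line)
```

Because the regex only ever sees `match_limit` characters, the work done by the matcher stays constant however long the line is, which is the point of the request.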

hpcorona, Sep 29 '16 05:09

+1

In my case I am trying to parse Java log files with large stack traces. Lnav chokes on any 'body' match over 84 KB, resulting in an unformatted line (no fields parsed) and a debug log message:

 2018-11-19T13:45:40.476 E pcrepp.hh:516 pcre err -27

My workaround is to increase the JIT_STACK_MAX_SIZE constant in src/pcrepp.cc and recompile. I imagine this hurts performance, though.

I like @hpcorona's suggestion for body-less regexes, where anything unmatched at the end is implicitly the body.

For the record, here is my pattern with a newline-inclusive body match:

{
        "java_log" : {
                "regex" : {
                        "extralong format" : {
                                "pattern" : "^(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) (?<alert_level>\\w+) (?<body>(?:.|\\n)*)$"
                        }
                }
        }
}

and a sample line:

2018-11-19 12:06:34,252 ERROR java.lang.IllegalStateException: java.io.FileNotFoundException
        at org.apache.catalina.webresources.AbstractSingleArchiveResourceSet.getArchiveEntry(AbstractSingleArchiveResourceSet.java:100)
        at .... 800 lines/85kb of stacktrace 

jefft, Nov 19 '18 03:11

@jefft I have tried your pattern in regex101 and it consumes the whole file, not just the one or more lines up to the point where the next timestamp pattern occurs.

^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?<alert_level>\w+) (?<body>(?:.|\n[^\d\n])*)$

Only if I supply a break pattern, i.e. a newline followed by a character that is neither a digit nor a newline, does it consume only the wanted lines. Could you double-check?
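
The difference between the two body patterns can be reproduced with a short sketch. It uses Python's `re` module rather than PCRE (so named groups are written `(?P<name>...)`), and it assumes multiline mode so that `^` and `$` match at line boundaries when the whole text is tested at once. The sample text is made up for illustration.

```python
import re

HEAD = (
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<alert_level>\w+) "
)
# Original body: any character or any newline, so nothing stops it.
GREEDY = re.compile(HEAD + r"(?P<body>(?:.|\n)*)$", re.MULTILINE)
# Break pattern: a newline is only consumed when the next character is
# neither a digit nor another newline.
BOUNDED = re.compile(HEAD + r"(?P<body>(?:.|\n[^\d\n])*)$", re.MULTILINE)

# Two records; the first has one continuation (stack trace) line.
text = (
    "2018-11-19 12:06:34,252 ERROR java.lang.IllegalStateException: boom\n"
    "\tat org.example.Foo.bar(Foo.java:100)\n"
    "2018-11-19 12:06:35,001 INFO next record\n"
)

greedy_body = GREEDY.match(text).group("body")
bounded_body = BOUNDED.match(text).group("body")
records = [m.group("timestamp") for m in BOUNDED.finditer(text)]
```

Here `greedy_body` runs on into the second record, while `bounded_body` stops at the end of the stack trace line and `records` holds both timestamps. One consequence of the break pattern is that a continuation line beginning with a digit would also end the body early.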

stefan123t, May 16 '22 13:05