
Question: What should a logging stack built around OkLog look like?

Open · zarbis opened this issue on Jan 18 '18 · 9 comments

I like the design of OkLog and am displeased with ELK at the same time, so I want to try to implement a logging stack built around it. My main use case is manually consulting application logs when applications behave weirdly. So no automated analytics or anything where 100% of logs should be parsed into structured documents.

OkLog covers ingestion and storage pretty neatly; the question is on the querying/parsing side. AFAIK there is no "Kibana for OkLog", and one can hardly be made (at least not a universal one), since it stores raw unstructured logs and makes no attempt to parse them like Logstash does.

If I understand the intent correctly, OkLog users should create their own querying UI that would "grep" raw log entries and then optionally parse them in an application-specific way to further filter or aggregate results.

Although I like the idea of not parsing logs prematurely, since 99.995% of them will never be seen by an operator in my case, I'm not sure how smooth the real-time querying+parsing experience would be. I know that parsing performance has nothing to do with OkLog itself, but I want to know whether it's a viable idea that has been successfully implemented by someone.

Or maybe I'm completely misinterpreting OkLog's usage intent, in which case I would like to know what the proper intended design of a logging stack revolving around OkLog looks like.

zarbis · Jan 18 '18 13:01

Hi @zarbis, before going further into detail, have you checked the built-in UI?

xla · Jan 18 '18 13:01

@xla last time I had a quick look, the only thing it did for me was show the count of matched entries without actually displaying any of them. It might be entirely me doing something stupid, but at the time I assumed it's mostly an interactive query debugger/verifier (for queries that you would later put into your own system) rather than an actual query UI.

zarbis · Jan 18 '18 14:01

This sounds like an error in the UI itself. It's meant to be the go-to tool to explore your log corpus interactively.

If I understand the intent correctly, OkLog users should create their own querying UI that would "grep" raw log entries and then optionally parse them in an application-specific way to further filter or aggregate results.

What are examples of application-specific ways for filtering and aggregation?

Although I like the idea of not parsing logs prematurely, since 99.995% of them will never be seen by an operator in my case, I'm not sure how smooth the real-time querying+parsing experience would be. I know that parsing performance has nothing to do with OkLog itself, but I want to know whether it's a viable idea that has been successfully implemented by someone.

The CLI as well as the UI have shown very reasonable performance for real-time querying and streaming of large log volumes. When you mention parsing, does that go back to the understanding of log lines as structured documents mentioned above, or is it something else?

xla · Jan 18 '18 14:01

What are examples of application-specific ways for filtering and aggregation?

For example, a backend application outputs its logs in JSON or a similar structured but not "grep-friendly" format and exposes things like incoming response body/headers, response codes and other "free-form commentaries".

It is useful to see the number of responses grouped by response code and then take a look at what the majority of those "free-form commentaries" are. For example, if I see a large number of 4XX and 5XX and 95% of them carry a "db connection timeout" commentary, I immediately know where to dig further.

While the first layer of aggregation, by response codes or request paths, steps into Prometheus territory, the second layer of aggregation, by error messages, stack traces or other free-form context entities, is not really suitable for metrics and leans toward logs.

When you mention parsing, does that go back to the understanding of log lines as structured documents mentioned above, or is it something else?

Exactly that. Implementing the scenario described above would require treating log lines like structured documents, which seems too much to ask of OkLog and would require a second custom layer on top of the "just greppin'" OkLog API.
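
For illustration, such a second layer could be fairly small: a filter that is fed raw lines from the query API, parses each line as JSON, and aggregates client-side. A minimal sketch in Go (the field names response_code and err are made up for this example, and nothing here is part of OkLog):

// A hypothetical client-side layer on top of OkLog's grep-style query API:
// read raw JSON log lines from stdin (e.g. piped from the oklog query CLI),
// count entries per response code, and list the most frequent "err" messages
// among 5xx responses.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strconv"
)

// entry models only the (assumed) fields we aggregate on.
type entry struct {
	ResponseCode int    `json:"response_code"`
	Err          string `json:"err"`
}

// printSorted prints a count map, most frequent first.
func printSorted(counts map[string]int) {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })
	for _, k := range keys {
		fmt.Printf("%8d %s\n", counts[k], k)
	}
}

func main() {
	codes := map[string]int{}
	errs := map[string]int{}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long log lines
	for sc.Scan() {
		var e entry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // not a structured JSON line; skip it
		}
		codes[strconv.Itoa(e.ResponseCode)]++
		if e.ResponseCode >= 500 && e.Err != "" {
			errs[e.Err]++
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading stdin:", err)
		os.Exit(1)
	}

	fmt.Println("responses by code:")
	printSorted(codes)
	fmt.Println("top 5xx error messages:")
	printSorted(errs)
}

Something like this would be fed from the query API and run per investigation; whether such an extra layer is worth building and maintaining is exactly what I'm unsure about.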

zarbis · Jan 18 '18 15:01

Thanks for these use cases; we're actively thinking about what an indexing/searching component would look like, and concrete requirements are extremely useful for that. With that said, I think you can get pretty close to what you want with the oklog query command-line tool, combined with other common tools, in true UNIX fashion.

It is useful to see the number of responses grouped by response code

$ oklog query -q '"response_code":' | head -n3
{... "response_code": 200, "path": "/good", ...}
{... "response_code": 503, "path": "/bad", ...}
{... "response_code": 200, "path": "/other", ...}

$ oklog query -q '"response_code":' | jq .response_code | head -n3
200
503
200

$ oklog query -q '"response_code":' | jq .response_code | sort -n | uniq -c | sort -rn
   2335 200
    103 503
      2 404

and then take a look at what the majority of those "free-form commentaries" are

$ oklog query -q '"response_code": 503' | head -n3
{... "response_code": 503, "path": "/bad", "err": "bad route", ...}
{... "response_code": 503, "path": "/login", "err": "DB connection error: 10200", ...}
{... "response_code": 503, "path": "/purchase", "err": "DB connection error: 10200", ...}

$ oklog query -q '"response_code": 503' | jq -r .err | sort | uniq -c | sort -rn
  95 DB connection error: 10200
   7 solar flare corrupted one bit of memory at 0x0000f000
   1 bad route

peterbourgon · Jan 18 '18 16:01

@peterbourgon I think it should follow its motto of "Prometheus for logs" and implement a simple UI that allows drawing basic conclusions from query results, rather than just providing a filtered dump of its database. In Prometheus this comes in two ways:

  1. the query language supports nested filters and aggregations
  2. the UI has quick and dirty visualization; Grafana will always be better for "serious" use, but the ability to have a quick but meaningful glance over stored data is valuable.

Of course I'm not encouraging you to write your own PromQL, but rather to implement a UNIX-style pipeline, where a query can consist of basic "grep"s, "sort"s, "head"s and "uniq"s. This would work on the broadest set of log formats at a minimum comfort level and would be a good first step before thinking about implementing format-specific parsing.
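
As a rough illustration of that first, format-agnostic step (again only a sketch of the idea, not an existing OkLog feature), such a pipeline stage could stay entirely on raw lines: filter by a substring and count duplicate matching lines, deferring all parsing:

// A sketch of the format-agnostic pipeline stage described above: filter raw
// log lines from stdin by a substring and print how often each distinct
// matching line occurs, most frequent first. Roughly
// `grep PATTERN | sort | uniq -c | sort -rn`, with no parsing of the format.
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: grepcount SUBSTRING < logs")
		os.Exit(2)
	}
	pattern := os.Args[1]

	counts := map[string]int{}
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long log lines
	for sc.Scan() {
		if line := sc.Text(); strings.Contains(line, pattern) {
			counts[line]++
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading stdin:", err)
		os.Exit(1)
	}

	lines := make([]string, 0, len(counts))
	for l := range counts {
		lines = append(lines, l)
	}
	sort.Slice(lines, func(i, j int) bool { return counts[lines[i]] > counts[lines[j]] })
	for _, l := range lines {
		fmt.Printf("%8d %s\n", counts[l], l)
	}
}

Operators like this compose over any log format, and format-specific parsing can then be layered on top only where it pays off.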

zarbis · Jan 18 '18 17:01

@zarbis regarding the UI - you may have run into issue #63

What browser are you using?

timwebster9 · Jan 21 '18 12:01

@timwebster9 I've actually commented on that issue after this discussion.

zarbis · Jan 21 '18 13:01

The CLI as well as the UI have shown very reasonable performance for real-time querying and streaming of large log volumes.

@xla do you mind quantifying this to make it more concrete?

yurishkuro · Jan 21 '18 16:01