Add an option to start an "silent" action
I think it could be useful if it was possible to start/finish an action, but not having the start/finish messages logged out. That would mean we'd still be able to track individual tasks (task_uuid) and order/level of the tasks ( task_level), but we don't care when they were started/finished.
That would also mean a possibility to start&finish action silently - without any messages explicitly logged out, an action would be invisible. And that's actually cool - in our case, we don't care about 80% of the requests & so we don't want to have anything logged. If the "silent" actions were available, we could simply start a task/action all the time, then log something if we need it, without getting hundreds of meaningless lines.
Do you think this would make sense?
We've been talking at ClusterHQ about similar issues where we have noisy logs, we'll see what we come up with. One worry I have with silent actions is that it makes the task tree have missing nodes, which will be confusing (and hard to distinguish from logging bugs).
Instead of a silent action, what about an abridged action? I think the only output necessary to avoid the missing nodes problem is at task ID and a task level.
Well, abridged action wouldn't really help us - we literally don't care about 80%+ of the requests, so even abridged start+end would result in a lot of clutter. It could be made configurable, I guess?
I'm sure Itamar will come up with something cool though :)
You say you "don't care about 80% of messages". Why are they logged in the first place? Could you remove that logging? Or is it that you sometimes care and sometimes don't? In which case, how do you distinguish between the cases?
Here's a use case we have: we're using machinist state machine library, the logs state transitions using Eliot. In a complex state machine this is probably useful. In our case it turns out we don't actually care for some of our state machines.
The use case was kind of an on-line store. As long as people are browsing through products, we don't really want to log anything - we're indexing ngingx access logs, so we don't need logs.
But, once a user performs an actual action (adding to a basket, paying, etc) all kind of logging kicks in, with multiple messages and levels - we wanted to have whole all logs coming from a single request to share task_uuid and levels mechanism.
Now, I could modify our logger to actually start action as soon as we log anything, but that quickly becomes messy - storing states, manually closing things, handling exceptions. Having starting a single "silent" action per-request, with a simple context manager, seemed like a much cleaner idea, and, honestly, easier to implement on eliot itself?
Hm. If you want to suppress a full action, I can imagine something like with silence_logging(), yeah. Not sure if that quite deals with our use case...
Internally we've discussed post-facto silencing: "I've discovered current action isn't worth logging". This requires some sort of buffering layer, otherwise those logs will end up being logged before you can make the decision. On the other hand you don't want to lose them if the process crashes.
http://www.exampler.com/writing/ring-buffer.pdf talks about some relevant patterns. @jml
Sorry, I don't know how to solve your use case. Also, I don't see how a ring-buffer could help :/
I'm not sure if both of our cases can we solved by the same feature. I mean, if you figure out how to solve your thing, I'm sure we could adapt it to our problem (since our is much simpler :)), but maybe we would need two separate solutions?
Have you considered an external post-processing tool? Like, still dumping everything on disk, and only then filtering things out? I cannot say if it would be worth the effort though.
Or, some kind of learning system? Make it guess if a message is likely to be interesting? That's assuming you can accept having some false negatives?
Your use case is as I understand that you know in advance whether a set of actions are important. Ours is post-facto - 90%-through we realize that this action has been a no-op so not worth logging. So yeah more complex use case, and supporting yours would be easier.
So here's a sketch of a design (also relevant to use case from @keturn):
- Provide a passthrough filtering destination for logs which is configured with another destination, the one you want logs to end up when you care about them, e.g. /var/log or logstash or whatever.
- The filtering destination has an on-disk task level temporary ring buffer. That is, it stores up to N tasks, when N+1 arrives the oldest finished one is deleted. Because it'll be a relatively small number the hope is it'll mostly stay in OS buffers and never actually hit disk.
- When certain conditions are met (could be explicit request from business logic code "this task is about logged in user", or could be pre-configured rules within the filter itself like "eliot:traceback was logged") a specific task, or sometimes all current in-progress tasks, are marked as important. That means all messages in the ring buffer and all future messages are for that particular task are passed on to the final destination.
Bunch of edge cases to think though, but the goal is:
- Only important tasks are passed on to the more permanent log storage... and what gets passed through is the whole task.
- Importance of a task does not need to be decided in advance.
- In-progress and recent tasks are available for post-mortem debugging after process crashes.
@jml might be interested too --^
Downsides of this design is that you still have CPU overhead of serializing messages you mostly don't care about.
I guess in-memory ring buffer is also an option for people who want to swap crash-resistance for performance.
Sounds pretty sensible.
Would suggest making the filtering abstraction completely unaware of ring buffer and the like, and providing an implementation that provides the ring buffer stuff.
Not 100% clear how one would implement a filter for "only failed tasks", nor how/whether that would summarize a lot of the boring logs that make up that task (e.g. in our machinist examples). Would like to see an imagined example.
I guess the start action / stop action thing is in fundamental tension with crash-resistance.
I'd like to see an imagined before and after example using the flocker logs to get a better idea of what we're aiming for.
- This does not involve any summarization; the filter is "include whole task or not'. Possibly summarization is better handled by omitting in the code log messages we know are pointless, though that likely requires new eliot APIs too...
- "Only failed tasks" is application-specific and perhaps not quite comprehensive enough, which is why I talked about importance. Some variations of importance might include:
- A
eliot:tracebackwas logged. - Any action within the task failed.
- The top-level action (the task) failed.
- This is an operation by a logged-in user (use case from @rabbbit).
- A deadlock occurred; this means all in-progress tasks are relevant (use case from @keturn).
- Flocker convergence loop iterations that are not boring, where boring is defined as "configuration and state were the same as previous iteration and no actions were calculated as being necessary."
- Task took longer than 100ms.
- Task had a particular side-effect we care about, e.g. queued an item in Celery (use case from @keturn).
- A
- It's not clear to me if ring buffer can really be omitted given you don't know importance up front. But filtering needn't be tied to implementation of ring buffer, indeed.
And yeah, will want to provide a bunch of examples before proceeding.
I added a coupe of use cases above.