pueue
Filter log output by stdout/stderr, and time
The PR https://github.com/Nukesor/pueue/pull/286 combined the two streams of a task's output back into a single output.
It would now be awesome to be able to filter for either stderr/stdout, and maybe even by time.
For this to work, we would need some way to combine the two output streams without losing temporal/chronological information. Furthermore, the resulting output should use some kind of structured data format that supports streaming.
The requirements for this feature are:
- Task log output should be in the correct chronological order.
- Log output should be error-resistant (partial writes shouldn't break the whole log)
- We have to be able to differentiate between stderr and stdout output, i.e. we have to be able to sort by it.
- Remote streaming should still be performant (the data format should support streaming)
- The data format should support metadata such as timestamps
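Purely as an illustration of what such a format could look like: line-delimited records, each carrying a timestamp and a stream tag, can be written and read as a stream and filtered per record. The field layout and delimiter here are made up, not an actual pueue format:

```rust
use std::fmt;

// Which stream a log line came from.
#[derive(Debug, PartialEq)]
enum Stream {
    Stdout,
    Stderr,
}

// One log record: timestamp (unix millis), origin stream, and the line itself.
struct Record {
    timestamp_ms: u64,
    stream: Stream,
    line: String,
}

impl fmt::Display for Record {
    // Serialize as a single line, so the log can be written and consumed
    // incrementally without loading the whole file into memory.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let tag = match self.stream {
            Stream::Stdout => "stdout",
            Stream::Stderr => "stderr",
        };
        write!(f, "{}|{}|{}", self.timestamp_ms, tag, self.line)
    }
}

fn main() {
    let records = vec![
        Record { timestamp_ms: 1000, stream: Stream::Stderr, line: "ERR failure".into() },
        Record { timestamp_ms: 1004, stream: Stream::Stdout, line: "GOOD progress".into() },
    ];

    // Filtering by stream is then a simple per-record check.
    for r in records.iter().filter(|r| r.stream == Stream::Stderr) {
        println!("{}", r);
    }
}
```

A real format would likely need escaping for the delimiter or a proper serialization library, but the per-line structure is what makes filtering and streaming cheap.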
We should avoid loading whole files into memory. If we do have to load them, we should work with streams and utilize in-memory compression wherever possible. Log files can be huge, and we have already received reports of crashing PCs due to high RAM usage.
Help, suggestions or even PRs are very welcome!
Originating issue: #228
I would be open to contributing. It will take me a little bit to get more familiar with the codebase, but I can send an update in a couple of days if you're fine with that.
Feel free to tinker around a bit :)
I would love to hear about your approach on this problem.
Once you have some vision on how to implement this, it would be awesome to hear about it.
Especially how you plan to combine the two streams into a single stream while losing as little (or no) temporal information as possible, and what kind of data format you have in mind, since it has to support streaming.
I stumbled upon os_pipe while researching how to merge pipes. Not sure how useful this might be, though.
In my mind such an implementation would probably need a thread per pipe per task?
An alternative to raw threads would probably be some async impl on top, so we don't spawn a huge number of actual threads but rather let the tokio scheduler handle the rest.
The model of one thread/async-task per pipe might get problematic though, if a user runs a hundred tasks in parallel and each task produces a huge amount of logs. I.e., what happens if the log threads/async-tasks don't manage to keep up with the tasks' output? It would also be interesting to investigate how this behaves if the system has a high load in general.
The best option would obviously be a solution that doesn't need threading at all. Not sure if this can be easily achieved, though.
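A minimal sketch of the thread-per-pipe idea, using std threads and an mpsc channel to funnel both pipes into one tagged, timestamped sequence. The command, tags, and function names are just placeholders for illustration:

```rust
use std::io::{BufRead, BufReader, Read};
use std::process::{Command, Stdio};
use std::sync::mpsc;
use std::thread;
use std::time::{SystemTime, UNIX_EPOCH};

// Read one pipe line by line (a blocking read, so no busy polling) and
// forward each line, tagged with its origin and a read timestamp, into
// the shared channel.
fn forward<R: Read + Send + 'static>(
    pipe: R,
    tag: &'static str,
    tx: mpsc::Sender<(u128, &'static str, String)>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for line in BufReader::new(pipe).lines().flatten() {
            let now = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .unwrap()
                .as_millis();
            // Ignore send errors: the receiver may already be gone on shutdown.
            let _ = tx.send((now, tag, line));
        }
    })
}

fn main() {
    let mut child = Command::new("sh")
        .args(["-c", "echo out_line; echo err_line >&2"])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .expect("failed to spawn child");

    let (tx, rx) = mpsc::channel();
    let out = forward(child.stdout.take().unwrap(), "stdout", tx.clone());
    let err = forward(child.stderr.take().unwrap(), "stderr", tx);

    out.join().unwrap();
    err.join().unwrap();
    child.wait().unwrap();

    // The receiver sees a single merged, timestamped stream of lines.
    for (ts, tag, line) in rx {
        println!("{ts} {tag} {line}");
    }
}
```

Note that the timestamps here record when each line was *read*, not when it was written, so this sketch is still subject to the late-flushing problem discussed further down.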
I haven't had too much spare time to start digging into this yet, but my initial plan was to have a separate thread for stderr and stdout like you mentioned. It's possible that there is a more proper technique, but it seemed like a simple way to do blocking reads and avoid having to poll continuously. From there I was just planning on reading each of the streams by line and outputting the line along with a timestamp for when it was read. With that I was thinking of a file for each stream along with a separate file for the timestamps (the format isn't super important; it could even be as basic as writing the timestamp as bytes to the file).
With that information you can freely show either stream separately, or show the output combined by walking through the timestamps in order. It also sets us up for showing timestamps in the output, which I think would be a very nice feature.
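The reconstruction step described above could be a plain two-way merge over the per-stream timestamp sequences. A rough sketch, with all names and data made up for illustration:

```rust
// Merge two per-stream logs, each already in chronological order,
// into one combined sequence ordered by read timestamp.
fn merge_logs<'a>(
    stdout: &[(u64, &'a str)],
    stderr: &[(u64, &'a str)],
) -> Vec<(u64, &'static str, &'a str)> {
    let (mut i, mut j) = (0, 0);
    let mut merged = Vec::with_capacity(stdout.len() + stderr.len());
    while i < stdout.len() || j < stderr.len() {
        // Take whichever stream's next line has the smaller timestamp.
        let take_stdout = j >= stderr.len()
            || (i < stdout.len() && stdout[i].0 <= stderr[j].0);
        if take_stdout {
            merged.push((stdout[i].0, "stdout", stdout[i].1));
            i += 1;
        } else {
            merged.push((stderr[j].0, "stderr", stderr[j].1));
            j += 1;
        }
    }
    merged
}

fn main() {
    let stdout = [(4, "GOOD a"), (6, "GOOD b")];
    let stderr = [(1, "ERR x"), (5, "ERR y")];
    for (ts, tag, line) in merge_logs(&stdout, &stderr) {
        println!("{ts} {tag} {line}");
    }
}
```

Since each per-stream file is already sorted, the merge is linear and can be done while streaming, without loading both files fully into memory.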
It is certainly worth testing how it handles a large number of tasks that produce a lot of output (notably, the worst case would be something like yes, which outputs very short lines). I don't think it's possible to avoid threading entirely, since the alternative would be to constantly poll whether there is output in either stream, which would consume a lot of unnecessary CPU power even for jobs that are producing little to no output.
There could also be issues if people ran commands that produce a lot of content on a single line, like head -c 10G /dev/zero, which would consume a lot of RAM while resizing the buffer to hold the whole line. But I think it's reasonable to say that someone wouldn't intentionally run a command like that. If it is a possibility, then maybe we could have an option to disable timestamps for some jobs?
Hm. The problem with doing it that way is that stdout/stderr doesn't always get flushed in one go. Python scripts in particular are notorious for flushing late.
As long as one uses two different pipes/files, this could lead to a scenario where we lose temporal information. For example, let's look at a scenario where stderr logs that accumulated over the last 5 minutes get flushed at once. A short time after that the stdout logs for the last ~2 minutes get flushed.
The original logs might have looked something like this:
12:41:01,stderr, ERR stderr_output
12:41:02,stderr, ERR stderr_output
12:41:03,stderr, ERR stderr_output
12:41:04,stdout, GOOD stdout_output
12:41:06,stdout, GOOD stdout_output
12:41:20,stderr, ERR stderr_output
12:41:23,stderr, ERR stderr_output
The final output would look like this:
12:42:01,stderr, ERR stderr_output
12:42:01,stderr, ERR stderr_output
12:42:01,stderr, ERR stderr_output
12:42:01,stderr, ERR stderr_output
12:42:01,stderr, ERR stderr_output
12:42:40,stdout, GOOD stdout_output
12:42:40,stdout, GOOD stdout_output
We wouldn't only lose the correct timing of the log line (which is something we might lose anyway), but we would also lose the correct temporal ordering of the lines in question. When using a single pipe/file descriptor, we at least shouldn't lose the correct order, as everything is flushed at once.
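For comparison, the single-pipe behavior can be approximated at the shell level with 2>&1: both streams then share one file descriptor, so write order is preserved, but the origin of each line is lost. This is only a sketch of the trade-off, not how pueue actually combines the streams:

```rust
use std::process::Command;

fn main() {
    // Redirecting stderr into stdout (2>&1) makes both streams share one
    // pipe. Interleaving then matches the order of the actual writes, but
    // each line's origin (stdout vs. stderr) can no longer be recovered.
    let output = Command::new("sh")
        .args(["-c", "{ echo first; echo second >&2; echo third; } 2>&1"])
        .output()
        .expect("failed to run command");

    // All three lines arrive on stdout, in write order.
    print!("{}", String::from_utf8_lossy(&output.stdout));
}
```

This is exactly the trade-off discussed above: one shared pipe keeps ordering intact, while two separate pipes keep the stream distinction but are at the mercy of each stream's flushing behavior.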
There exists an open PR which implements similar behavior for journalctl. It would be interesting to see how they handled this problem: https://github.com/systemd/systemd/pull/6599
I might also take a look at systemd's codebase in the near future.
Those are a lot of good points. I'll have to keep digging into more stuff, but it looks like we certainly need some more investigation before we can start an implementation.
Turns out the journalctl implementation has the same problem as I explained above: https://github.com/systemd/systemd/pull/6599#issuecomment-326045899
If we do this, we should probably do it the same way journalctl planned to: this needs to be opt-in, so the user can decide whether they are willing to trade temporal consistency for ergonomic filtering by stderr/stdout.
The default should be the current behavior.
I'm closing this for now, as this will require a major internal refactoring and it's not yet clear how beneficial this will be.