
Filter log output by stdout/stderr, and time

Open Nukesor opened this issue 3 years ago • 6 comments

The PR https://github.com/Nukesor/pueue/pull/286 combined the two streams of a task's output back into a single output.

It would now be awesome to get the possibility to filter for either stderr/stdout and maybe even time.

For this to work, we need some way to combine the two output streams without losing temporal/chronological information. Furthermore, the resulting output should use a structured data format that supports streaming.

The requirements for this feature are:

  1. Task log output should be in the correct chronological order.
  2. Log output should be error-resistant (partial writes shouldn't break the whole log).
  3. We have to be able to differentiate between stderr and stdout output, i.e. we have to be able to sort by it.
  4. Remote streaming should still be performant (the data format should support streaming).
  5. The data format should support metadata such as timestamps.

We should avoid loading whole files into memory. If we do have to load them, we should work with streams and use in-memory compression wherever possible. Log files can be huge, and we have already received reports of crashing PCs due to high RAM usage.
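To make the requirements concrete, here is a minimal sketch of one candidate data format (an assumption, not a settled design): newline-delimited JSON, where each flushed chunk becomes one self-contained record carrying its stream tag and a timestamp. The struct and field names are hypothetical, and string escaping is omitted for brevity; a real implementation would use serde_json.

```rust
// Hypothetical NDJSON record layout: one line per flushed chunk.
// A partially written trailing line can simply be skipped on read,
// which covers the "error-resistant" requirement.

#[derive(Debug, PartialEq)]
enum Stream {
    Stdout,
    Stderr,
}

struct LogRecord {
    /// Milliseconds since the task started (placeholder for a real timestamp).
    ts_ms: u64,
    stream: Stream,
    data: String,
}

impl LogRecord {
    /// Serialize as one line, so the file can be read and streamed line by line.
    /// NOTE: escaping of `data` is omitted here; real code must escape it.
    fn to_line(&self) -> String {
        let stream = match self.stream {
            Stream::Stdout => "stdout",
            Stream::Stderr => "stderr",
        };
        format!(
            "{{\"ts_ms\":{},\"stream\":\"{stream}\",\"data\":\"{}\"}}",
            self.ts_ms, self.data
        )
    }
}
```

Because each record is exactly one line, the format supports streaming naturally: a reader can filter by the `stream` field without ever holding the whole file in memory.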

Help, suggestions or even PRs are very welcome!

Originating issue: #228

Nukesor avatar Sep 12 '21 07:09 Nukesor

> Help, suggestions or even PRs are very welcome!

I would be open to contributing. It will take me a little while to get familiar with the codebase, but I can send an update in a couple of days if you're fine with that.

CosmicHorrorDev avatar Jan 18 '22 13:01 CosmicHorrorDev

Feel free to tinker around a bit :)

I would love to hear about your approach on this problem.

Nukesor avatar Jan 19 '22 09:01 Nukesor

Once you have a vision of how to implement this, it would be awesome to hear about it.

Especially how you plan to combine the two streams into a single stream while losing as little (ideally no) temporal information as possible, and what kind of data format you have in mind, since it has to support streaming.

I stumbled upon os_pipe while researching how to merge pipes. Not sure how useful it is, though.

In my mind, such an implementation would probably need a thread per pipe per task. An alternative to raw threads would be an async implementation on top, so we don't spawn a huge number of actual threads but let the tokio scheduler handle the rest.

The model of one thread/async task per pipe might get problematic, though, if a user runs a hundred tasks in parallel and each task produces a huge amount of logs. What happens if the log threads/async tasks can't keep up with the tasks' output? It would also be interesting to investigate how this behaves when the system is under high load in general.

The ideal solution would obviously not need threading at all. I'm not sure whether that can easily be achieved, though.
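As a rough illustration of the thread-per-pipe model discussed above (a sketch under assumed names, not pueue's actual implementation): each reader thread tags lines with its stream name and an arrival timestamp and pushes them through a channel, so the merged order is decided at a single receiver. In-memory byte slices stand in for a child process's stdout/stderr pipes.

```rust
use std::io::{BufRead, BufReader, Read};
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Instant;

/// Spawn one reader thread per pipe. Each line is tagged with the stream
/// name and a timestamp taken on arrival (flushing delays still apply).
fn spawn_reader<R: Read + Send + 'static>(
    reader: R,
    stream: &'static str,
    start: Instant,
    tx: Sender<(u128, &'static str, String)>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for line in BufReader::new(reader).lines() {
            let Ok(line) = line else { break };
            let _ = tx.send((start.elapsed().as_millis(), stream, line));
        }
    })
}

/// Demo: merge two stand-in pipes into one channel and collect the result.
fn merged_lines() -> Vec<(u128, &'static str, String)> {
    let start = Instant::now();
    let (tx, rx) = channel();
    let out = spawn_reader(&b"GOOD stdout_output\n"[..], "stdout", start, tx.clone());
    let err = spawn_reader(&b"ERR stderr_output\n"[..], "stderr", start, tx);
    out.join().unwrap();
    err.join().unwrap();
    // All senders are dropped once both threads finish, so this terminates.
    rx.into_iter().collect()
}
```

Note that an unbounded channel provides no backpressure; with a hundred chatty tasks, a bounded `sync_channel` would at least make the slow-consumer behavior explicit instead of growing memory.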

Nukesor avatar Jan 22 '22 21:01 Nukesor

I haven't had too much spare time to start digging into this yet, but my initial plan was to have a separate thread for stderr and stdout like you mentioned. There may be a more proper technique, but it seemed like a simple way to do blocking reads and avoid polling continuously. From there I was planning to read each of the streams line by line and output each line along with a timestamp for when it was read. I was thinking of a file for each stream along with a separate file for the timestamps (the format isn't super important; it could even be as basic as writing the timestamps as bytes to the file).

With that information you can freely show either stream separately, or show the combined output by walking through the timestamps in order. It also sets us up for showing timestamps alongside the output, which I think would be a very nice feature.
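The merge step of the plan above (separate per-stream data plus timestamps) could look like the following two-pointer merge. The function name and in-memory representation are hypothetical, and it assumes each stream's records are already in chronological order, as they would be when appended by a single reader thread.

```rust
/// Merge two chronologically ordered (timestamp, line) sequences into one
/// combined, chronologically ordered view, tagging each line with its stream.
fn merge_streams(
    stdout: &[(u64, &str)],
    stderr: &[(u64, &str)],
) -> Vec<(u64, &'static str, String)> {
    let (mut i, mut j) = (0, 0);
    let mut merged = Vec::with_capacity(stdout.len() + stderr.len());
    while i < stdout.len() || j < stderr.len() {
        // Take from stdout when its next timestamp is earlier (ties go to stdout).
        let take_stdout =
            j >= stderr.len() || (i < stdout.len() && stdout[i].0 <= stderr[j].0);
        if take_stdout {
            merged.push((stdout[i].0, "stdout", stdout[i].1.to_string()));
            i += 1;
        } else {
            merged.push((stderr[j].0, "stderr", stderr[j].1.to_string()));
            j += 1;
        }
    }
    merged
}
```

Filtering for a single stream is then trivial (skip the merge and read one file), while the combined view costs one linear pass.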

It is certainly worth testing how this handles a large number of tasks that produce a lot of output (notably, the worst case would be something like yes, which outputs very short lines). I don't think it's possible to avoid threading entirely, since the alternative would be to constantly poll both streams for output, which would consume a lot of unnecessary CPU even for jobs producing little to no output.

There could also be issues if people ran commands that produce a huge amount of content without newlines, like head -c 10G /dev/zero, which would consume a lot of RAM while resizing the buffer to hold the whole "line". Then again, I think it's reasonable to say that someone wouldn't intentionally run a command like that. If it is a real possibility, then maybe we could have an option to disable timestamps for some jobs?
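One hypothetical alternative to disabling timestamps (my own suggestion, not something proposed in the thread) would be to cap the per-line buffer: read at most `cap` bytes of each line and discard the rest, so memory stays bounded even for pathological input. A sketch using only the `BufRead` `fill_buf`/`consume` contract:

```rust
use std::io::{BufRead, BufReader};

/// Read one line, keeping at most `cap` bytes of it; any excess bytes up to
/// the newline are consumed and discarded, so the buffer stays bounded.
/// Returns Ok(None) on EOF.
fn read_line_capped<R: BufRead>(reader: &mut R, cap: usize) -> std::io::Result<Option<String>> {
    let mut buf = Vec::new();
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            // EOF: return what we collected, or None if nothing was read.
            return Ok(if buf.is_empty() {
                None
            } else {
                Some(String::from_utf8_lossy(&buf).into_owned())
            });
        }
        let newline = chunk.iter().position(|&b| b == b'\n');
        let upto = newline.unwrap_or(chunk.len());
        // Only keep as many bytes as the cap still allows.
        let room = cap.saturating_sub(buf.len());
        buf.extend_from_slice(&chunk[..upto.min(room)]);
        let consumed = match newline {
            Some(pos) => pos + 1, // also consume the newline itself
            None => chunk.len(),
        };
        reader.consume(consumed);
        if newline.is_some() {
            return Ok(Some(String::from_utf8_lossy(&buf).into_owned()));
        }
    }
}
```

This trades completeness of one pathological line for a hard memory bound, which seems closer in spirit to the "don't load whole files into memory" requirement than dropping timestamps entirely.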

CosmicHorrorDev avatar Jan 23 '22 04:01 CosmicHorrorDev

Hm. One problem with doing it that way is that stdout/stderr don't always get flushed immediately. Python scripts in particular are notorious for flushing late.

As long as we use two different pipes/files, this can lead to scenarios where we lose temporal information. For example, consider a scenario where stderr logs that accumulated over the last 5 minutes get flushed at once, and shortly after that, the stdout logs for the last ~2 minutes get flushed.

The original logs might have looked something like this:

12:41:01, stderr, ERR stderr_output
12:41:02, stderr, ERR stderr_output
12:41:03, stderr, ERR stderr_output
12:41:04, stdout, GOOD stdout_output
12:41:06, stdout, GOOD stdout_output
12:41:20, stderr, ERR stderr_output
12:41:23, stderr, ERR stderr_output

The recorded output, however, would look like this:

12:42:01, stderr, ERR stderr_output
12:42:01, stderr, ERR stderr_output
12:42:01, stderr, ERR stderr_output
12:42:01, stderr, ERR stderr_output
12:42:01, stderr, ERR stderr_output
12:42:40, stdout, GOOD stdout_output
12:42:40, stdout, GOOD stdout_output

We wouldn't only lose the correct timing of each log line (which we might lose anyway), but also the correct temporal ordering of the lines in question. When using a single pipe/file descriptor, we at least shouldn't lose the correct order, as everything is flushed into the same descriptor.

There is an open PR that implements similar behavior for journalctl: https://github.com/systemd/systemd/pull/6599. It would be interesting to see how they handled this problem. I might also take a look at systemd's codebase in the near future.

Nukesor avatar Jan 23 '22 15:01 Nukesor

Those are a lot of good points. I'll have to keep digging, but it looks like we certainly need more investigation before we can start an implementation.

CosmicHorrorDev avatar Jan 25 '22 03:01 CosmicHorrorDev

Turns out the journalctl implementation has the same problem as I explained above: https://github.com/systemd/systemd/pull/6599#issuecomment-326045899

If we do this, we should probably do it the same way journalctl planned to, i.e. make it opt-in, so the user can decide whether they're willing to trade temporal consistency for ergonomic filtering by stderr/stdout.

The default should be the current behavior.

Nukesor avatar Oct 02 '22 12:10 Nukesor

I'm closing this for now, as it would require a major internal refactoring and it's not yet clear how beneficial it will be.

Nukesor avatar Nov 14 '22 13:11 Nukesor