
When logging to a file, no logs are written if k6 is killed

[Open] nadiamoe opened this issue 1 year ago • 6 comments

Brief summary

If k6 is configured to log to a file and the process is killed in the middle of an execution, log messages produced by console.log instructions executed before the process was killed are lost, and the log file is empty.

k6 version

k6 v0.57.0 (go1.24.0, linux/amd64)

OS

Arch Linux

Docker version and image (if applicable)

No response

Steps to reproduce the problem

Take a simple script like this:

import { sleep } from 'k6'

export default function main() {
  console.log(`always log`);
  sleep(30);
}

And run it writing logs to a file:

k6 run --log-output=file=log-then-timeout.log log-then-timeout.js

After a few seconds, kill k6:

sleep 3; pkill -KILL k6

Observe that the log file is empty:

$ wc log-then-timeout.log
0 0 0 log-then-timeout.log

Expected behaviour

I'd expect log-then-timeout.log to contain the always log string.

Actual behaviour

log-then-timeout.log is empty.

nadiamoe avatar Mar 26 '25 13:03 nadiamoe

As a side note, I am aware that a possible workaround for this would be to log to stderr and consume that instead of a file. There are a couple reasons why we don't do this at the current time which are outside of the scope of this issue.

nadiamoe avatar Mar 26 '25 13:03 nadiamoe

Hey @nadiamoe,

Do you think it can really be considered a bug? Isn't this behavior common across other applications?

I think the reason behind the behavior you're observing is the nature of buffered I/O: logs are buffered, and the abrupt process termination prevents them from being flushed. By contrast, when you SIGINT the process, it flushes the buffer before stopping.

joanlopez avatar May 26 '25 08:05 joanlopez

Quick update: I did a quick search, and it looks like we're delegating the buffered I/O operations to bufio.Writer. Considering that, and that we're already flushing it when the k6 process is stopped (rather than abruptly interrupted), I'd say this shouldn't be categorized as a bug.

The only alternative I can see, going further, is to set a timer that flushes the buffer from time to time, but in that case I'd consider this more of a feature request than something that isn't behaving as expected by default.

That said, and before raising a discussion internally: how feasible would it be for you @nadiamoe to use either of the alternatives, i.e. using SIGINT instead of SIGKILL, or relying on stderr, as you suggested? 🤔 I guess your needs and their relevance would be key factors when deciding whether or not to implement that kind of periodic flush.

Thanks! 🙌🏻

joanlopez avatar May 26 '25 09:05 joanlopez

Hey @joanlopez,

I'm actually using SIGINT already, and giving k6 a grace period, now 3 seconds, to terminate before sending SIGKILL. Three seconds appears to be enough, but the previous value of 1 second was not. What worries me is that this value is entirely arbitrary: on a different generation of machines, the required time may be different, and we have no way to test for that.

For this reason I was suggesting an unbuffered (or less buffered, if you will) approach, as it degrades more gracefully: if those 3 seconds ever become insufficient again, then at least we will have some logs in the output.

nadiamoe avatar Jun 06 '25 10:06 nadiamoe

For this reason I was suggesting an unbuffered (or less buffered, if you will) approach, as it degrades more gracefully: if those 3 seconds ever become insufficient again, then at least we will have some logs in the output.

That's something that could be guaranteed with time-based (instead of the current size-based) flushes, which is what I suggested above.

Thanks for sharing the context! 🙇🏻 Let me ping @mstoykov for now, as I think he might have another take on this.

joanlopez avatar Jun 09 '25 14:06 joanlopez

Flushing every second (or whatever other value we agree on) sounds like a good idea to me!

nadiamoe avatar Jun 20 '25 15:06 nadiamoe

Hey @nadiamoe,

I'm wondering,

I'm actually using SIGINT already, and giving k6 a grace period, now 3 seconds, to terminate before sending SIGKILL. Three seconds appears to be enough, but the previous value of 1 second was not. What worries me is that this value is entirely arbitrary: on a different generation of machines, the required time may be different, and we have no way to test for that.

Have you identified any particular scenario and/or condition where k6 doesn't react as expected to SIGINT? I guess what @mstoykov preferred, last time I discussed this with him, is to solve the actual issue, if any (like something that prevents k6 from gracefully stopping), instead of patching it with a workaround (e.g. using another flushing strategy).

Thanks!

joanlopez avatar Jun 25 '25 13:06 joanlopez

Did some archaeology to find the original report, on Slack here. The increased time required to exit seems to be related to the browser module.

Narrowing this down to a simpler reproduction case, this script seems to reproduce the issue somewhat:

import { browser } from 'k6/browser';
import { check } from 'https://jslib.k6.io/k6-utils/1.5.0/index.js';

export const options = {
  scenarios: {
    ui: {
      executor: 'shared-iterations',
      options: {
        browser: {
          type: 'chromium',
        },
      },
    },
  },
  thresholds: {
    checks: ['rate==1.0'],
  },
};

export default async function () {
  const context = await browser.newContext();
  const page = await context.newPage();
  console.log(`always log`)

  try {
    await page.goto("https://test.k6.io/my_messages.php");

    await page.locator('input[name="i-do-not-exist"]').type("admin");
  } finally {
    await page.close();
  }
}

If you run this locally (k6 run ...) with a local browser and send SIGTERM (or SIGINT) while the test is waiting for the unmeetable locator condition, the k6 process terminates quickly. If I run it with a remote browser, it takes a smidge longer to finish, close to 1 second on my local machine. I can imagine that's the issue: on less powerful runners, this may have gone above the previous grace period of 1 second and caused the issue.

I guess what @mstoykov preferred, last time I discussed this with him, is to solve the actual issue, if any (like something that prevents k6 from gracefully stopping), instead of patching it with a workaround (e.g. using another flushing strategy).

I'd like to take the chance to contest this a bit. I agree on digging deeper into the issue and trying to fix it, but I think the original report here (lack of flushing) is still worthy of consideration: it would work as a failsafe against similar cases if they happen in the future, or against other corner cases where k6 has no control, like being OOM-killed (which can happen, even if rarely, in the SM context).

nadiamoe avatar Jul 03 '25 12:07 nadiamoe

By chance I happened to encounter another scenario where k6 won't exit promptly after receiving SIGTERM (or SIGINT), which can be reproduced as follows:

  1. Create a browser test that does something on a page
  2. While the test is running, close the browser window (or otherwise kill the browser process)
  3. ^C the k6 process
  4. k6 will take a long time (sometimes above 15 seconds) to exit.

Upon exit, k6 logs:

ERRO[0013] closing the browser: context deadline exceeded  category="Browser:Close" elapsed="0 ms" source=browser
ERRO[0013] test run was aborted because k6 received a 'interrupt' signal

I think this is supporting evidence that there are scenarios in which k6 will delay its exit by hard-to-predict amounts, especially when the browser is involved. Fixing all those scenarios would be good (if possible), but I think a mechanism to fail gracefully and at least log something when this happens would be particularly beneficial here.

nadiamoe avatar Jul 04 '25 12:07 nadiamoe

[...] If I run it with a remote browser, then it takes a smidge longer to finish, close to 1 second on my local machine. I can imagine that's the issue: On the less powerful runners, this may have gone above the previous grace time of 1 second and caused the issue.

Yeah, I guess here it's not only about how powerful the runners are, but also the unreliability of the network.

By chance I happened to encounter another scenario where k6 won't exit promptly after receiving SIGTERM (or SIGINT), which can be reproduced as follows:

  1. Create a browser test that does something on a page
  2. While the test is running, close the browser window (or otherwise kill the browser process)
  3. ^C the k6 process
  4. k6 will take a long time (sometimes above 15 seconds) to exit.

Upon exit, k6 logs:

ERRO[0013] closing the browser: context deadline exceeded  category="Browser:Close" elapsed="0 ms" source=browser
ERRO[0013] test run was aborted because k6 received a 'interrupt' signal

I tried to follow these steps, with a script very similar to the one you shared on the comment right above, and so far I haven't been able to reproduce it.

I tried it with both:

  • Running with go run ... and with the compiled binary (k6 run ...)
  • Using sleep, a locator that doesn't exist, etc.; in all those cases, k6 finishes immediately.

I'd like to take the chance to contest this a bit. I agree on digging deeper on the issue and trying to fix it, but I think the original report here (lack of flushing) is still worthy of consideration: [...]

Arguably, flushing is there; it's just not enough in the example described above: the current buffer size is Go's default of 4096 bytes. However, I'm not sure how often that actually happens in practice.

Another approach here could be reducing that buffer, but again, as with periodic flushes, it may have some implications.

joanlopez avatar Jul 08 '25 13:07 joanlopez

Hey folks 👋🏻

TL;DR: Why is SIGKILL needed at all if SIGINT is sent first? That could be a bug. For forced kills, the only reliable fix is most likely to add periodic, time-based flushing for logs, rather than just changing buffer sizes (though that might help).

As this conversation has outlined, there likely isn't a single silver bullet, especially given the variety of OSes and environments where k6 runs.

I have a couple of clarifying questions based on the thread that might help us isolate the problem:

1. On Graceful Shutdown (SIGINT vs SIGKILL): The original report mentions:

I'm actually using SIGINT already, and giving k6 a grace period, now 3 seconds, to terminate before sending SIGKILL.

This implies that k6 isn't always terminating within the grace period after receiving a SIGINT. My understanding is that a well-behaved process should handle SIGINT (or SIGTERM) and exit gracefully. If k6 is failing to do so, that could be a bug in itself that we should investigate. Is there a known reason why a SIGKILL is necessary after a SIGINT? If there's a hard time constraint for shutdown, that's a requirement we should try to meet.

2. On Data Loss by Design (SIGKILL):

The fact that logs are lost with SIGKILL is, as you've noted, by design on Unix-like systems. The kernel's job is to terminate the process immediately, which means any unflushed I/O buffers in the application's memory are discarded. This is expected behavior when protecting the system from an OOM event or killing a completely unresponsive process.

Moving forward:

Given that we clarify the above ☝🏻, my general sentiment follows @joanlopez's and the discussion at large:

Primary Solution: Dedicated Log Flusher. I agree with the idea of a periodic flusher. A dedicated mechanism that flushes the log file every N seconds would guarantee that we only lose a very small, predictable window of data. I've also heard this need for more robust flushing from other teams in the past, so it seems like a generally valuable feature. Especially considering we're operating k6 in more and more products, projects and environments.

Secondary Feature: Configurable Buffer Size. While a smaller buffer can lead to more frequent flushes, its effectiveness is also tied to log volume, so the fix might not be sufficient depending on the scenarios we're looking at. I still think it's worth looking into.

oleiade avatar Jul 08 '25 15:07 oleiade

Is there a known reason why a SIGKILL is necessary after a SIGINT? If there's a hard time constraint for shutdown, that's a requirement we should try to meet.

@oleiade SIGTERM-wait-SIGKILL is standard process-management practice. It is common for processes not to handle SIGTERM appropriately and/or in a timely manner, whether due to bugs, unexpected situations, or disagreements about what "timely" means. Sending SIGKILL after a while guarantees that no processes are left in the system, potentially hanging forever.

In the k6 case in particular, I agree that we should fix any of those situations as we encounter them. However, due to the complexity introduced by browsers, and remote browsers in particular, I think these fringe cases are somewhat likely to happen. Instead of assuming we will catch all possible fringe scenarios where k6 may not exit within 3 seconds of SIGTERM, I think having a safety net, such as a time-based flusher, is a very sensible approach, especially considering that, at least in the SM context, this failure mode leaves users in the dark with no indication of what failed.

The OOM scenario is an orthogonal situation that the time-based flusher could also help solve.

nadiamoe avatar Jul 08 '25 16:07 nadiamoe

I feel like there are two issues here. The first seems to be generic to k6 (if you follow the original description with the basic k6 test) and could be solved by allowing the user to change the characteristics of the buffer: either with a flush timer or a smaller buffer (or both). I'm not familiar with the mechanism by which logs and other outputs are flushed, but couldn't that be prioritised during a SIGINT? Would that help?

The other problem is with the browser module itself. I believe it would be a good idea for us to revisit how the browser handles these cases (SIGINT and SIGKILL). At the moment it's quite complicated, since there are many moving parts (the event system, the context in the VU, the Chromium process itself, and the WS connection, which all need to be orchestrated). This is evident if you take a look at the close method 🤯.

ankur22 avatar Jul 09 '25 08:07 ankur22

I do agree that it is better to fix the issues that are leading to this. Looking at the issue, I would recommend that we flush on receiving SIGTERM, so we also have logs from just before potentially having to send SIGKILL. Unfortunately the code is quite separate, and this will likely require a considerable refactor. Or maybe this is a good thing, as these are pieces of code that are not coupled at all 🤷 The Loki log output already has a ticker, but I would argue that there it was more of a requirement to be able to have a rate limiter that is (way too) efficient (for what ended up being needed).

I guess at this point I have three observations:

  1. There are still bugs that need to be handled, especially around k6/browser.
  2. I would expect the Loki log output to be the better fit for your use case either way, especially as it doesn't involve the filesystem :)
  3. We probably still should extend the file hook to flush every once in a while. I would expect it will be fine to reset the timer every time we "write" a log line, so there will be fewer flushes. And potentially increasing the buffer 🤔, maybe not. I would also probably leave this non-configurable for a first version, just FYI.

Both 2 and 3 won't fix the cases where the bugs from 1 kill the process too fast, so that is likely still the thing that most deserves looking into.

mstoykov avatar Jul 09 '25 08:07 mstoykov

It seems like there is agreement on at least having some timer to flush the buffer in the file logger hook. So let's try to work on that, I guess.

@nadiamoe would you like to contribute a PR?

mstoykov avatar Jul 14 '25 13:07 mstoykov