
Add startup logs to build log UI

Open ammario opened this issue 3 years ago • 2 comments

I have a complex startup script:

resource "coder_agent" "dev" {
  arch           = "amd64"
  os             = "linux"
  startup_script = <<EOF
    #!/bin/sh
    # install and start code-server
    curl -fsSL https://code-server.dev/install.sh | sh
    code-server --auth none --port 13337
    sudo service docker start
    if [ -f ~/personalize ]; then ~/personalize 2>&1 | tee ~/.personalize.log; fi
    EOF
}

And it's not clear to me where I could see its log output. It would also be nice to not have to redirect command execution to logs manually.

The natural place for me to find this is the build log on the dashboard, but I also expect some file in my workspace to contain the log.

While we're at it, we should include the agent logs (/var/log/coder-agent.log) in the dashboard build log.

ammario avatar Jul 13 '22 00:07 ammario

So, the log is in /var/log/coder-startup-script.log. I had to read the documentation to find that, so the issue still stands as a major usability improvement.

ammario avatar Jul 31 '22 22:07 ammario

We probably want to rotate on each workspace start. Or we can save that for lifecycle hooks.

ketang avatar Jul 31 '22 22:07 ketang

Let's make sure we use a PrefixSuffixWriter (First N bytes + Last N bytes) to bound how much storage these logs can take. We also may need the same for the provisioner logs, but I haven't investigated the code.
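A writer like that might be sketched as follows. This is a hypothetical illustration of the "first N bytes + last N bytes" idea, not Coder's actual implementation; the type name and fields are made up here.

```go
package main

import "fmt"

// prefixSuffixWriter keeps the first max bytes and the last max bytes of
// everything written, so a runaway startup script cannot balloon storage.
// Hypothetical sketch of the idea described above.
type prefixSuffixWriter struct {
	max       int
	prefix    []byte
	suffix    []byte // rolling window over the most recent overflow
	truncated int64  // bytes dropped between prefix and suffix
}

func (w *prefixSuffixWriter) Write(p []byte) (int, error) {
	n := len(p)
	// Fill the prefix first.
	if room := w.max - len(w.prefix); room > 0 {
		take := room
		if take > len(p) {
			take = len(p)
		}
		w.prefix = append(w.prefix, p[:take]...)
		p = p[take:]
	}
	// Everything past the prefix rolls through the suffix window; a real
	// implementation would use a fixed ring buffer to avoid reslicing.
	w.suffix = append(w.suffix, p...)
	if over := len(w.suffix) - w.max; over > 0 {
		w.truncated += int64(over)
		w.suffix = w.suffix[over:]
	}
	return n, nil
}

func main() {
	w := &prefixSuffixWriter{max: 4}
	w.Write([]byte("hello world"))
	fmt.Printf("%s...%s (%d bytes dropped)\n", w.prefix, w.suffix, w.truncated)
	// prints: hell...orld (3 bytes dropped)
}
```

Storage is then bounded at 2×max per log regardless of how much the script emits, with a counter recording how much was dropped in the middle.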

ammario avatar Oct 17 '22 21:10 ammario

An implementation challenge is startup_script can block for an indeterminate amount of time. Some of our examples spin off code-server asynchronously, but others block.

ammario avatar Oct 18 '22 18:10 ammario

> Let's make sure we use a PrefixSuffixWriter (First N bytes + Last N bytes) to bound how much storage these logs can take. We also may need the same for the provisioner logs, but I haven't investigated the code.

The only issue is if we're expecting some sort of prompt to be written to the log (something I know @kylecarbs mentioned when we discussed this Monday) then it might get truncated by the PrefixSuffixWriter.

We could potentially only keep the startup logs of the most recent build and maybe also use a PrefixSuffixWriter with a 1MB cap to prevent any maliciousness. That would more than likely solve 99% of cases and prevent us from ballooning over time.

The startup logs of previous builds are not likely to be useful.

sreya avatar Oct 19 '22 20:10 sreya

Some kind of limiter is absolutely necessary to prevent DoS, e.g. cat /dev/urandom.

> We could potentially only keep the startup logs of the most recent build and maybe also use a PrefixSuffixWriter with a 1MB cap to prevent any maliciousness. That would more than likely solve 99% of cases and prevent us from ballooning over time.

This sounds good to me.

ammario avatar Oct 20 '22 02:10 ammario

This issue is becoming stale. In order to keep the tracker readable and actionable, I'm going to close this issue in 7 days if there isn't more activity.

github-actions[bot] avatar Jan 04 '23 00:01 github-actions[bot]

Since startup logs can come from multiple agents, I think we can add some sort of tabs for them.

[Screenshot: mockup of per-agent tabs for startup logs in the build log UI, Jan 5 2023]

BrunoQuaresma avatar Jan 05 '23 19:01 BrunoQuaresma

What would the workspace status be if startup scripts are still running?

kylecarbs avatar Jan 06 '23 16:01 kylecarbs

I think MVP here should be just putting the logs in the UI, and then in a follow-up feature we address waiting for startup scripts to finish. I believe some templates have startup scripts that block, e.g. they run code-server. Thus, a special workspace status for startup scripts could be a breaking / confusing change for existing deployments.

ammario avatar Jan 06 '23 17:01 ammario

This is a good question. I don't know; I think it could be either running or pending. However, I don't think this influences showing the startup logs on the build page too much.

BrunoQuaresma avatar Jan 06 '23 17:01 BrunoQuaresma

The discussion in #4677 is relevant to workspace status during startup (/shutdown). As for blocking in the startup script, I think that should be avoided/discouraged, and we should introduce something new, like coder_agent_service, to start programs. This would allow more complex functionality like restarting services after X failed health checks and whatnot. (Perhaps a topic for a separate issue, though.)

I agree with @ammario that we limit this scope to simply storing/streaming the log. I suggest we apply compression (e.g. zstd via klauspost/compress) to the logs, but do we compress before or after limiting output to 1MB? On the one hand, compressing first means the output usually isn't truncated (in failure scenarios there could be a lot of output to sift through). On the other hand, compression plus a for loop echoing the same line could hide an enormous uncompressed log under the cap.

mafredri avatar Jan 10 '23 09:01 mafredri

If we compress the log on capture, wouldn't that make it difficult / impossible to stream the log piecewise in the future?

ammario avatar Jan 10 '23 17:01 ammario

@ammario I understand your concern. I think it's doable, but we can punt it for now. It'll add some complexity and raises the question of whether we should just utilize postgres compression instead (for the storage layer). My main thought in suggesting it was to allow larger logs to be stored, but reduced bandwidth could be a nice bonus.

On that topic, it's possible to use zstd as a transparent compression layer via SaveTheRbtz/zstd-seekable-format-go (I've only tested that it works; no other experience with seekable). Assuming we can trust that it writes one frame per line (observed behavior), this would map pretty well to inserts in the database.

One simpler solution we may employ is to compact logs once they're done streaming and store them as a single bytea blob. Unfortunately, lib/pq would require us to fetch the entire blob, so we can't do streaming decompression for memory savings.

We could also just do compaction for the "historical" jobs, which might allow us to keep them around. I personally think they'd be useful for debugging purposes by both admins and users. (I.e. what's different today, this worked yesterday!)
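The compaction idea above might look roughly like this. Again gzip stands in for zstd, and the function names are hypothetical; the point is that once streaming ends, the per-line rows collapse into one compressed blob for a bytea column, and reading it back requires the whole blob in memory (matching the lib/pq limitation).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

// compactLogs joins finished log rows and compresses them into a single
// blob suitable for a bytea column. Hypothetical sketch of compacting
// after streaming ends; gzip stands in for zstd.
func compactLogs(rows []string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := io.Copy(zw, strings.NewReader(strings.Join(rows, "\n"))); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// expandLogs reverses compactLogs. Note the whole blob is decompressed in
// memory; no streaming decompression here.
func expandLogs(blob []byte) ([]string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	return strings.Split(string(raw), "\n"), nil
}

func main() {
	blob, _ := compactLogs([]string{"installing code-server", "starting docker"})
	rows, _ := expandLogs(blob)
	fmt.Println(len(blob), rows[1])
}
```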

mafredri avatar Jan 10 '23 19:01 mafredri

Right now logs older than 30 days are deleted automatically, so I don't think we have to worry about storage. Even a workspace hitting the 1MB cap on every daily build would generate only ~30MB of logs per month.

And good points on the lack of seekability in the pq driver. That is unfortunate.

ammario avatar Jan 11 '23 17:01 ammario

Removing myself from this until we have the BE done. It makes it easier for me to see what issues I should focus on.

BrunoQuaresma avatar Jan 31 '23 17:01 BrunoQuaresma

@ammario I've got a few questions for implementing this functionality.

(One thing worth noting also is that "startup scripts that end" has been implemented, and our examples adjusted, see #5749.)

  1. Do we want a push or pull architecture, or both?
  2. Regarding proposed PrefixSuffixWriter, do we send the tail once startup script has ended or do we try to stream the tail continuously?

To expand on 1.

Push architecture pros/cons:

  • + Agent logs will always be available, even if the workspace is shut down
  • - N number of agents will simultaneously be sending logs to the server
  • - Server needs to monitor load and tell agents to back off if it gets too rough
  • - Causes load even when logs are not needed
  • - If we want to stream the "tail" part of a "limited" log, we have to either send & delete or wait until e.g. startup script has completed

Pull architecture pros/cons:

  • + No additional load except when someone is viewing the logs (e.g. workspace build logs, trying to SSH in before workspace is ready, etc.)
  • + Since we're always streaming logs we have more freedom with how we handle limits and streaming the tail part of the log
  • + We can do seeking to reveal any part of the log, if needed
  • - Logs are ephemeral and not available after workspace stop
  • + Can be extended to scrape logs by server at certain points

If we go with push, do we go with (d)RPC to reduce load?

If we go with pull, would this be a suitable addition to the "statistics server" on the agent, @kylecarbs? We'd probably rename it, though. The reason I ask is that I worry about allocating too many listen ports; we've not reserved that many.

And re: 2., assuming we go with push, I imagine an implementation where we stream and store the first 1MB (minus 64KB) outright, then keep the remaining tail in a circular buffer of, say, 64KB that is committed once the startup script exits. Does this match what you had in mind, @ammario?

The downside of this is that the user will not see what's currently happening in the log (an edge case, since they'd have exceeded the ~1MB limit).

The alternative of streaming the tail part doesn't really prevent DoS.


As for /tmp/coder-agent.log, we might want to break that out into a separate issue. For instance, if we go with push, we need to decide which parts of the logs are sent/kept: we could consider all logs from start until workspace lifecycle = ready to be included in the build logs. Likewise, once we add shutdown, all logs from lifecycle = shutting_down -> stopped could also be added to the build log.

None of that would allow inspecting the current log, though; that'd still fall under the pull category. The reason we need to break it up into parts is that my workspace log, after running for over a day, is 3.8MB.

mafredri avatar Feb 01 '23 13:02 mafredri

I think the pull architecture is the way to go:

  • We can avoid PrefixSuffixWriter, always giving the user the full log
  • We can seek, efficiently serving massive logs to the frontend
  • The code is significantly simpler

I also think we should only have a single listen server on the agent, as is typical of most daemons.

ammario avatar Feb 01 '23 17:02 ammario

Closing as completed by #6558.

bpmct avatar Mar 28 '23 13:03 bpmct