Add startup logs to build log UI
I have a complex startup script:
```hcl
resource "coder_agent" "dev" {
  arch = "amd64"
  os   = "linux"

  startup_script = <<EOF
#!/bin/sh
# Install and start code-server.
curl -fsSL https://code-server.dev/install.sh | sh
code-server --auth none --port 13337
sudo service docker start
if [ -f ~/personalize ]; then
  ~/personalize 2>&1 | tee ~/.personalize.log
fi
EOF
}
```
And it's not clear to me where I could see its log output. It would also be nice to not have to redirect command execution to logs manually.
The natural place for me to find this is the build log on the dashboard, but I also expect some file in my workspace to contain the log.
While we're at it, we should include the agent logs (/var/log/coder-agent.log) in the dashboard build log.
So, the log is in /var/log/coder-startup-script.log. I had to read the documentation to find that, so the issue still stands as a major usability improvement.
We probably want to rotate on each workspace start. Or we can save that for lifecycle hooks.
Let's make sure we use a PrefixSuffixWriter (First N bytes + Last N bytes) to bound how much storage these logs can take. We also may need the same for the provisioner logs, but I haven't investigated the code.
An implementation challenge is startup_script can block for an indeterminate amount of time. Some of our examples spin off code-server asynchronously, but others block.
The only issue is if we're expecting some sort of prompt to be written to the log (something I know @kylecarbs mentioned when we discussed this Monday) then it might get truncated by the PrefixSuffixWriter.
We could potentially only keep the startup logs of the most recent build and maybe also use a PrefixSuffixWriter with a 1MB cap to prevent any maliciousness. That would more than likely solve 99% of cases and prevent us from ballooning over time.
The startup logs of previous builds are not likely to be useful.
Some kind of limiter is absolutely necessary to prevent DoS, e.g. `cat /dev/urandom`.
This sounds good to me.
This issue is becoming stale. In order to keep the tracker readable and actionable, I'm going to close this issue in 7 days if there isn't more activity.
Since startup logs can exist for multiple agents, I think we can add some sort of tabs for them.
What would the workspace status be if startup scripts are still running?
I think MVP here should be just putting the logs in the UI, and then in a follow-up feature we address waiting for startup scripts to finish. I believe some templates have startup scripts that block, e.g. they run code-server. Thus, a special workspace status for startup scripts could be a breaking / confusing change for existing deployments.
This is a good question. I don't know, I think it could be either running or pending. However, I don't think this influences showing the startup logs on the build page too much.
The discussion in #4677 is relevant to workspace status during startup (/shutdown). As for blocking in the startup script, I think that should be avoided/discouraged in favor of something new, like coder_agent_service, for starting programs. That would allow more complex functionality like restarting services after X failed health checks and so on. (Perhaps a topic for a separate issue, though.)
I agree with @ammario that we limit this scope to simply storing/streaming the log. I suggest we apply compression (e.g. zstd via klauspost/compress) to the logs, but do we do it before or after limiting the output to 1MB? On the one hand, pre-compression ensures the output usually isn't truncated (in failure scenarios there could be a lot of output to sift through). On the other hand, compression plus a for loop echoing the same thing could produce a very huge uncompressed log.
If we compress the log on capture, wouldn't that make it difficult / impossible to stream the log piecewise in the future?
@ammario I understand your concern. I think it's doable, but we can punt on it for now. It'll add some complexity and raises the question of whether we should just utilize Postgres compression instead (for the storage layer). My main thought with suggesting it was to allow larger logs to be stored, but reduced bandwidth could be a nice bonus.
On the topic, it's possible to use zstd as a transparent compression layer via SaveTheRbtz/zstd-seekable-format-go (I've only tested that it works, no other experience with seekable). Assuming we can trust it's one frame/line (observed behavior), this would map pretty well to inserts in the database.
One simpler solution we may employ is to compact logs once they're done streaming and store it as a single bytea blob. Unfortunately, lib/pq would require us to fetch the entire blob, so we can't do streaming decompression for memory savings.
We could also just do compaction for the "historical" jobs, which might allow us to keep them around. I personally think they'd be useful for debugging purposes by both admins and users. (I.e. what's different today, this worked yesterday!)
Right now logs older than 30 days are deleted automatically, so I don't think we have to worry about storage. A workspace hitting the 1MB cap on a build every day would generate at most 30MB of logs per month.
And good points on the lack of seekability in the pq driver. That is unfortunate.
Removing myself from this until we have the BE done. It makes it easier for me to see what issues I should focus on.
@ammario I've got a few questions for implementing this functionality.
(One thing worth noting also is that "startup scripts that end" has been implemented, and our examples adjusted, see #5749.)
1. Do we want a push or pull architecture, or both?
2. Regarding the proposed PrefixSuffixWriter, do we send the tail once the startup script has ended, or do we try to stream the tail continuously?
To expand on 1.
Push architecture pros/cons:
- Pro: agent logs will always be available, even if the workspace is shut down
- Con: N agents will simultaneously be sending logs to the server
- Con: the server needs to monitor load and tell agents to back off if it gets too rough
- Con: causes load even when logs are not needed
- Con: if we want to stream the "tail" part of a "limited" log, we have to either send & delete or wait until e.g. the startup script has completed

Pull architecture pros/cons:
- Pro: no additional load except when someone is viewing the logs (e.g. workspace build logs, trying to SSH in before the workspace is ready, etc.)
- Pro: since we're always streaming logs, we have more freedom in how we handle limits and streaming the tail part of the log
- Pro: we can do seeking to reveal any part of the log, if needed
- Con: logs are ephemeral and not available after workspace stop
- Pro: can be extended to scrape logs by server at certain points
If we go with push, do we go with (d)RPC to reduce load?
If we go with pull, would this be a suitable addition to the "statistics server" on the agent @kylecarbs? We'd probably rename it though. Reason I ask is that I worry about allocating too many listen ports, we've not reserved that many.
And re: 2., assuming we go with push, I imagine an implementation where the first 1MB (minus 64KB) is streamed and stored as it arrives, and the remaining "tail" is kept in a circular buffer of, say, 64KB that is committed once the startup script exits. Does this match what you had in mind @ammario?
The downside of this is that the user will not see what's currently happening in the log (which is an edge case since they'd exceeded the ~1MB limit).
The alternative of streaming the tail part doesn't really prevent DoS.
As for /tmp/coder-agent.log, we might want to break that out into a separate issue. For instance, if we go with push, we need to decide which parts of the logs are sent/kept. We could consider "all logs from start until workspace lifecycle = ready" to be included in the build logs. Likewise, once we add shutdown, all logs from lifecycle = shutting_down -> stopped could also be added to the build log.
None of that would allow inspecting the current log, though; that'd still fall under the pull category. The reason we need to break it up into parts is that my workspace log, after running for over a day, is 3.8MB.
I think the pull architecture is the way to go:
- We can avoid PrefixSuffixWriter, always giving the user the full log
- We can seek, efficiently serving massive logs to the frontend
- The code is significantly simpler
I also think we should only have a single listen server on the agent, as is typical of most daemons.
Closing as completed by #6558.