quad-ci
quad-ci copied to clipboard
Are we missing the last log lines from containers?
First, congrats on the book! π
I'm following along the book and I have an issue with collecting logs. Specifically, a step like this passes:
makeStep "Long step" "ubuntu" ["echo Hello", "sleep 2", "echo world", "sleep 2"]
But a step like this does not:
makeStep "Long step" "ubuntu" ["echo Hello", "sleep 2", "echo world"]
I added tracing to runBuild_, and it looks like there might be an issue where the last logs from a container are dropped on the floor. Specifically, I'm suspicious of this logic:
https://github.com/alpacaaa/quad-ci/blob/6da682dcf069ebd9248396c725ede6c2fb6345b0/src/Core.hs#L179-L185
It looks to me that, if we are CollectingLogs, but the step we are looking at is BuildFinished, we transition into CollectionFinished. Then, the next time we call runCollection:
https://github.com/alpacaaa/quad-ci/blob/6da682dcf069ebd9248396c725ede6c2fb6345b0/src/Core.hs#L204-L207
β¦we return pure []. However, this is where I think we should still call fetchLogs once, because, after all, the container might have still printed something between the time we last fetched logs and the time it exited.
Does the above make sense? It is completely possible I have a mistake in my code, and if I do, my apologies for wasting your time.
Hey Ilmari!
I think you're onto something, this smells like a bug. I'll take a proper look at it and get back to you, thanks for reporting and following along! :)
I can confirm there's a race condition, very good catch! π
It can be reproduced fairly reliably by reducing the sleep time in https://github.com/alpacaaa/quad-ci/blob/3dcbc81be6758f11476a1fefad69d83c7108ae2b/src/Runner.hs#L40
On my machine (but I suspect this is hardware dependent) getting it down to 200ms will get the test to fail about 50% of the times. So how do we fix it? I played around with it a bit but couldn't fix it in a straightforward way. I might have another go at it shortly.
@wolverian thanks for reporting, if you have a fix ready I'd love to take a look :)
Nope, I don't have a fix yet. π I decided to read the rest of the book first. I finished it yesterday. π My initial thought was to add a BuildCleanup state, which would gather any leftover logs.
Looking at it a bit more, perhaps the cleanup state should be on CollectionStatus instead, since it's step-specific.
Congrats on finishing the book! π
I'm starting to think the Docker Engine api can't process logs fast enough in all cases, and will fail to return logs if asked too soon. I've been debugging for the past hour and I think the logic is correct. I traced all the requests made to the api with their result, obtaining something like this:
"/containers/3c65cfbf66cd254ec23df3660289564ffd83035b8f7159f1df5b93c36875c955/logs?stdout=true&stderr=true&since=0&until=1623162716"
"OUTPUT + echo hello\r\nhello\r\n+ sleep 2\r\n"
"---------"
"/containers/3c65cfbf66cd254ec23df3660289564ffd83035b8f7159f1df5b93c36875c955/logs?stdout=true&stderr=true&since=1623162716&until=1623162716"
"OUTPUT"
"---------"
"/containers/3c65cfbf66cd254ec23df3660289564ffd83035b8f7159f1df5b93c36875c955/logs?stdout=true&stderr=true&since=1623162716&until=1623162717"
"OUTPUT"
Some of these calls yield an empty output, which should be fine, we can't expect all of them to return something. But crucially, repeating one of the requests that returned no output through curl will return data.
curl --unix-socket /var/run/docker.sock "http:/v1.40/containers/3c65cfbf66cd254ec23df3660289564ffd83035b8f7159f1df5b93c36875c955/logs?stdout=true&stderr=true&since=1623162716&until=1623162717"
+ echo world
world
That's pretty wild! In fairness, it would make sense for there being a delay between a container outputting something and the daemon ingesting it and making it available through the api. So if you request it too soon, you might not get anything back.
Hopefully that makes sense? I don't think there's a straightforward solution (I'm also on a not-exactly-recent build of the Docker app, so this might have been fixed). One cheeky fix could be not updating the since parameter in the CollectingLogs constructor when the previous log output was "". In other words, we cheat a bit and will not move the since param forward until we get a request that successfully returns data. That's a bit of a hack though and I'm not sold on it.
This is where I'm at! If you have any thoughts please discuss. π
That's super interesting! Do you think that's the cause of the issue I'm having (dropping the last logs from containers before they die), or an orthogonal issue that can cause additional loss?
I suspect it's the same issue. From what I can tell, the only way to get this specific test to fail is for the last log lines to not get collected. I think that explains why adding sleep 2 as a last step solves the problem -- because it gives the daemon enough time to catch up and expose the logs through the api (that's my main assumption at least).
We could potentially split the collecting of logs in its own thread: that would allow to delay collection by a tiny bit, but most importantly it would allow for the collection to happen less frequently (say 2 secs) than the main runner loop (which could stay at 1 sec or further lowered down to 200ms as I did in my tests).
I think that should solve the issue you're facing. I'll see if there's a nice, terse way of implementing this and get back to you!
I suspected there may be an additional issue because, when I looked at it, our state machine collecting logs does not run a collection after the container exited, which means there can always be missed lines from between the last collection and the container exit. Moving log collection to its own thread would mitigate this, but it wouldnβt completely remove the race condition. However, this is only true if I understand the state machine correctly. I suspect I donβt. I honestly need to look at the code more. π
I see where you're coming from, there still might be a race condition I'm not taking into account π. I'll try to come up with a slightly different design and check back!