No frame will run for longer than ~5 minutes
Every frame that runs for longer than 5 minutes gets killed by rqd at 5-minutes-and-change.
I was trying to track this down in our (barely customized) fork of OpenCue, but I tested with a vanilla dockerized OpenCue and got the same result. Here's a log from that system for a frame running under rqd. The command is simply /bin/sleep 1000:
===========================================================
RenderQ JobSpec Fri Sep 16 14:48:13 2022
proxy rqd.rqnetwork.RunningFrame/ea49ae9f-48df-4513-82d0-3b1216dff95c -t:tcp -h 172.18.0.5 -p 10021
command /bin/sleep 1000
uid 38859
gid 20
logDestination /tmp/rqd/logs/testtwo/test_shot/logs/testtwo-test_shot-donal_helloworld5--0e4f18fd-a1ae-4277-a343-0c608eb96e65/testtwo-test_shot-donal_helloworld5.0001-test_layer.rqlog
cwd /tmp/testtwo-test_shot-donal_helloworld5/0001-test_layer
renderHost 172.18.0.5
jobId 0e4f18fd-a1ae-4277-a343-0c608eb96e65
frameId ea49ae9f-48df-4513-82d0-3b1216dff95c
env CUE3=1
env CUE_CHUNK=1
env CUE_FRAME=0001-test_layer
env CUE_FRAME_ID=ea49ae9f-48df-4513-82d0-3b1216dff95c
env CUE_GPUS=0
env CUE_GPU_MEMORY=0
env CUE_IFRAME=1
env CUE_JOB=testtwo-test_shot-donal_helloworld5
env CUE_JOB_ID=0e4f18fd-a1ae-4277-a343-0c608eb96e65
env CUE_LAYER=test_layer
env CUE_LAYER_ID=667ebc0c-73f5-4cc7-8ce8-67cc3affaf2a
env CUE_LOG_PATH=/tmp/rqd/logs/testtwo/test_shot/logs/testtwo-test_shot-donal_helloworld5--0e4f18fd-a1ae-4277-a343-0c608eb96e65
env CUE_MEMORY=3355443
env CUE_RANGE=1
env CUE_SHOT=test_shot
env CUE_SHOW=testtwo
env CUE_THREADABLE=0
env CUE_THREADS=1
env CUE_USER=donal
env HOME=/net/homedirs/donal
env LOGNAME=donal
env MAIL=/usr/mail/donal
env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
env SP_NOMYCSHRC=1
env TERM=unknown
env TZ=PST8PDT
env USER=donal
env frame=0001-test_layer
env jobhost=172.18.0.5
env jobid=testtwo-test_shot-donal_helloworld5
env logfile=testtwo-test_shot-donal_helloworld5.0001-test_layer.rqlog
env maxframetime=0
env mcp=1
env minspace=200
env shot=test_shot
env show=testtwo
env zframe=0001-test_layer
===========================================================
===========================================================
RenderQ Job Complete
exitStatus 1
exitSignal 9
killMessage OpenCue could not verify this frame.
startTime Fri Sep 16 14:48:13 2022
endTime Fri Sep 16 14:53:59 2022
maxrss 7760
maxUsedGpuMemory 0
utime 0
stime 0
renderhost 172.18.0.5
===========================================================
I speculated that rqd is killing frames that stop producing output on stdout/stderr for some time, so mostly I've been testing with a Python script that prints out progress tokens at 10-second intervals. It fails in the same way at roughly the same time.
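The script is essentially just something like this (a trimmed-down sketch of what I've been running):

```python
#!/usr/bin/env python
# Stand-in test frame: print a progress token every 10 seconds so
# stdout is never quiet for long, then exit cleanly with status 0.
import sys
import time

for i in range(60):  # ~10 minutes total, well past the ~5-minute kill point
    print("progress %d/60" % (i + 1))
    sys.stdout.flush()
    time.sleep(10)
print("done")
```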
I can add more information (rqd and cuebot log output, etc) to the ticket, but first of all... can someone else please try to reproduce this behaviour on a standard vanilla setup?
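For anyone trying to reproduce this: one way to submit an equivalent job is with pyoutline, roughly along these lines (a sketch from memory of the standard pyoutline API -- double-check the arguments against your version; the show/shot/layer names just mirror the log above):

```python
# Sketch of a pyoutline submission for an equivalent sleep job.
# Treat the constructor and launch arguments as assumptions to verify
# against your pyoutline version.
import outline
import outline.cuerun
from outline.modules.shell import Shell

ol = outline.Outline('donal_helloworld5', show='testtwo', shot='test_shot', user='donal')
ol.add_layer(Shell('test_layer', command=['/bin/sleep', '1000'], range='1'))
outline.cuerun.launch(ol, use_pycuerun=False)
```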
I gave this a try in my local environment. It's not exactly a vanilla setup -- it's master plus a bunch of other fixes I have in progress at the moment to get things working on my M1 mac. But a test job with sleep 1000 finished as expected.
I just tried this again with the dockerized config on another machine and it failed before hitting 6 minutes, so I'm becoming more confident that it's a bug in 0.15.22 (and I suspect it's still in master).
Assuming this has been fixed in your branch, can you give me any guidance on when that might get merged, Brian? Or, if that branch is public, I could check it out myself.
Just noticed -- I see in that log you posted:
killMessage OpenCue could not verify this frame
That message means that Cuebot killed the frame because it thought the host shouldn't be working on that frame anymore.
- In the database, there's a record of every host and proc, and what is assigned to each proc.
- RQD sends periodic reports on what each proc is working on.
- When Cuebot receives that report, for each proc it checks the database to see if that proc is working on the correct frame. If it's not, it sends a request to RQD to kill the frame.
https://github.com/AcademySoftwareFoundation/OpenCue/blob/cd0fc3d9c32c1af30efe02409c765b26e9947ce4/cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java#L661-L662
https://github.com/AcademySoftwareFoundation/OpenCue/blob/cd0fc3d9c32c1af30efe02409c765b26e9947ce4/cuebot/src/main/java/com/imageworks/spcue/dispatcher/HostReportHandler.java#L725-L728
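In rough Python-flavoured pseudocode (illustrative only, not the actual Java linked above), that check amounts to something like:

```python
# Illustrative sketch of the verification described above. The helper
# names (get_proc, kill_frame) are hypothetical; the real logic lives in
# HostReportHandler.java, linked above.
def handle_host_report(report, db, rqd):
    for running_frame in report.frames:
        proc = db.get_proc(running_frame.proc_id)
        if proc is None or proc.frame_id != running_frame.frame_id:
            # Cuebot no longer thinks this proc should be working on this
            # frame, so it asks RQD to kill it -- which shows up in the
            # frame log as "OpenCue could not verify this frame."
            rqd.kill_frame(running_frame.frame_id,
                           "OpenCue could not verify this frame.")
```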
So now the question is, why is Cuebot reassigning that frame?
There are some scheduler changes in master now that have come since that last release, like #1103. I'd be interested to know if Cuebot from master shows the same behavior.
@DiegoTavares Any thoughts here? Seen this before?
BTW I'm working to get a new release out ASAP. A few issues have unfortunately slowed this down -- #1192 in particular has been breaking the CI pipelines that we use to test and release, and #1204 has been slowing me down as well. Hoping to get these all resolved soon.
I noticed in earlier tests that when the rqd node is working on the frame, there is no record in the host_local table. Is that expected?
In the test scenario I had only one rqd node, and I'm speculating a bit here, but it looked like cuebot did not realise the frame was being worked on. It tried to launch the frame on a node, and since there is only one active node, that was the same node already running it. That node returned an error, and cuebot asked it to kill the frame.
I believe I've seen this pattern, as well as the pattern you just outlined. I think I got confused because I was not always seeing the same messages in the rqd log when the frame failed. Sometimes it would receive a launch request which would provoke the error and result in a kill request, and sometimes it would seem to just receive the kill request for no obvious reason.
It sounds as though the 'no obvious reason' could have been prompted by rqd reporting back to cuebot, as you've outlined.
Thanks @bcipriano
I tried another sleep 1000 job to check out the state of the database. Attaching a Cuebot log here for you to compare against.
While the job was running I had all of the appropriate rows in frame, host, and proc tables. I did not have a record in the host_local table. I believe that table is only used for "local dispatch" jobs, which TBH I'm not too familiar with.
A few notes on that log file:
- The `failed to obtain information for proc running on frame` and `stopping frame` messages at the beginning appear to be because I retried a previous job, rather than launching a new one. I think they can be ignored.
- I can see the normal `starting frame` and `creating proc` messages when each frame starts.
- The `verified testing-sh01-cipriano_sleep1/0001-sleep on 172.17.0.3 by grace period` messages are because Cuebot skips the host verification I outlined above for new frames. This is why your frames aren't getting killed right away.
- `172.17.0.3 doesn't have enough idle cores, 0 needs 10` just means that the host doesn't have space for any more frames other than the running ones, also not an issue.
- The `!!!! Inside getPROCS!!!!!` messages looked alarming, but it looks like those are just debugging messages that we missed during some merge. They don't indicate an actual error.
- After the frames finish I see Cuebot cleaning up the frames/jobs/procs as expected.
After the job finished my proc table was empty.
What is the content of your proc table after the job has been running for a minute or so? I'm guessing that table is not being populated correctly?
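If it's easier, one quick way to dump that table is something like this (assuming direct access to the Cuebot Postgres database -- the connection details here are placeholders for whatever your deployment uses):

```python
# Dump the proc table while the job is running. Assumes psycopg2 is
# installed; adjust host/dbname/user/password to match your Cuebot database.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="cuebot", user="cuebot", password="changeme")
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM proc")
    print([d[0] for d in cur.description])  # column names
    for row in cur.fetchall():
        print(row)
conn.close()
```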