readthedocs.org
readthedocs.org copied to clipboard
Build: cancelling build doesn't cancel build
Noticed a report of this today, from the astropy docs. The Sphinx command is a long running process and doesn't seem to respond to the cancel request. I see 6 requests to cancel the build after the Sphinx process has started, and the process has yet to respond to the termination request.
I'm guessing the cancel function doesn't send a strong enough signal, or the process is not responding to the signal we're sending it. Alternatively, is this the expected limitation of our current implementation?
I did a bit more digging here and the build instance seems to have gone away due to a scaling in event during the middle of the build. At 6m after the Sphinx build command, the instance was terminated, but it looks like that didn't cause a clean shutdown of the instance and build processes.
Apr 8 20:12:19 build-large-i-0d89f7f8a6b4f4d38 readthedocs[1046]: [info ] Running build command in container. [readthedocs.doc_builder.environments:288] build_id=16603257 builder=build-large-i-0d89f7f8a6b4f4d38 command=/home/docs/checkouts/readthedocs.org/user_builds/astropy/envs/13063/bin/python -m sphinx -T -E -j auto -W --keep-going -b html -d _build/doctrees -D language=en . _build/html commit=2ab0232b3ee1fecd314137ee4eaedf6a81a576a6 container_id=build-16603257-project-1877-astropy cwd=/home/docs/checkouts/readthedocs.org/user_builds/astropy/checkouts/13063/docs integration_type=github_webhook project_slug=astropy request_id=b2ea41ef-bd17-460f-8372-5cb86fb7f0bd task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf version_slug=13063 webhook_event=pull_request
...
Instance is terminated by ASG at 20:16, build isn't updated and doesn't finish.
...
Apr 8 21:07:24 web-i-03a2e49d95e08c0df readthedocs[1320]: [warning ] Canceling build. [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=f3a84de5-6a6f-457e-8860-2ee60e8480d1 terminate=True version_slug=13063
Apr 8 21:07:36 web-i-01ac2a6c71e4451b0 readthedocs[1179]: [warning ] Canceling build. [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=baa27365-89e4-4b9c-8960-225fa1425751 terminate=True version_slug=13063
Apr 8 21:07:54 web-i-03a2e49d95e08c0df readthedocs[1317]: [warning ] Canceling build. [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=ef1cb986-7826-4935-a3c0-1ccc711efab3 terminate=True version_slug=13063
Apr 8 21:08:24 web-i-000cf40742ca0c677 readthedocs[50239]: [warning ] Canceling build. [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=f2e80192-947e-4e59-b3e1-638fee34715b terminate=True version_slug=13063
Apr 8 21:14:40 web-i-03a2e49d95e08c0df readthedocs[1317]: [warning ] Canceling build. [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=0f09407a-301f-45c8-8d5f-74c4aa1e4e3b terminate=True version_slug=13063
So the issue looks more like a scaling issue. Why didn't the instance shut down cleanly?
Had another user hit this, we should probably always mark the build as cancelled at https://github.com/readthedocs/readthedocs.org/blob/634f85bf00ffcab7bc91926cc9bf9cc394d375fb/readthedocs/builds/views.py#L174-L196
instead of relying on the task being revoked, but maybe there is a bug in our task that we should fix too.
I don't see in the logs this, so I don't think the signal is being received by the task at all.
https://github.com/readthedocs/readthedocs.org/blob/634f85bf00ffcab7bc91926cc9bf9cc394d375fb/readthedocs/projects/tasks/builds.py#L278-L280
@agjohnson
So the issue looks more like a scaling issue. Why didn't the instance shut down cleanly?
If it's a spot instance, we won't always shut it down cleanly. The instance is communicated to be shut down 2 minutes before it's killed. If the build is not completed in those 2 minutes, it will die.