readthedocs.org icon indicating copy to clipboard operation
readthedocs.org copied to clipboard

Build: cancelling build doesn't cancel build

Open agjohnson opened this issue 2 years ago • 3 comments

Noticed a report of this today, from the astropy docs. The Sphinx command is a long running process and doesn't seem to respond to the cancel request. I see 6 requests to cancel the build after the Sphinx process has started, and the process has yet to respond to the termination request.

I'm guessing the cancel function doesn't send a strong enough signal, or the process is not responding to the signal we're sending it. Alternatively, is this the expected limitation of our current implementation?

agjohnson avatar Apr 08 '22 21:04 agjohnson

I did a bit more digging here and the build instance seems to have gone away due to a scaling in event during the middle of the build. At 6m after the Sphinx build command, the instance was terminated, but it looks like that didn't cause a clean shutdown of the instance and build processes.

Apr  8 20:12:19 build-large-i-0d89f7f8a6b4f4d38 readthedocs[1046]: [info     ] Running build command in container. [readthedocs.doc_builder.environments:288] build_id=16603257 builder=build-large-i-0d89f7f8a6b4f4d38 command=/home/docs/checkouts/readthedocs.org/user_builds/astropy/envs/13063/bin/python -m sphinx -T -E -j auto -W --keep-going -b html -d _build/doctrees -D language=en . _build/html commit=2ab0232b3ee1fecd314137ee4eaedf6a81a576a6 container_id=build-16603257-project-1877-astropy cwd=/home/docs/checkouts/readthedocs.org/user_builds/astropy/checkouts/13063/docs integration_type=github_webhook project_slug=astropy request_id=b2ea41ef-bd17-460f-8372-5cb86fb7f0bd task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf version_slug=13063 webhook_event=pull_request
...
Instance is terminated by ASG at 20:16, build isn't updated and doesn't finish.
...
Apr  8 21:07:24 web-i-03a2e49d95e08c0df readthedocs[1320]: [warning  ] Canceling build.               [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=f3a84de5-6a6f-457e-8860-2ee60e8480d1 terminate=True version_slug=13063
Apr  8 21:07:36 web-i-01ac2a6c71e4451b0 readthedocs[1179]: [warning  ] Canceling build.               [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=baa27365-89e4-4b9c-8960-225fa1425751 terminate=True version_slug=13063
Apr  8 21:07:54 web-i-03a2e49d95e08c0df readthedocs[1317]: [warning  ] Canceling build.               [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=ef1cb986-7826-4935-a3c0-1ccc711efab3 terminate=True version_slug=13063
Apr  8 21:08:24 web-i-000cf40742ca0c677 readthedocs[50239]: [warning  ] Canceling build.               [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=f2e80192-947e-4e59-b3e1-638fee34715b terminate=True version_slug=13063
Apr  8 21:14:40 web-i-03a2e49d95e08c0df readthedocs[1317]: [warning  ] Canceling build.               [readthedocs.builds.views:182] build_id=16603257 build_task_id=41007314-4479-4cd6-a742-1c34c4f8cfbf project_slug=astropy request_id=0f09407a-301f-45c8-8d5f-74c4aa1e4e3b terminate=True version_slug=13063

So the issue looks more like a scaling issue. Why didn't the instance shut down cleanly?

agjohnson avatar Apr 08 '22 22:04 agjohnson

Had another user hit this, we should probably always mark the build as cancelled at https://github.com/readthedocs/readthedocs.org/blob/634f85bf00ffcab7bc91926cc9bf9cc394d375fb/readthedocs/builds/views.py#L174-L196

instead of relying on the task being revoked, but maybe there is a bug in our task that we should fix too.

I don't see in the logs this, so I don't think the signal is being received by the task at all.

https://github.com/readthedocs/readthedocs.org/blob/634f85bf00ffcab7bc91926cc9bf9cc394d375fb/readthedocs/projects/tasks/builds.py#L278-L280

stsewd avatar Jun 30 '22 22:06 stsewd

@agjohnson

So the issue looks more like a scaling issue. Why didn't the instance shut down cleanly?

If it's a spot instance, we won't always shut it down cleanly. The instance is communicated to be shut down 2 minutes before it's killed. If the build is not completed in those 2 minutes, it will die.

humitos avatar Jul 02 '22 10:07 humitos