readthedocs.org
readthedocs.org copied to clipboard
Build: exception with build while rclone sync is running
A user noticed a build failure which seems to be the repository contents causing some error with rclone:
Sentry Issue: READTHEDOCS-ORG-RYA
CalledProcessError: Command '['rclone', 'sync', '--transfers=8', '--checksum', '--verbose', '--retries=3', '--retries-sleep=1s', '--', '/home/docs/checkouts/readthedocs.org/user_builds/bt-tools/checkouts/latest/_readthedocs/html', ':s3:readthedocs-media-prod/html/bt-tools/latest']' returned non-zero exit status 1.
File "readthedocs/projects/tasks/builds.py", line 946, in store_build_artifacts
build_media_storage.rclone_sync_directory(from_path, to_path)
File "readthedocs/builds/storage.py", line 141, in rclone_sync_directory
return self._rclone.sync(source, destination)
File "readthedocs/storage/rclone.py", line 124, in sync
return self.execute("sync", args=[source, self.get_target(destination)])
File "readthedocs/storage/rclone.py", line 101, in execute
result = subprocess.run(
Error copying to storage
I tried this locally, but didn't see much more in the ways of stderr or rclone error output, so we will need to dig a little deeper into the command execution.
Here is a repository that reliable triggers this exception:
https://github.com/boschresearch/bt_tools
I'm going to put this on next sprint so we don't forget about it. It's not a common bug, however the failure case isn't great as it is a silent failure to the user.
This is not a silent failure to the user. We are failing the builds if that happens:
https://github.com/readthedocs/readthedocs.org/blob/6aa2757426d2838557bae2c117780f39d6116d24/readthedocs/projects/tasks/builds.py#L945-L961
They should see "Error uploading files to the storage." error in the build details page.
Seems like that is just our logging though, not reporting/notification to the user.
From the attached conversation, this is the build the user reported:
https://app.readthedocs.org/projects/bt-tools/builds/25303128/
There's no failure mentioned there besides the generic failure:
At least there is a notification, so not silent technically, but there is no indication of what actually failed there.
Yeah, BuildApp exception always shows the generic one by design. We don't want to expose the user with cryptic rclone related or similar issues. They are internal issues.
We should mention something to the user though, not just a generic failure. It indeed does not need to mention rclone, but does need to specifically call out what the user is noticing. I would not mention "Error uploading ..." as uploading is a background technical implementation, without the user knowing what is happening in this step.
Instead, what we should communicate to the user is that their documentation was not updated, or might have only been partially updated, and to resolve the issue they can try rebuilding again.
In this particular case, retrying will not fix the problem though.
we should communicate to the user is that their documentation was not updated, or might have only been partially updated, and to resolve the issue they can try rebuilding again
Good point 👍
We will need to create a specific exception with message_id and notification for this. Then, handle it at https://github.com/readthedocs/readthedocs.org/blob/6aa2757426d2838557bae2c117780f39d6116d24/readthedocs/projects/tasks/builds.py#L480-L483 to check if the exception has a message_id or not.
The problem is that the user is copying the project's root directory itself into the build
https://github.com/boschresearch/bt_tools/blob/2069360ae9ad7f0665008ea8b8020484d928c8f3/docs/source/conf.py#L26-L26
Since we output the docs at the project's root (_readthedocs/html), is copying the files over and over again, that results in files like _readthedocs/html/_readthedocs/html/... that exceed the max length of a file name allowed by S3.
I see that the user is already trying to exclude the build directory
https://github.com/boschresearch/bt_tools/blob/2069360ae9ad7f0665008ea8b8020484d928c8f3/docs/source/conf.py#L29-L36
But since our build directory is different, isn't being excluded. The proper fix should be to be explicit about the directories to include, instead of copying the whole root directory.
So, this isn't a bug in our side, so we still want to show a better error when rclone fails?
I think we need two different solutions here:
- show a better error when rclone fails (this issue)
- try detect the recursion in the directory and show a better message
But since our build directory is different, isn't being excluded. The proper fix should be to be explicit about the directories to include, instead of copying the whole root directory.
I understand this fix is on their side, right?
try detect the recursion in the directory and show a better message
Not sure if we can do that reliably, checking the length of the files may be better, but not sure if it's worth doing, as this isn't a common case as far as I can tell.
Yeah, I'm not sure it's super important to give an exact error message for this scenario, just that we shouldn't give such a generic one. Something closer to:
There was a problem updating your hosted documentation. Double check that you are outputting files to the correct path and try your build again.
Anything to at least give them an idea of where to start debugging. This isn't a common failure, we don't need to give exact suggestions here yet.