Build: improve `upload` step
We are using Django storages to delete/upload/sync the files from the builders into S3. This is good because it uses the same Django storages API, no matter what backend is behind the scenes.
However, S3 and its API do not support "in bulk" uploads/deletions, so a lot of API requests have to be made to delete or upload a full directory. The code that does this is at https://github.com/readthedocs/readthedocs.org/blob/0a9afda92e38331d21f6915909818be0f7e74e17/readthedocs/builds/storage.py#L48-L136
This amount of API requests makes the upload process slow, in particular when there are many files. We talked about improving this by using something like rclone (https://rclone.org/) or similar. There is a django-rclone-storage (https://pypi.org/project/django-rclone-storage/) where we can get some inspiration for this.
Slightly related to: https://github.com/readthedocs/readthedocs.org/issues/9179
I created a Metabase question to find out which projects are most affected by this problem. These are the projects that output a lot of files when building their documentation. We only track the amount of HTML files, not all files, but at least it gives some insights: https://ethicalads.metabaseapp.com/question/254-projects-with-many-html-files
Does rclone do anything different here? I ask because rclone uses the same API endpoints as django-storages, though I'm not familiar with what rclone is actually doing in those calls. When I've used rclone with S3, it seems to perform similar API calls for every file in a directory: one to see if the file exists, another to see if the file needs updating, before finally sending the file.
It could provide some benefit with threading, or perhaps checksum matching, but as far as I know, rclone still has to send at least one API call per file in a directory. I don't believe rclone does anything special to upload a directory in bulk. I believe that's what S3 Batch Operations are for, but those seemed rather complex (they involve Lambda).
It does work differently to some degree. We are not performing any kind of timestamp/hash check before uploading a file, so we always upload a file even if it didn't change:
https://github.com/readthedocs/readthedocs.org/blob/7de0e35a351cdf0a9a41fc8e644445f0d460346f/readthedocs/builds/storage.py#L94-L103
Our current BuildMediaStorageMixin.sync_directory is pretty simple and not too smart. It seems that rclone works in a similar way, but we could save some time by avoiding re-uploading files that already exist. They have a whole section about "Reducing Costs" that we can take advantage of: https://rclone.org/s3/#reducing-costs
We could do a quick test with one of the projects listed on the Metabase question using our current approach and compare the time with rclone. That would give us some insight into this.
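For illustration only, a minimal sketch of what such a check could look like on top of boto3 (the helper and function names are made up, and the ETag-equals-MD5 shortcut only holds for non-multipart uploads):

```python
import hashlib

from botocore.exceptions import ClientError


def _local_md5(path):
    """Hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fd:
        for chunk in iter(lambda: fd.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(s3_client, bucket, key, local_path):
    """Return True when the local file differs from the stored object."""
    try:
        head = s3_client.head_object(Bucket=bucket, Key=key)
    except ClientError:
        # The object doesn't exist yet (or we can't read it): upload it.
        return True
    # For single-part uploads the ETag is the object's MD5 checksum.
    return head["ETag"].strip('"') != _local_md5(local_path)
```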
Yeah, the overall API count will be about equal I imagine, but checksum matching could be a place where there are gains. Doing a manual test first would be a great idea.
We just deployed the log line that will give us this data. I created a query in New Relic that gives us the projects spending more than 60 seconds on the "Uploading" step: https://onenr.io/02wdVM279jE
For now, ~3 minutes is the maximum we have, which doesn't seem terrible. I think it makes sense to compare that time with the total build time (which I don't have in that log line) to get the percentage of time used on each step. We could get that time, but it would require some extra work.
I will take another look at this in the following days when we have more data, and see if I find projects with worse scenarios. Then we could use those projects for the rclone tests we talked about and compare times.
I executed this query again over the last 7 days and found that "Read the Docs for Business" users are the most affected by this: https://onenr.io/0PwJKzg4gR7.
I checked the first result, which took ~900 seconds in the "Uploading" step. The build took 2631 seconds in total and it has a lot of images. In these cases, using rclone will be pretty helpful since we won't upload those images over and over again.
This feature may be of particular interest to the Python documentation community. See https://github.com/python/docs-community/issues/10#issuecomment-1067272072
For next steps, we should see what rclone timing looks like. An easy first step would be manually replicating the rclone command from production to our dev/prod s3 bucket -- perhaps for one of the problematic repos.
I think we can assume that rclone run manually would be faster, though the next question would be how much faster the rclone-storages implementation is. I suspect there might be additional operations in that approach, compared to just a raw rclone command.
django-rclone-storage uses `subprocess.Popen`, so I don't think there will be much difference 😬
It runs in multiple processes, so it will definitely be faster. We need something that uploads with a decent bit of concurrency, whether it's Python or Go.
I wrote that comment in the original post about deleting multiple files, and I don't think there is a way to do it efficiently in django-storages. However, the AWS API definitely supports deleting multiple objects in a single API call (docs ref), so it should be possible to do this more efficiently. It's likely possible to upload multiple files efficiently as well.
So, are we in favor of using rclone or improving our code to handle bulk operations? Have we done the manual test with rclone yet?
S3 has the option of bulk deletion (up to 1k objects per request): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_objects. It doesn't have an option for bulk upload, but we can make use of multi-threading for that.
aws-cli also has a sync command: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html.
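A rough sketch of what that bulk deletion could look like with boto3 (the bucket name and key handling are illustrative); `delete_objects` accepts at most 1000 keys per call, so the key list has to be chunked:

```python
import boto3

s3 = boto3.client("s3")


def bulk_delete(bucket, keys):
    """Delete many objects using one API call per batch of up to 1000 keys."""
    for start in range(0, len(keys), 1000):
        batch = keys[start:start + 1000]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in batch]},
        )
```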
@stsewd
So, are we in favor of using rclone or improving our code to handle bulk operations? Have we done the manual test with rclone yet?
We haven't decided yet what the right tool is, but we guess rclone will be way better. We haven't done the tests yet, no. The next step is to do those manual tests with rclone [^1] using some of the projects that take the most time uploading files, from the New Relic query. Let me know if you need some help figuring this out.
[^1]: and maybe aws-cli, if you want to try it, in case you think it would be better/simpler
Yea, I think rclone is worth testing, but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.
And the results.
setup
The chosen project was astropy, which takes 322 seconds to sync in our application (taken from New Relic). The tests start with a clean upload, then re-sync other versions to test performance when files change or are deleted, and finally re-upload the same version to see what happens when no files change.

```
aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.2.x/ /tmp/astropy/v5.2.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.1.x/ /tmp/astropy/v5.1.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v2.0.x/ /tmp/astropy/v2.0.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v1.0.x/ /tmp/astropy/v1.0.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/latest/ /tmp/astropy/latest/ --recursive
```
aws s3 sync
```
time aws s3 sync /tmp/astropy/v5.1.x/ s3://readthedocs-media-dev/html/astropy/latest/ --delete

real    0m23.395s
user    0m13.512s
sys     0m2.168s
```

All other tests with awscli have been omitted, since it relies on the last modified time; given that files are re-generated on each build, the tool will re-upload everything.
> The following sync command syncs files to a local directory from objects in a specified bucket and prefix by downloading s3 objects. An s3 object will require downloading if one of the following conditions is true:
>
> - The s3 object does not exist in the local directory.
> - The size of the s3 object differs from the size of the local file.
> - The last modified time of the s3 object is older than the last modified time of the local file.
https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html
test rclone
```
time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   112.658 MiB / 112.658 MiB, 100%, 1.019 MiB/s, ETA 0s
Transferred:         3942 / 3942, 100%
Elapsed time:       1m4.7s

real    1m4.711s
user    0m3.860s
sys     0m1.399s

time rclone sync -v /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   111.094 MiB / 111.094 MiB, 100%, 3.058 MiB/s, ETA 0s
Checks:              3942 / 3942, 100%
Deleted:                3 (files), 0 (dirs)
Transferred:         2539 / 2539, 100%
Elapsed time:        44.9s

real    0m44.929s
user    0m4.151s
sys     0m1.443s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:    96.770 MiB / 96.770 MiB, 100%, 2.376 MiB/s, ETA 0s
Checks:              3980 / 3980, 100%
Deleted:                7 (files), 0 (dirs)
Transferred:         2111 / 2111, 100%
Elapsed time:        45.0s

real    0m45.047s
user    0m4.235s
sys     0m1.377s

time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   108.828 MiB / 108.828 MiB, 100%, 2.707 MiB/s, ETA 0s
Checks:              3975 / 3975, 100%
Deleted:               36 (files), 0 (dirs)
Transferred:         2503 / 2503, 100%
Elapsed time:        44.4s

real    0m44.506s
user    0m4.267s
sys     0m1.380s

time rclone sync -v /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   123.839 MiB / 123.839 MiB, 100%, 1.569 MiB/s, ETA 0s
Checks:              3942 / 3942, 100%
Deleted:             1203 (files), 0 (dirs)
Transferred:         2724 / 2724, 100%
Elapsed time:        51.9s

real    0m51.970s
user    0m3.943s
sys     0m1.165s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.121 MiB / 114.121 MiB, 100%, 1.786 MiB/s, ETA 0s
Checks:              3252 / 3252, 100%
Deleted:              514 (files), 0 (dirs)
Transferred:         3452 / 3452, 100%
Elapsed time:       1m2.2s

real    1m2.271s
user    0m4.418s
sys     0m1.458s

time rclone sync -v /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:    35.893 MiB / 35.893 MiB, 100%, 716.606 KiB/s, ETA 0s
Checks:              3975 / 3975, 100%
Deleted:             2472 (files), 0 (dirs)
Transferred:         1560 / 1560, 100%
Elapsed time:        39.0s

real    0m39.047s
user    0m2.709s
sys     0m0.803s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.163 MiB / 114.163 MiB, 100%, 1.669 MiB/s, ETA 0s
Checks:              1881 / 1881, 100%
Deleted:              378 (files), 0 (dirs)
Transferred:         3654 / 3654, 100%
Elapsed time:       1m3.6s

real    1m3.660s
user    0m4.358s
sys     0m1.314s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:             0 B / 0 B, -, 0 B/s, ETA -
Checks:              3975 / 3975, 100%
Elapsed time:         4.6s

real    0m4.635s
user    0m1.316s
sys     0m0.316s
```
test rclone --transfers=8
```
time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   112.658 MiB / 112.658 MiB, 100%, 3.396 MiB/s, ETA 0s
Transferred:         3942 / 3942, 100%
Elapsed time:        29.9s

real    0m29.927s
user    0m3.699s
sys     0m1.075s

time rclone sync -v --transfers=8 /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   111.094 MiB / 111.094 MiB, 100%, 5.113 MiB/s, ETA 0s
Checks:              3942 / 3942, 100%
Deleted:                3 (files), 0 (dirs)
Transferred:         2539 / 2539, 100%
Elapsed time:        22.5s

real    0m22.516s
user    0m4.220s
sys     0m1.229s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:    96.770 MiB / 96.770 MiB, 100%, 4.924 MiB/s, ETA 0s
Checks:              3980 / 3980, 100%
Deleted:                7 (files), 0 (dirs)
Transferred:         2111 / 2111, 100%
Elapsed time:        20.0s

real    0m20.110s
user    0m4.015s
sys     0m1.100s

time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   108.828 MiB / 108.828 MiB, 100%, 5.049 MiB/s, ETA 0s
Checks:              3975 / 3975, 100%
Deleted:               36 (files), 0 (dirs)
Transferred:         2503 / 2503, 100%
Elapsed time:        22.1s

real    0m22.167s
user    0m4.173s
sys     0m1.265s

time rclone sync -v --transfers=8 /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   123.839 MiB / 123.839 MiB, 100%, 4.412 MiB/s, ETA 0s
Checks:              3942 / 3942, 100%
Deleted:             1203 (files), 0 (dirs)
Transferred:         2724 / 2724, 100%
Elapsed time:        26.7s

real    0m26.751s
user    0m3.666s
sys     0m1.246s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.121 MiB / 114.121 MiB, 100%, 3.966 MiB/s, ETA 0s
Checks:              3252 / 3252, 100%
Deleted:              514 (files), 0 (dirs)
Transferred:         3452 / 3452, 100%
Elapsed time:        29.7s

real    0m29.770s
user    0m4.287s
sys     0m1.322s

time rclone sync -v --transfers=8 /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:    35.893 MiB / 35.893 MiB, 100%, 2.103 MiB/s, ETA 0s
Checks:              3975 / 3975, 100%
Deleted:             2472 (files), 0 (dirs)
Transferred:         1560 / 1560, 100%
Elapsed time:        17.2s

real    0m17.271s
user    0m2.586s
sys     0m0.713s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.163 MiB / 114.163 MiB, 100%, 3.150 MiB/s, ETA 0s
Checks:              1881 / 1881, 100%
Deleted:              378 (files), 0 (dirs)
Transferred:         3654 / 3654, 100%
Elapsed time:        33.8s

real    0m33.835s
user    0m4.224s
sys     0m1.277s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:             0 B / 0 B, -, 0 B/s, ETA -
Checks:              3975 / 3975, 100%
Elapsed time:         4.6s

real    0m4.676s
user    0m1.410s
sys     0m0.213s
```
verdict
rclone with an increased number of parallel uploads (`--transfers=8`) is fast, obviously.
But since aws-cli doesn't work as expected, we don't have anything else to compare to.
We could also test what happens if we improve our code (see the sketch after this list):
- Checking the file hash, so we don't re-upload unchanged files (we could even do this using threads).
- Doing deletions in bulk.
- Using multiple threads to upload files.
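As a reference for the last point, a minimal sketch of uploading a directory with a small thread pool on top of boto3 (paths, worker count, and the function name are illustrative, not a final design):

```python
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")


def upload_directory(local_dir, bucket, prefix, workers=8):
    """Upload every file under local_dir concurrently using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for root, _dirs, files in os.walk(local_dir):
            for name in files:
                local_path = os.path.join(root, name)
                key = prefix + os.path.relpath(local_path, local_dir)
                futures.append(pool.submit(s3.upload_file, local_path, bucket, key))
        # Surface the first exception, if any transfer failed.
        for future in futures:
            future.result()
```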
but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.
I'm in favor of giving it a try.
Using an external command has its cons:
- Extra setup / installation steps
- Can't be easily tested (we would be mocking the whole command)
- Logging won't be that nice, since we would only have access to stdout
- It could be problematic to use the external command for some operations and our own code for others (e.g., changes we make to our code around reading/writing/deleting a file won't be reflected in the external tool)
@stsewd Great work. I do think it's worth moving forward with using rclone for this. I agree with your limitations though. I do know in the past we had a Syncer abstraction that we could reuse for this: https://github.com/readthedocs/readthedocs.org/pull/6535/files#diff-369f6b076f78f7e41570254c681a3871b455a192428d138f2eeda28dc2eaf8c3 -- that would allow us to keep things working as normal in local tests. But I think rclone can also easily support local <-> local file transfers, so maybe we don't need anything that complex?
I generally trust your ideas around implementation, so happy to move forward with what you think is best.
This is great! rclone is ~15x faster than our current approach! 💪🏼 Maybe it's possible to get the most from both approaches?
There is a django-rclone-storage (https://pypi.org/project/django-rclone-storage/) where we can get some inspiration for this.
Why not give this a try? It still executes the rclone command behind the scenes, but at least it gives us a layer of abstraction via a Django storage that integrates nicely with our code and reduces the cons you mentioned.
There is a django-rclone-storage
I also assumed we were talking about this solution as the first to try.
I'm rather hesitant to fiddle with threading in our own application implementation, or really trying to get too technically correct here. We don't have many projects being negatively affected by this issue, so our solution should match the severity.
A drop in replacement is a great option. Even if it's only twice as fast, that's good value for little effort.
django-rclone-storage doesn't have a sync option. It shouldn't be hard to implement one, but that package is really just a thin wrapper around rclone (https://github.com/ElnathMojo/django-rclone-storage/), so I'm not sure we need to introduce a new dependency just for that. I was thinking of using rclone just for the sync operation, and having the other parts rely on django-storages (which I kind of prefer, since it has a whole community behind it compared to the other package).
I haven't customized storages, so I don't know how hard the replacement would be, but this still seems like a fair approach. A package still seems easiest though, and I'm not dependency-averse here unless the project is unmaintained or something.
I'll defer to one of you who has customized storages more. I'd have questions about our implementation if we're considering our own threading though. That's a rather expensive can of worms.
@agjohnson we would just need to override (or replace) this method https://github.com/readthedocs/readthedocs.org/blob/54b5e13f1d5ac576b05f180d1b2001e22141d596/readthedocs/builds/storage.py#L94-L94
in https://github.com/readthedocs/readthedocs.org/blob/54b5e13f1d5ac576b05f180d1b2001e22141d596/readthedocs/storage/s3_storage.py#L21-L21
All other operations would be handled by django-storages as usual.
I think shelling out to rclone is probably fine? Especially since the existing package doesn't do what we want. It does seem a bit more explicit. We can try it out behind a feature flag, and if we find issues with it we can contribute something to the package?
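Something like this sketch, loosely modeled on the linked code (the class composition, remote name, and error handling are assumptions; it presumes rclone is installed on the builders and an S3 remote named `remote` is already configured):

```python
import subprocess

from readthedocs.builds.storage import BuildMediaStorageMixin
from storages.backends.s3boto3 import S3Boto3Storage


class BuildMediaRcloneS3Storage(BuildMediaStorageMixin, S3Boto3Storage):
    """S3 build media storage that syncs directories by shelling out to rclone."""

    def sync_directory(self, source, destination):
        # Sync the local build directory to the bucket prefix, comparing checksums
        # so unchanged files aren't re-uploaded.
        target = f"remote:{self.bucket_name}/{destination}"
        result = subprocess.run(
            ["rclone", "sync", "--checksum", str(source), target],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"rclone sync failed: {result.stderr}")
```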
Noting here that we should take into account the symlink issue that we found in the last few weeks when swapping the backend. In #9800 all the logic was moved into safe_open (used for opening users' configuration files and also when uploading the files to the storage), which will help with this. We will need to apply the same logic [^1] immediately before uploading the files with rclone.
[^1]: check that the symlink target is inside DOCROOT (or, more specifically, the project's path if possible)
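A small sketch of what that pre-upload check could look like (the function name and the `docroot` argument are illustrative; `Path.is_relative_to` needs Python 3.9+):

```python
from pathlib import Path

from django.core.exceptions import SuspiciousFileOperation


def check_suspicious_symlinks(directory, docroot):
    """Refuse to sync a directory containing symlinks that escape the docroot."""
    docroot = Path(docroot).resolve()
    for path in Path(directory).rglob("*"):
        if path.is_symlink() and not path.resolve().is_relative_to(docroot):
            raise SuspiciousFileOperation(
                f"Symlink target outside the allowed root: {path}"
            )
```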