Build: improve `upload` step

Open humitos opened this issue 3 years ago • 9 comments

We are using Django storages to delete/upload/sync files from the builders into S3. This is good because we use the same Django storages API regardless of the backend behind the scenes.

However, S3/the API does not support "in bulk" uploads/deletions, so a lot of API requests have to be made to delete/upload a full directory. The code that does this is at https://github.com/readthedocs/readthedocs.org/blob/0a9afda92e38331d21f6915909818be0f7e74e17/readthedocs/builds/storage.py#L48-L136
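Roughly, that code follows a per-file pattern like the sketch below (a simplified, illustrative sketch; the names and structure are mine, not the actual implementation):

```python
# Simplified sketch of the current per-file sync pattern (illustrative only,
# not the actual readthedocs/builds/storage.py code).
import os


def sync_directory(storage, source, destination):
    """Upload every file under ``source`` with one storage call per file."""
    for root, _dirs, files in os.walk(source):
        for name in files:
            local_path = os.path.join(root, name)
            remote_path = os.path.join(
                destination, os.path.relpath(local_path, source)
            )
            with open(local_path, "rb") as fd:
                # One (or more) API requests per file.
                storage.save(remote_path, fd)

    # Deleting stale files is likewise one ``storage.delete()`` call per
    # object, so a build with thousands of files means thousands of requests.
```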

This number of API requests makes the upload process slow, in particular when there are many files. We talked about improving this by using something like rclone (https://rclone.org/) or similar. There is a django-rclone-storage package (https://pypi.org/project/django-rclone-storage/) we can take some inspiration from.

Slightly related to: https://github.com/readthedocs/readthedocs.org/issues/9179

humitos avatar Jul 25 '22 09:07 humitos

I created a Metabase question to find out which projects are most affected by this problem. These are the projects that output a lot of files when building their documentation. We track the number of HTML files, not all files, but it at least gives some insight: https://ethicalads.metabaseapp.com/question/254-projects-with-many-html-files

humitos avatar Aug 30 '22 17:08 humitos

Does rclone do anything different here? I ask because rclone uses the same API endpoints storages uses, though I'm not familiar with what rclone is actually doing in those calls. But when I've used rclone and S3, it seems to perform similar API calls for every file in a directory: to see if the file exists, then to see if the file needs updating, before finally sending the file.

It could provide some benefit with threading, or perhaps checksum matching, but as far as I know, rclone still has to send at least one API call per file in a directory. I don't believe rclone does anything special to upload a directory in bulk. I believe this is what S3 Batch Operations are for, but they seemed rather complex (they involve Lambda).

agjohnson avatar Aug 30 '22 17:08 agjohnson

It does work differently to some degree. We are not performing any kind of timestamp/hash check before uploading a file, so we always upload a file even if it didn't change:

https://github.com/readthedocs/readthedocs.org/blob/7de0e35a351cdf0a9a41fc8e644445f0d460346f/readthedocs/builds/storage.py#L94-L103

Our current BuildMediaStorageMixin.sync_directory is pretty simple and not too smart. It seems that rclone works in a similar way, but we can save some time by avoiding re-uploading files that already exist. They have a whole section about "Reducing Costs" that we can take advantage of: https://rclone.org/s3/#reducing-costs
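To illustrate the kind of check that would avoid re-uploading unchanged files, here is a hedged sketch using boto3 directly (bucket/key names are illustrative, and it assumes single-part uploads so the S3 ETag is the plain MD5 of the content):

```python
# Hedged sketch: skip re-uploading a file whose MD5 matches the S3 ETag.
# Assumes single-part uploads (multipart ETags are not plain MD5 digests).
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def needs_upload(bucket, key, local_path):
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return True  # object does not exist yet
    with open(local_path, "rb") as fd:
        local_md5 = hashlib.md5(fd.read()).hexdigest()
    return head["ETag"].strip('"') != local_md5


# Illustrative usage with made-up bucket/key names.
if needs_upload("my-bucket", "html/project/latest/index.html", "index.html"):
    s3.upload_file("index.html", "my-bucket", "html/project/latest/index.html")
```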

We could do a quick test with one of the projects listed in the Metabase question using our current approach and compare the time against rclone. That will give us some insight into this.

humitos avatar Aug 31 '22 08:08 humitos

Yeah, the overall API count will be about equal I imagine, but checksum matching could be a place where there are gains. Doing a manual test first would be a great idea.

agjohnson avatar Aug 31 '22 15:08 agjohnson

We just deployed the log line that will give us this data. I created a query in New Relic that gives us the projects spending more than 60 seconds on the "Uploading" step: https://onenr.io/02wdVM279jE

For now, ~3 minutes is the maximum we have, which doesn't seem terrible. I think it makes sense to compare that time with the build time (which I don't have in that log line) to get the % of time used on each step. We could get that time, but it will require some extra work.

I will take another look at this in the following days when we have more data and see if I find projects with worse scenarios. Then, we could use those projects for the rclone tests we talked about and compare times.

humitos avatar Sep 06 '22 16:09 humitos

I executed this query again over the last 7 days and found that "Read the Docs for Business" users are the most affected by this: https://onenr.io/0PwJKzg4gR7.

I checked the first result, which took ~900 seconds in the "Uploading" step. The build took 2631 seconds in total and it has a lot of images. In these cases, using rclone will be pretty helpful, since we won't upload those images over and over again.

humitos avatar Sep 12 '22 10:09 humitos

This feature may be of particular interest to the Python documentation community. See https://github.com/python/docs-community/issues/10#issuecomment-1067272072

humitos avatar Sep 15 '22 10:09 humitos

For next steps, we should see what rclone timing looks like. An easy first step would be manually replicating the rclone command from production to our dev/prod s3 bucket -- perhaps for one of the problematic repos.

I think we can assume that rclone run manually would be faster, though the next question would be how much faster the rclone-storages implementation is. I suspect there might be additional operations in that approach, compared to just a raw rclone command.

agjohnson avatar Sep 15 '22 18:09 agjohnson

django-rclone-storage uses subprocess.Popen, so I don't think there will be any difference 😬

humitos avatar Sep 15 '22 20:09 humitos

It runs in multiple processes, so it will definitely be faster. We need something that uploads with a decent bit of concurrency, whether it's Python or Go.

ericholscher avatar Nov 08 '22 19:11 ericholscher

I wrote the comment in the original post about deleting multiple files, and I don't think there is a way to do it efficiently in django-storages. However, the AWS API definitely supports deleting multiple objects in a single API call (docs ref), so it should be possible to do this more efficiently. It's likely possible to upload multiple files efficiently as well.
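For reference, a minimal sketch of that bulk-delete call with boto3 (`delete_objects` accepts up to 1,000 keys per request; the bucket and prefix below are illustrative):

```python
# Minimal sketch: delete everything under a prefix in batches of up to
# 1,000 keys per request, instead of one delete call per object.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # illustrative bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="html/project/old-version/"):
    keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if keys:
        # One API call removes up to 1,000 objects.
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```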

davidfischer avatar Nov 23 '22 17:11 davidfischer

So, are we in favor of using rclone or improving our code to handle bulk operations? Have we done the manual test with rclone yet?

S3 has an option for bulk deletion (up to 1k objects per request): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_objects. It doesn't have an option for bulk upload, but we can make use of multithreading for that.

aws-cli also has a sync command: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html.
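A hedged sketch of the multithreaded-upload idea (paths and bucket are illustrative; this is just the general pattern, not the project's code):

```python
# Hedged sketch: upload many small files concurrently with a thread pool.
import os
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"              # illustrative
source = "/tmp/project/latest"    # illustrative
prefix = "html/project/latest"


def upload(local_path):
    key = f"{prefix}/{os.path.relpath(local_path, source)}"
    s3.upload_file(local_path, bucket, key)


paths = [
    os.path.join(root, name)
    for root, _dirs, files in os.walk(source)
    for name in files
]

# 8 workers is an arbitrary illustrative value.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload, paths))
```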

stsewd avatar Nov 29 '22 01:11 stsewd

@stsewd

So, are we in favor of using rclone or improving our code to handle bulk operations? Have we done the manual test with rclone yet?

We haven't decided on the right tool yet, but we expect rclone will be way better. We haven't done the tests, no. The next step is to do those manual tests with rclone[^1] using some of the projects that take the most time uploading files, from the New Relic query. Let me know if you need some help to figure this out.

[^1]: and maybe aws-cli, if you want to try it, in case you think it would be better/simpler

humitos avatar Nov 29 '22 08:11 humitos

Yea, I think rclone is worth testing, but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.

ericholscher avatar Nov 29 '22 19:11 ericholscher

And here are the results.

setup

The chosen project was astropy, which takes 322 seconds to sync in our application (taken from New Relic). The tests start with a clean upload, then resync other versions to test performance when files change or are deleted, and finally re-upload the same version to see what happens when no files change.

aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.2.x/ /tmp/astropy/v5.2.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v5.1.x/ /tmp/astropy/v5.1.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v2.0.x/ /tmp/astropy/v2.0.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/v1.0.x/ /tmp/astropy/v1.0.x/ --recursive
aws s3 cp s3://readthedocs-media-prod/html/astropy/latest/ /tmp/astropy/latest/ --recursive

aws s3 sync

time aws s3 sync /tmp/astropy/v5.1.x/ s3://readthedocs-media-dev/html/astropy/latest/ --delete

real    0m23.395s
user    0m13.512s
sys     0m2.168s

All other tests with awscli have been omitted, since they rely on the last modified time, and given that files will be re-generated on each build, the tool will re-upload everything.

The following sync command syncs files to a local directory from objects in a specified bucket and prefix by downloading s3 objects. An s3 object will require downloading if one of the following conditions is true:

The s3 object does not exist in the local directory.

The size of the s3 object differs from the size of the local file.

The last modified time of the s3 object is older than the last modified time of the local file.

https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html

test rclone

time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   112.658 MiB / 112.658 MiB, 100%, 1.019 MiB/s, ETA 0s
Transferred:   3942 / 3942, 100%
Elapsed time:  1m4.7s

real    1m4.711s
user    0m3.860s
sys     0m1.399s

time rclone sync -v /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   111.094 MiB / 111.094 MiB, 100%, 3.058 MiB/s, ETA 0s
Checks:        3942 / 3942, 100%
Deleted:       3 (files), 0 (dirs)
Transferred:   2539 / 2539, 100%
Elapsed time:  44.9s

real    0m44.929s
user    0m4.151s
sys     0m1.443s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   96.770 MiB / 96.770 MiB, 100%, 2.376 MiB/s, ETA 0s
Checks:        3980 / 3980, 100%
Deleted:       7 (files), 0 (dirs)
Transferred:   2111 / 2111, 100%
Elapsed time:  45.0s

real    0m45.047s
user    0m4.235s
sys     0m1.377s

time rclone sync -v /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   108.828 MiB / 108.828 MiB, 100%, 2.707 MiB/s, ETA 0s
Checks:        3975 / 3975, 100%
Deleted:       36 (files), 0 (dirs)
Transferred:   2503 / 2503, 100%
Elapsed time:  44.4s

real    0m44.506s
user    0m4.267s
sys     0m1.380s

time rclone sync -v /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   123.839 MiB / 123.839 MiB, 100%, 1.569 MiB/s, ETA 0s
Checks:        3942 / 3942, 100%
Deleted:       1203 (files), 0 (dirs)
Transferred:   2724 / 2724, 100%
Elapsed time:  51.9s

real    0m51.970s
user    0m3.943s
sys     0m1.165s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.121 MiB / 114.121 MiB, 100%, 1.786 MiB/s, ETA 0s
Checks:        3252 / 3252, 100%
Deleted:       514 (files), 0 (dirs)
Transferred:   3452 / 3452, 100%
Elapsed time:  1m2.2s

real    1m2.271s
user    0m4.418s
sys     0m1.458s

time rclone sync -v /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   35.893 MiB / 35.893 MiB, 100%, 716.606 KiB/s, ETA 0s
Checks:        3975 / 3975, 100%
Deleted:       2472 (files), 0 (dirs)
Transferred:   1560 / 1560, 100%
Elapsed time:  39.0s

real    0m39.047s
user    0m2.709s
sys     0m0.803s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.163 MiB / 114.163 MiB, 100%, 1.669 MiB/s, ETA 0s
Checks:        1881 / 1881, 100%
Deleted:       378 (files), 0 (dirs)
Transferred:   3654 / 3654, 100%
Elapsed time:  1m3.6s

real    1m3.660s
user    0m4.358s
sys     0m1.314s

time rclone sync -v /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   0 B / 0 B, -, 0 B/s, ETA -
Checks:        3975 / 3975, 100%
Elapsed time:  4.6s

real    0m4.635s
user    0m1.316s
sys     0m0.316s

test rclone --transfers=8

time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   112.658 MiB / 112.658 MiB, 100%, 3.396 MiB/s, ETA 0s
Transferred:   3942 / 3942, 100%
Elapsed time:  29.9s

real    0m29.927s
user    0m3.699s
sys     0m1.075s

time rclone sync -v --transfers=8 /tmp/astropy/v5.2.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   111.094 MiB / 111.094 MiB, 100%, 5.113 MiB/s, ETA 0s
Checks:        3942 / 3942, 100%
Deleted:       3 (files), 0 (dirs)
Transferred:   2539 / 2539, 100%
Elapsed time:  22.5s

real    0m22.516s
user    0m4.220s
sys     0m1.229s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   96.770 MiB / 96.770 MiB, 100%, 4.924 MiB/s, ETA 0s
Checks:        3980 / 3980, 100%
Deleted:       7 (files), 0 (dirs)
Transferred:   2111 / 2111, 100%
Elapsed time:  20.0s

real    0m20.110s
user    0m4.015s
sys     0m1.100s

time rclone sync -v --transfers=8 /tmp/astropy/v5.1.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   108.828 MiB / 108.828 MiB, 100%, 5.049 MiB/s, ETA 0s
Checks:        3975 / 3975, 100%
Deleted:       36 (files), 0 (dirs)
Transferred:   2503 / 2503, 100%
Elapsed time:  22.1s

real    0m22.167s
user    0m4.173s
sys     0m1.265s

time rclone sync -v --transfers=8 /tmp/astropy/v2.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   123.839 MiB / 123.839 MiB, 100%, 4.412 MiB/s, ETA 0s
Checks:        3942 / 3942, 100%
Deleted:       1203 (files), 0 (dirs)
Transferred:   2724 / 2724, 100%
Elapsed time:  26.7s

real    0m26.751s
user    0m3.666s
sys     0m1.246s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.121 MiB / 114.121 MiB, 100%, 3.966 MiB/s, ETA 0s
Checks:        3252 / 3252, 100%
Deleted:       514 (files), 0 (dirs)
Transferred:   3452 / 3452, 100%
Elapsed time:  29.7s

real    0m29.770s
user    0m4.287s
sys     0m1.322s

time rclone sync -v --transfers=8 /tmp/astropy/v1.0.x/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   35.893 MiB / 35.893 MiB, 100%, 2.103 MiB/s, ETA 0s
Checks:        3975 / 3975, 100%
Deleted:       2472 (files), 0 (dirs)
Transferred:   1560 / 1560, 100%
Elapsed time:  17.2s

real    0m17.271s
user    0m2.586s
sys     0m0.713s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   114.163 MiB / 114.163 MiB, 100%, 3.150 MiB/s, ETA 0s
Checks:        1881 / 1881, 100%
Deleted:       378 (files), 0 (dirs)
Transferred:   3654 / 3654, 100%
Elapsed time:  33.8s

real    0m33.835s
user    0m4.224s
sys     0m1.277s

time rclone sync -v --transfers=8 /tmp/astropy/latest/ remote:readthedocs-media-dev/html/astropy/latest3/
Transferred:   0 B / 0 B, -, 0 B/s, ETA -
Checks:        3975 / 3975, 100%
Elapsed time:  4.6s

real    0m4.676s
user    0m1.410s
sys     0m0.213s

verdict

Increasing the number of parallel uploads in rclone (--transfers=8) is, unsurprisingly, faster. But since aws-cli doesn't work as expected, we don't have anything else to compare against.

We could also test what happens if we improve our code:

  • Checking the file hash, so we don't re-upload it (we could even do this using threads).
  • Doing deletions in bulk
  • Using multiple threads to upload files

but if it's easy to speed up our code that's probably better. If it's really complex to speed up our solution, we should just use an external one.

I'm in favor of giving it a try.

Using an external command has its cons:

  • Extra setup / installation steps
  • Can't be easily tested (we would be mocking the whole command)
  • Logging won't be that nice, since we would only have access to stdout
  • It could be problematic to use that command for some operations while using the rest of the code for others (e.g. if we make changes to how our code reads/writes/deletes a file, those changes won't be reflected in the external tool)

stsewd avatar Nov 29 '22 22:11 stsewd

@stsewd Great work. I do think it's worth moving forward with using rclone for this. I agree with your limitations though. I do know in the past we had a Syncer abstraction that we could reuse for this: https://github.com/readthedocs/readthedocs.org/pull/6535/files#diff-369f6b076f78f7e41570254c681a3871b455a192428d138f2eeda28dc2eaf8c3 -- that would allow us to keep things working as normal in local tests. But I think rclone can also easily support local <-> local file transfers, so maybe we don't need anything that complex?

I generally trust your ideas around implementation, so happy to move forward with what you think is best.

ericholscher avatar Nov 29 '22 22:11 ericholscher

This is great! rclone is ~15x faster than our current approach! 💪🏼 Maybe it's possible to get the best of both approaches?

There is a django-rclone-storage (https://pypi.org/project/django-rclone-storage/) where we can get some inspiration for this.

Why not give this a try? It still executes the rclone command behind the scenes, but at least it gives us a layer of abstraction via a Django storage that integrates well with our code and reduces the cons you mentioned.

humitos avatar Nov 30 '22 09:11 humitos

There is a django-rclone-storage

I also assumed we were talking about this solution as the first to try.

I'm rather hesitant to fiddle with threading in our own application implementation, or to try to get too technically correct here. We don't have many projects being negatively affected by this issue, so our solution should match the severity.

A drop-in replacement is a great option. Even if it's only twice as fast, that's good value for little effort.

agjohnson avatar Nov 30 '22 21:11 agjohnson

django-rclone-storage doesn't have a sync option. It shouldn't be hard to implement one, but that package is really just a thin wrapper around rclone (https://github.com/ElnathMojo/django-rclone-storage/), so I'm not sure we need to introduce a new dependency just for that. I was thinking of using rclone just for the sync operation and having the other parts rely on django-storages (which I kind of prefer, since it has a whole community behind it compared to the other package).

stsewd avatar Dec 01 '22 15:12 stsewd

I haven't customized storages, so I don't know how hard the replacement would be, but this still seems like a fair approach. A package does still seem easiest though, and I'm not dependency-averse here unless the project is unmaintained or something.

I'll defer to those of you who have customized storages more. I'd have questions about our implementation if we're considering our own threading though. That's a rather expensive can of worms.

agjohnson avatar Dec 01 '22 17:12 agjohnson

@agjohnson we would just need to override (or replace) this method: https://github.com/readthedocs/readthedocs.org/blob/54b5e13f1d5ac576b05f180d1b2001e22141d596/readthedocs/builds/storage.py#L94-L94

in https://github.com/readthedocs/readthedocs.org/blob/54b5e13f1d5ac576b05f180d1b2001e22141d596/readthedocs/storage/s3_storage.py#L21-L21

All other operations would be handled by django-storages as usual.
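A rough sketch of what that override could look like if we shell out to rclone and configure it through environment variables (class, setting, and remote names here are illustrative assumptions, not a final implementation):

```python
# Hedged sketch: override sync_directory to shell out to `rclone sync`,
# leaving every other storage operation to django-storages as usual.
# Class and setting names are assumptions for illustration; the
# RCLONE_CONFIG_<REMOTE>_* variables are rclone's documented env-var config.
import os
import subprocess

from django.conf import settings
from storages.backends.s3boto3 import S3Boto3Storage


class RCloneSyncMixin:
    def sync_directory(self, source, destination):
        env = {
            **os.environ,
            "RCLONE_CONFIG_S3_TYPE": "s3",
            "RCLONE_CONFIG_S3_PROVIDER": "AWS",
            # Assumes the standard django-storages AWS settings are defined.
            "RCLONE_CONFIG_S3_ACCESS_KEY_ID": settings.AWS_ACCESS_KEY_ID,
            "RCLONE_CONFIG_S3_SECRET_ACCESS_KEY": settings.AWS_SECRET_ACCESS_KEY,
        }
        remote = f"S3:{self.bucket_name}/{destination}"
        subprocess.run(
            ["rclone", "sync", "--transfers=8", f"{source}/", remote],
            env=env,
            check=True,
        )


class S3BuildMediaStorage(RCloneSyncMixin, S3Boto3Storage):
    """Build media storage where only directory syncs go through rclone."""
```

For local development, rclone's local backend could sync directly between two directories, which would also cover the local <-> local case mentioned above.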

stsewd avatar Dec 01 '22 17:12 stsewd

I think shelling out to rclone is probably fine? Especially since the existing package doesn't do what we want. It does seem a bit more explicit. We can try it out behind a feature flag, and if we find issues with it we can contribute something to the package?

ericholscher avatar Dec 01 '22 22:12 ericholscher

Noting here that we should take into account the symlink issue we found in the last few weeks when swapping the backend. In #9800 all the logic was moved into safe_open (used for opening users' configuration files and also when uploading files to the storage), which will help with this. We will need to apply the same logic[^1] immediately before uploading the files with rclone (see the sketch below).

[^1]: check the symlink target is inside DOCROOT (or more specifically, the project's path if possible)
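A minimal sketch of that pre-upload check (the helper name is hypothetical, and it mirrors the intent of safe_open rather than its actual code):

```python
# Hypothetical pre-upload check: refuse symlinks whose targets escape the
# project's build output directory.
import os


def check_suspicious_symlinks(build_output_dir):
    base = os.path.realpath(build_output_dir)
    for root, dirs, files in os.walk(build_output_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                target = os.path.realpath(path)
                if not target.startswith(base + os.sep):
                    raise ValueError(f"Symlink outside the build directory: {path}")
```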

humitos avatar Dec 13 '22 15:12 humitos