# Store schemas compressed on disk
**Is your feature request related to a problem? Please describe.**
The `data` directory of a `botocore` install is over 50MB. The JSON inside compresses really well, as we can see from the PyPI packages, which are just 7MB.
**Describe the solution you'd like**
It would be good to keep the schemas compressed on disk and only decompress them when reading into memory. This would save disk space, and probably a little time too, since the decompression step is likely to be faster than reading all the bytes from disk.
Python's `zlib` or `zipfile` modules in the standard library could be used.
For an example of a library shipping data in a zip file, see my heroicons package: https://github.com/adamchainz/heroicons/blob/main/src/heroicons/__init__.py
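As a minimal sketch of the decompress-on-read idea, assuming a `.json.gz` naming convention (the helper below is illustrative, not botocore's API):

```python
import gzip
import json
from pathlib import Path


def load_model(path: Path) -> dict:
    """Load a JSON model, preferring a gzip-compressed copy if one exists."""
    gz_path = Path(str(path) + ".gz")  # e.g. service-2.json.gz (assumed layout)
    if gz_path.exists():
        # Decompress in memory; the plaintext file never touches the disk.
        with gzip.open(gz_path, "rt", encoding="utf-8") as f:
            return json.load(f)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```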
Hi @adamchainz,
Thanks for the feature request! I'll review this with the team, although I can't make any guarantees as to when/if this will be implemented.
@adamchainz,
This is an interesting idea. This has been noted previously in a similar scenario with the AWS CLI as well:
https://github.com/aws/aws-cli/issues/5725
The AWS SDKs consume the API models from upstream. Changing the way that they are stored and accessed would be a significant feature. One drawback would be the lack of direct human readability of the API models that are currently available in the Python SDK. It would be difficult to see where API changes were introduced between versions of the SDK. For example, removing the documentation strings from the models would cut 20MB off of the size, which might be useful in a CI/CD environment.
Do you have specific scenarios of your own that a slimmed-down version would help with?
> It would be difficult to see where API changes were introduced between versions of the SDK.
One can use the `textconv` git attribute in the repo to have git decompress the files before comparing them.
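For example, something along these lines (a sketch; the driver name `gzip` is arbitrary) makes `git diff` show the decompressed content:

```
# .gitattributes: route compressed models through a custom diff driver
*.json.gz diff=gzip
```

```console
# One-time setup: define that driver's textconv command
$ git config diff.gzip.textconv "gzip -dc"
```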
> Do you have specific scenarios of your own that a slimmed-down version would help with?
This affects me in a couple ways:
- I bundle `boto3` into my lambda functions so I can pin an exact version. The occasional API change can break code. Bundling botocore takes a function over the 50MB limit, which requires an upload to S3 rather than directly to Lambda, and prevents the console code editor from working.
- I have maybe 30 projects using boto3/botocore, each with their own virtual environment. This means I have 1.5GB of botocore, which isn't a great use of disk space.
I'm in favor of this feature as well. They could stay uncompressed in the source code here, but be bundled into a zip for the released wheel. They'd stay programmatically available in botocore exactly as they are today; it would be the `Loader` that would change to read them out of the zip file rather than directly off disk.
The benefits to install time, artifact size, and Lambda in-console editing would be well worth the effort imo.
Hey all, just wanted to chime in real quick to mention that I took some time today to play around with the ideas here.
I think @benkehoe's suggestion makes a lot of sense, and I took a crack at implementing support for building wheels that include compressed models instead of the plaintext versions. However, rather than modifying the loader to include an additional possible location that checks within a zip, I decided to update the `JSONFileLoader` to look for either a plaintext `.json` file or a gzip-compressed `.json.gz` file. This means that a compressed model can be present in any location the `Loader` class might look (e.g. `~/.aws/models`).
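Roughly, the fallback behaves like this (a simplified sketch, not the exact code on my branch):

```python
import gzip
import json
import os


class JSONFileLoader:
    """Simplified sketch: accept plaintext or gzip-compressed model files."""

    def exists(self, file_path):
        # botocore passes paths without an extension, e.g. ".../2016-11-15/service-2"
        return any(
            os.path.isfile(file_path + ext) for ext in ('.json', '.json.gz')
        )

    def load_file(self, file_path):
        # Prefer the plaintext file, then fall back to the compressed one
        # (the ordering here is an assumption of this sketch).
        for ext, opener in (('.json', open), ('.json.gz', gzip.open)):
            full_path = file_path + ext
            if os.path.isfile(full_path):
                with opener(full_path, 'rb') as fp:
                    return json.loads(fp.read().decode('utf-8'))
        return None
```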
In addition to support for loading gzip-compressed models, I've added a script to the `scripts` folder that will modify a botocore wheel in place, replacing all `.json` files in the `data` directory with a gzip-compressed version. You can take a look at the branch on my fork here.
Using my branch you should be able to generate and then modify a wheel that includes the compressed models instead:

```console
$ python setup.py bdist_wheel
$ ./scripts/compress-wheel-data dist/botocore-*-none-any.whl
```
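The script amounts to rewriting the wheel archive in place; a rough sketch of the idea (a real version must also update the hashes in the wheel's RECORD file, which is omitted here):

```python
import gzip
import shutil
import sys
import zipfile


def compress_wheel_data(whl_path):
    """Rewrite a wheel so data/*.json entries become gzip-compressed *.json.gz."""
    tmp_path = whl_path + '.tmp'
    with zipfile.ZipFile(whl_path) as src, \
            zipfile.ZipFile(tmp_path, 'w', zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            if '/data/' in info.filename and info.filename.endswith('.json'):
                # Replace the plaintext model with a gzipped copy.
                dst.writestr(info.filename + '.gz', gzip.compress(data))
            else:
                dst.writestr(info, data)
    shutil.move(tmp_path, whl_path)


if __name__ == '__main__':
    compress_wheel_data(sys.argv[1])
```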
It'd be great if some of you could test the compressed wheels out as I do have some concerns around compatibility / performance if we were ever to begin publishing wheels like this instead of the uncompressed version.
As for my testing (on an M1 MacBook Pro), I saw the following:
Install times were marginally in favor of the wheel with compressed models, but the difference wasn't significant and might have just been within the margin of error.
Comparing the unpacked wheels, I saw about a 5x reduction in disk space, going from 66M to 13M:
```console
$ du -h -d 0 gzip/botocore-1.23.32
13M	gzip/botocore-1.23.32
$ du -h -d 0 normal/botocore-1.23.32
66M	normal/botocore-1.23.32
```
I also tried creating a new `Session` object and creating a client (the largest model is `ec2` and the smallest is `sagemaker-edge`) to see how this would impact load times. These results are the average of 100 runs:
```
ec2 Avg: 0.05411987456999998, Min: 0.03956283299999974, Max: 0.083342042
sagemaker-edge Avg: 0.02530930502999995, Min: 0.0206438750000002, Max: 0.05621566599999994

ec2 Avg: 0.048753524610000036, Min: 0.034418124999999966, Max: 0.08430220900000002
sagemaker-edge Avg: 0.02403186916999993, Min: 0.01891829100000031, Max: 0.057971249999999586
```
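For context, a harness along these lines (my reconstruction, not the exact script used) produces this kind of measurement:

```python
import timeit

import botocore.session


def time_client_creation(service_name, runs=100):
    def make_client():
        session = botocore.session.get_session()
        # Client creation forces the service model for `service_name` to load.
        session.create_client(service_name, region_name='us-east-1')

    results = timeit.repeat(make_client, number=1, repeat=runs)
    print(f"{service_name} Avg: {sum(results) / runs}, "
          f"Min: {min(results)}, Max: {max(results)}")


time_client_creation('ec2')
time_client_creation('sagemaker-edge')
```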
Unfortunately, loading the compressed models is about 10% slower. I'm sure there are different compression algorithms that might produce better results here, but I'm concerned about compatibility if we were to use a less ubiquitous algorithm than gzip.
Do you know what gzip level you used? Python's `gzip` module defaults to 9, which is the slowest, because it applies the most compression. The `gzip` CLI uses 6 by default.
Even level 1 would probably provide significant gains given the repetition in JSON.
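A quick way to compare levels on a single model (the path here is illustrative):

```python
import gzip
from pathlib import Path

raw = Path("botocore/data/ec2/2016-11-15/service-2.json").read_bytes()
for level in (1, 6, 9):
    compressed = gzip.compress(raw, compresslevel=level)
    print(f"level {level}: {len(compressed) / len(raw):.1%} of original size")
```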
Thanks for taking a look at this! Can we get a comparison of wheel size and performance between compressing the files individually versus all together? I get the benefit of allowing non-default locations to have them individually, but if there's a big difference for the primary package it could make sense to special-case that as a single zip.
@benkehoe The wheel size wasn't significantly impacted by a single zip vs individual models. For the particular botocore version I used, I got the following:
Size of the `.whl`:

- Uncompressed model data dir: 8.6M
- Individual model file compressed data dir: 8.6M
- Single zip for data dir: 8.3M
As for the decompressed package I got:

- Uncompressed model data dir: 66M
- Individual model file compressed data dir: 13M
- Single zip for data dir: 11M
So a slight improvement in favor of a single zip. Getting data on how that affects botocore client load times isn't something I've tested since I haven't implemented it. I do have concerns around the monolithic nature of a single zip and the performance characteristics of random access in the zip.
@adamchainz My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this: higher compression level => slower compression times, smaller files, marginally faster decompress times.
> My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this: higher compression level => slower compression times, smaller files, marginally faster decompress times.
Ah, you are right. My bad.
@benkehoe I ran a sanity check comparing all 3 by doing a minimal open of a model directly in the `data` dir or `data.zip`:
Loading `ec2/2016-11-15/service-2.json`:

```
normal_open Avg: 0.009063401630000006, Min: 0.008451374999999997, Max: 0.010835124999999945, Sum: 0.9063401630000006
gzip_open Avg: 0.013103255060000008, Min: 0.012516417000000057, Max: 0.015194916000000003, Sum: 1.3103255060000008
nested_zip_open Avg: 0.016699820469999987, Min: 0.015805040999999687, Max: 0.020624874999999765, Sum: 1.6699820469999986
```
Loading `sagemaker-edge/2020-09-23/service-2.json`:

```
normal_open Avg: 4.132742999999974e-05, Min: 3.9582999999999285e-05, Max: 8.28330000000009e-05, Sum: 0.0041327429999999735
gzip_open Avg: 6.306003999999984e-05, Min: 6.0208000000002565e-05, Max: 0.00011729200000000148, Sum: 0.006306003999999983
nested_zip_open Avg: 0.0048496287200000005, Min: 0.00480520799999995, Max: 0.005483624999999992, Sum: 0.48496287200000004
```
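The three code paths under test look roughly like this (my names from the output above; the layout inside the zip is an assumption):

```python
import gzip
import json
import zipfile


def normal_open(path):
    # Baseline: read the plaintext model straight off disk.
    with open(path, 'rb') as f:
        return json.load(f)


def gzip_open(path):
    # Per-file compression: decompress a sibling .gz on the fly.
    with gzip.open(path + '.gz', 'rb') as f:
        return json.load(f)


def nested_zip_open(zip_path, member):
    # Single-archive layout: open data.zip, then extract one member from it.
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return json.load(f)
```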
The nested zip is the slowest and impacts smaller models pretty significantly. This is only considering loading the `.json` contents, because we already knew the path. I think when you start to consider the nature of the `Loader` class, the overhead of going into the zip file will be even more significant. The `Loader` traverses sub-directories and lists files to discover available API versions / models, which doesn't really make sense in the context of a zip file. The `ZipFileLoader` class would likely need to be a significant deviation from the existing one to mitigate the performance overhead, and my hunch is that it would still be slower overall.
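To illustrate the discovery mismatch: on disk the `Loader` can list a service's sub-directories directly, whereas inside a zip it has to filter a flat name list (hypothetical helpers, not botocore code):

```python
import os
import zipfile


def list_api_versions_on_disk(data_dir, service_name):
    # Today's approach: each API version is a sub-directory of the service.
    return sorted(os.listdir(os.path.join(data_dir, service_name)))


def list_api_versions_in_zip(zip_path, service_name):
    # Zip approach: scan every entry name and parse out the version segment.
    prefix = service_name + '/'
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({
            name.split('/')[1]
            for name in zf.namelist()
            if name.startswith(prefix) and name.count('/') >= 2
        })
```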
Awesome, this all makes sense. The small difference in size (that surprised me a bit), combined with individual files winning on both performance and code simplicity, makes it no contest. Thanks for humoring me and validating it though!
Feels like this is trying to fix similar symptoms as #1543, but in a different way. Though I don't think the two ideas are mutually exclusive, just linking them.
@gricey432 You're absolutely correct that the two approaches aren't mutually exclusive. When I was doing the initial proof of concept script on my branch I was tempted to add a services filter that could allow the built wheel to only include a subset of services but didn't quite have time.
Could save roughly 50 megs in Lambda installs by doing this. As things stand, installing botocore/boto3 + telemetry tools + something like pandas usually breaks the bank when deploying to Lambda (even after removing .pyc files and stripping shared objects).
Hey everyone, wanted to provide a quick status update.
Starting in 1.32.0, we began compressing select service models (Amazon EC2, Amazon SageMaker, and Amazon QuickSight) in our .whl files distributed on PyPI. With this change, we were able to reduce the size of botocore by 9.4 MB (11%) to a total of 76.1 MB on disk. This was the final step in a series of changes we've made over the last year to validate and enable today's release.
With 1.32.1, we've rolled this change out to all service models in our .whl files. This allows us to shrink botocore from 85.5 MB in our last 1.31.x release to 19.9 MB, for a total savings of 77%. We hope this will be an impactful first step towards making botocore less difficult to use in space-constrained environments.
Going forward, we have additional areas we're looking to improve and will provide updates as we have them. We'd welcome any feedback you might have in the meantime.
Nice work! This is an update I've been waiting for a long time.
```console
❯ for VERSION in 1.31.83 1.31.84 1.31.85 1.32.0 1.32.1; do echo -n "$VERSION --> " && docker run --rm python:3.11-slim bash -c "pip install --disable-pip-version-check --quiet --root-user-action=ignore botocore==$VERSION && du -h -s /usr/local/lib/python3.11/site-packages/botocore"; done
1.31.83 --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.31.84 --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.31.85 --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.32.0 --> 77M	/usr/local/lib/python3.11/site-packages/botocore
1.32.1 --> 24M	/usr/local/lib/python3.11/site-packages/botocore
```
This is great news! Will this change end up in the CLI as well?
Just a note for people who are excited about the possibilities of smaller Lambda deploy packages: this probably won't help you get under 50 MB, because what you upload to Lambda is typically compressed already.
That is, `botocore` on its own is now smaller because it's compressed, but your package that includes `botocore` isn't - you were already compressing `botocore` yourself. Compressing it twice doesn't help!
I'd also like to drop a plug here for https://github.com/boto/boto3/issues/2702: you tell us botocore version 1.32.1 has this change, and then it's work for us to figure out which boto3 version that is (it's 1.29.1), when they should just be the same.