# Store schemas compressed on disk
**Is your feature request related to a problem? Please describe.**
The `data` directory of a `botocore` install is over 50MB. The JSON inside compresses really well, as we can see from the PyPI packages, which are just 7MB.
**Describe the solution you'd like**
It would be good to keep the schemas compressed on disk and only decompress them when reading into memory. This would save disk space, and probably a little time too, since the decompression step is likely to be faster than reading all the bytes from disk.
Python's `zlib` or `zipfile` modules in the standard library could be used.
For an example of a library shipping data in a zip file, see my heroicons package: https://github.com/adamchainz/heroicons/blob/main/src/heroicons/__init__.py
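As a minimal sketch of the decompress-on-read idea, assuming a `.json.gz` naming convention (the helper below is illustrative, not botocore's API):

```python
import gzip
import json
from pathlib import Path


def load_model(path: Path) -> dict:
    """Load a JSON model, preferring a gzip-compressed copy if one exists."""
    gz_path = Path(str(path) + ".gz")  # e.g. service-2.json.gz (assumed layout)
    if gz_path.exists():
        # Decompress in memory; the plaintext file never touches the disk.
        with gzip.open(gz_path, "rt", encoding="utf-8") as f:
            return json.load(f)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```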
Hi @adamchainz,
Thanks for the feature request! I'll review this with the team, although I can't make any guarantees as to when/if this will be implemented.
@adamchainz,
This is an interesting idea. This has been noted previously in a similar scenario with the AWS CLI as well:
https://github.com/aws/aws-cli/issues/5725
The AWS SDKs consume the API models from upstream. Changing the way that they are stored and accessed would be a significant feature. One drawback would be the lack of direct human readability of the API models that are currently available in the Python SDK. It would be difficult to see where API changes were introduced between versions of the SDK. For example, removing the documentation strings from the models would cut 20MB off of the size, which might be useful in a CI/CD environment.
Do you have specific scenarios of your own that a slimmed-down version would help with?
> It would be difficult to see where API changes were introduced between versions of the SDK.
One can use the `textconv` git attribute in the repo to have git decompress the files before comparing them.
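For example, something along these lines (a sketch; the driver name `gzip` is arbitrary) makes `git diff` show the decompressed content:

```
# .gitattributes: route compressed models through a custom diff driver
*.json.gz diff=gzip
```

```console
# One-time setup: define that driver's textconv command
$ git config diff.gzip.textconv "gzip -dc"
```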
> Do you have specific scenarios of your own that a slimmed-down version would help with?
This affects me in a couple ways:
- I bundle `boto3` into my lambda functions so I can pin an exact version. The occasional API change can break code. Bundling botocore takes a function over the 50MB limit, which requires an upload to S3 rather than directly to Lambda, and prevents the console code editor from working.
- I have maybe 30 projects using boto3/botocore, each with their own virtual environment. This means I have 1.5GB of botocore, which isn't a great use of disk space.
I'm in favor of this feature as well. They could stay uncompressed in the source code here, but be bundled into a zip for the released wheel. They'd stay programmatically available in botocore exactly as they are today; it would be the `Loader` that would change to read them out of the zip file rather than directly off disk.
The benefits to install time, artifact size, and Lambda in-console editing would be well worth the effort imo.
Hey all, just wanted to chime in real quick to mention that I took some time today to play around with the ideas here.
I think @benkehoe's suggestion makes a lot of sense, and I took a crack at implementing support for building wheels that include compressed models instead of the plaintext versions. However, rather than modifying the loader to include an additional possible location that checks within a zip, I decided to update the `JSONFileLoader` to look for either a plaintext `.json` file or a gzip-compressed `.json.gz` file. This means that a compressed model can be present in any location the `Loader` class might look (e.g. `~/.aws/models`).
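Roughly, the fallback behaves like this (a simplified sketch, not the exact code on my branch):

```python
import gzip
import json
import os


class JSONFileLoader:
    """Simplified sketch: accept plaintext or gzip-compressed model files."""

    def exists(self, file_path):
        # botocore passes paths without an extension, e.g. ".../2016-11-15/service-2"
        return any(
            os.path.isfile(file_path + ext) for ext in ('.json', '.json.gz')
        )

    def load_file(self, file_path):
        # Prefer the plaintext file, then fall back to the compressed one
        # (the ordering here is an assumption of this sketch).
        for ext, opener in (('.json', open), ('.json.gz', gzip.open)):
            full_path = file_path + ext
            if os.path.isfile(full_path):
                with opener(full_path, 'rb') as fp:
                    return json.loads(fp.read().decode('utf-8'))
        return None
```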
In addition to support for loading gzip-compressed models, I've added a script to the `scripts` folder that will modify a botocore wheel in place, replacing all `.json` files in the `data` directory with a gzip-compressed version. You can take a look at the branch on my fork here.
Using my branch you should be able to generate and then modify a wheel that includes the compressed models instead:

```console
$ python setup.py bdist_wheel
$ ./scripts/compress-wheel-data dist/botocore-*-none-any.whl
```
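The script amounts to rewriting the wheel archive in place; a rough sketch of the idea (a real version must also update the hashes in the wheel's RECORD file, which is omitted here):

```python
import gzip
import shutil
import sys
import zipfile


def compress_wheel_data(whl_path):
    """Rewrite a wheel so data/*.json entries become gzip-compressed *.json.gz."""
    tmp_path = whl_path + '.tmp'
    with zipfile.ZipFile(whl_path) as src, \
            zipfile.ZipFile(tmp_path, 'w', zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            if '/data/' in info.filename and info.filename.endswith('.json'):
                # Replace the plaintext model with a gzipped copy.
                dst.writestr(info.filename + '.gz', gzip.compress(data))
            else:
                dst.writestr(info, data)
    shutil.move(tmp_path, whl_path)


if __name__ == '__main__':
    compress_wheel_data(sys.argv[1])
```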
It'd be great if some of you could test the compressed wheels out as I do have some concerns around compatibility / performance if we were ever to begin publishing wheels like this instead of the uncompressed version.
As for my testing (on an M1 MacBook Pro), I saw the following:
Install times were marginally in favor of the wheel with compressed models, but the difference wasn't significant and might have just been within the margin of error.
Comparing the unpacked wheels, I saw about a 5x reduction in disk space, going from 66M to 13M:
```console
$ du -h -d 0 gzip/botocore-1.23.32
13M	gzip/botocore-1.23.32
$ du -h -d 0 normal/botocore-1.23.32
66M	normal/botocore-1.23.32
```
I also tried creating a new `Session` object and creating a client (the largest model is `ec2` and the smallest is `sagemaker-edge`) to see how this would impact load times. These results are the average of 100 runs:
```
ec2 Avg: 0.05411987456999998, Min: 0.03956283299999974, Max: 0.083342042
sagemaker-edge Avg: 0.02530930502999995, Min: 0.0206438750000002, Max: 0.05621566599999994

ec2 Avg: 0.048753524610000036, Min: 0.034418124999999966, Max: 0.08430220900000002
sagemaker-edge Avg: 0.02403186916999993, Min: 0.01891829100000031, Max: 0.057971249999999586
```
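For context, a harness along these lines (my reconstruction, not the exact script used) produces this kind of measurement:

```python
import timeit

import botocore.session


def time_client_creation(service_name, runs=100):
    def make_client():
        session = botocore.session.get_session()
        # Client creation forces the service model for `service_name` to load.
        session.create_client(service_name, region_name='us-east-1')

    results = timeit.repeat(make_client, number=1, repeat=runs)
    print(f"{service_name} Avg: {sum(results) / runs}, "
          f"Min: {min(results)}, Max: {max(results)}")


time_client_creation('ec2')
time_client_creation('sagemaker-edge')
```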
Unfortunately, loading the compressed models is about 10% slower. I'm sure there are different compression algorithms that might produce better results here, but I'm concerned about compatibility if we were to use a less ubiquitous algorithm than gzip.
Do you know what gzip level you used? Python's `gzip` module defaults to 9, which is the slowest, because it applies the most compression. The `gzip` CLI uses 6 by default.
Even level 1 would probably provide significant gains given the repetition in JSON.
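A quick way to compare levels on a single model (the path here is illustrative):

```python
import gzip
from pathlib import Path

raw = Path("botocore/data/ec2/2016-11-15/service-2.json").read_bytes()
for level in (1, 6, 9):
    compressed = gzip.compress(raw, compresslevel=level)
    print(f"level {level}: {len(compressed) / len(raw):.1%} of original size")
```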
Thanks for taking a look at this! Can we get a comparison of wheel size and performance between compressing the files individually versus all together? I get the benefit of allowing non-default locations to have them individually, but if there's a big difference for the primary package it could make sense to special-case that as a single zip.
@benkehoe The wheel size wasn't significantly impacted by a single zip vs individual models. For the particular botocore version I used, I got the following:
Size of the `.whl`:

- Uncompressed model data dir: 8.6M
- Individual model file compressed data dir: 8.6M
- Single zip for data dir: 8.3M
As for the decompressed package I got:

- Uncompressed model data dir: 66M
- Individual model file compressed data dir: 13M
- Single zip for data dir: 11M
So a slight improvement in favor of a single zip. Getting data on how that affects botocore client load times isn't something I've tested since I haven't implemented it. I do have concerns around the monolithic nature of a single zip and the performance characteristics of random access in the zip.
@adamchainz My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this: higher compression level => slower compression times, smaller files, marginally faster decompress times.
> My understanding was that the compression level mostly impacts the time to compress, so the wheel is generated using level 9 compression on all of the model files. A quick search seems to confirm this: higher compression level => slower compression times, smaller files, marginally faster decompress times.
Ah, you are right. My bad.
@benkehoe I ran a sanity check comparing all 3 by doing a minimal open of a model directly in the `data` dir or `data.zip`:
Loading `ec2/2016-11-15/service-2.json`:

```
normal_open Avg: 0.009063401630000006, Min: 0.008451374999999997, Max: 0.010835124999999945, Sum: 0.9063401630000006
gzip_open Avg: 0.013103255060000008, Min: 0.012516417000000057, Max: 0.015194916000000003, Sum: 1.3103255060000008
nested_zip_open Avg: 0.016699820469999987, Min: 0.015805040999999687, Max: 0.020624874999999765, Sum: 1.6699820469999986
```
Loading `sagemaker-edge/2020-09-23/service-2.json`:

```
normal_open Avg: 4.132742999999974e-05, Min: 3.9582999999999285e-05, Max: 8.28330000000009e-05, Sum: 0.0041327429999999735
gzip_open Avg: 6.306003999999984e-05, Min: 6.0208000000002565e-05, Max: 0.00011729200000000148, Sum: 0.006306003999999983
nested_zip_open Avg: 0.0048496287200000005, Min: 0.00480520799999995, Max: 0.005483624999999992, Sum: 0.48496287200000004
```
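The three code paths under test look roughly like this (my names from the output above; the layout inside the zip is an assumption):

```python
import gzip
import json
import zipfile


def normal_open(path):
    # Baseline: read the plaintext model straight off disk.
    with open(path, 'rb') as f:
        return json.load(f)


def gzip_open(path):
    # Per-file compression: decompress a sibling .gz on the fly.
    with gzip.open(path + '.gz', 'rb') as f:
        return json.load(f)


def nested_zip_open(zip_path, member):
    # Single-archive layout: open data.zip, then extract one member from it.
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as f:
            return json.load(f)
```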
The nested zip is the slowest and impacts smaller models pretty significantly. This is only considering loading the `.json` contents, because we already knew the path. I think when you start to consider the nature of the `Loader` class, the overhead of going into the zip file will be even more significant. The `Loader` traverses sub-directories and lists files to discover available API versions / models, which doesn't really make sense in the context of a zip file. The `ZipFileLoader` class would likely need to be a significant deviation from the existing one to mitigate the performance overhead, and my hunch is that it would still be slower overall.
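To illustrate the discovery mismatch: on disk the `Loader` can list a service's sub-directories directly, whereas inside a zip it has to filter a flat name list (hypothetical helpers, not botocore code):

```python
import os
import zipfile


def list_api_versions_on_disk(data_dir, service_name):
    # Today's approach: each API version is a sub-directory of the service.
    return sorted(os.listdir(os.path.join(data_dir, service_name)))


def list_api_versions_in_zip(zip_path, service_name):
    # Zip approach: scan every entry name and parse out the version segment.
    prefix = service_name + '/'
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({
            name.split('/')[1]
            for name in zf.namelist()
            if name.startswith(prefix) and name.count('/') >= 2
        })
```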
Awesome, this all makes sense. The small difference in size (that surprised me a bit), combined with individual files winning on both performance and code simplicity, makes it no contest. Thanks for humoring me and validating it though!
Feels like this is trying to fix similar symptoms as #1543, but in a different way. Though I don't think the two ideas are mutually exclusive, just linking them.
@gricey432 You're absolutely correct that the two approaches aren't mutually exclusive. When I was doing the initial proof of concept script on my branch I was tempted to add a services filter that could allow the built wheel to only include a subset of services but didn't quite have time.
Could save roughly 50 megs in Lambda installs by doing this. As things stand, installing botocore/boto3 + telemetry tools + something like pandas usually breaks the bank when deploying to Lambda (even after removing .pyc files and stripping shared objects).
Hey everyone, wanted to provide a quick status update.
Starting in 1.32.0, we began compressing select service models (Amazon EC2, Amazon SageMaker, and Amazon QuickSight) in our .whl files distributed on PyPI. With this change, we were able to reduce the size of botocore by 9.4 MB (11%) to a total of 76.1 MB on disk. This was the final step in a series of changes we've made over the last year to validate and enable today's release.
With 1.32.1, we've rolled this change out to all service models in our .whl files. This allows us to shrink botocore from 85.5 MB in our last 1.31.x release to 19.9 MB, for a total savings of 77%. We hope this will be an impactful first step towards making botocore less difficult to use in space-constrained environments.
Going forward, we have additional areas we're looking to improve and will provide updates as we have them. We'd welcome any feedback you might have in the meantime.
Nice work! This is an update I've been waiting for a long time.
```console
❯ for VERSION in 1.31.83 1.31.84 1.31.85 1.32.0 1.32.1; do echo -n "$VERSION --> " && docker run --rm python:3.11-slim bash -c "pip install --disable-pip-version-check --quiet --root-user-action=ignore botocore==$VERSION && du -h -s /usr/local/lib/python3.11/site-packages/botocore"; done
1.31.83 --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.31.84 --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.31.85 --> 86M	/usr/local/lib/python3.11/site-packages/botocore
1.32.0 --> 77M	/usr/local/lib/python3.11/site-packages/botocore
1.32.1 --> 24M	/usr/local/lib/python3.11/site-packages/botocore
```
This is great news! Will this change end up in the CLI as well?
Just a note for people who are excited about the possibilities of smaller Lambda deploy packages: this probably won't help you get under 50 MB, because what you upload to Lambda is typically compressed already.
That is, `botocore` on its own is now smaller because it's compressed, but your package that includes `botocore` isn't - you were already compressing `botocore` yourself. Compressing it twice doesn't help!
I'd also like to drop a plug here for https://github.com/boto/boto3/issues/2702: you tell us botocore version 1.32.1 has this change, and then it's work for us to figure out which boto3 version that is (it's 1.29.1), when they should just be the same.