
CNB-shimmed buildpack generates large iterative images

Open ShadowJonathan opened this issue 5 years ago • 2 comments

With the current CNB shim for this buildpack, every build adds a large amount of data (100 MB or more) under /workspace/.heroku/python to the image, and according to dive none of it is cached or rebased from previous images.

(horrible powershell-launched dive screenshot)

This layer seems to add the Python runtime and the app files at the same time. The following caching mechanism could probably be used instead:

  1. Check the Python version defined in runtime.txt, .python-version, or the default; if a cached layer for it exists, reuse it.
  2. Check the requirements/dependencies (against the current runtime layer); if a cached layer exists, use it.
  3. Add the application files.

Generate layers and metadata for every step.

At best, this'll only add the equivalent size of all application files to the image.

At worst, this'll re-download the python runtime, and reinstall all dependencies, if required.
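
The three steps above can be sketched as cache-key checks. This is only a rough illustration in Python, not what the shim or the buildpacks spec actually does: `plan_layers`, `LayerPlan`, the key scheme, and the `cached_keys` mapping (standing in for metadata saved from the previous build) are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


def cache_key(*parts):
    """Hash the inputs that determine a layer's contents (hypothetical scheme)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


@dataclass
class LayerPlan:
    name: str
    key: str
    reuse: bool  # True if the cached layer from the previous build can be kept


def plan_layers(python_version, requirements_text, cached_keys):
    """Decide which layers can be reused.

    cached_keys maps layer name -> key recorded by the previous build
    (an assumption standing in for per-layer metadata).
    """
    runtime_key = cache_key("runtime", python_version)
    # The dependency key includes the runtime key, so a new Python version
    # also invalidates the installed dependencies.
    deps_key = cache_key("dependencies", runtime_key, requirements_text)
    return [
        LayerPlan("runtime", runtime_key, cached_keys.get("runtime") == runtime_key),
        LayerPlan("dependencies", deps_key, cached_keys.get("dependencies") == deps_key),
        # Application files are always added fresh in this sketch.
        LayerPlan("application", "", False),
    ]
```

In the best case both `runtime` and `dependencies` come back with `reuse=True` and only the application files are new; changing the Python version flips both to `False`, which is the worst case described above.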

Edit: Disclaimer: I haven't really dug into the buildpacks spec enough to know what's possible, or what could be changed in the shimmed image to make it work properly within the context of CNB images.

ShadowJonathan avatar Jun 12 '20 16:06 ShadowJonathan

@ShadowJonathan Hi! Thank you for filing this.

As you spotted, the issues you mention are due to the current Python CNB being a shimmed version of this existing v2 buildpack: https://github.com/heroku/pack-images/blob/4a68227d0523d004cb8b76b1e229e06604c24b44/builder.toml#L29-L31 https://github.com/heroku/cnb-shim

I've recently taken over as the Heroku Python language owner - and there are some reasonably urgent buildpack rough edges that I'd like to tackle shorter term, but adding native CNB support is definitely on the list for later on :-)

edmorley avatar Jun 12 '20 17:06 edmorley

Thanks for the quick reply. To amend my initial proposal: while continuously pulling and pushing images to the repo, I realised that layers are digested based on their diffs (added, moved, and removed files), not on their contents plus the digest of the layer below (which is what I previously assumed, like the DAG-like structure I see in many cases). This makes my new proposal quite simple:

Split the build stage into three: runtime, dependencies, application

Deduplication will happen on the registry/image cache side.
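
To illustrate why that works, here is a simplified model in Python. It is an assumption-laden sketch: real layer digests are SHA-256 hashes of the layer's tar blob, whereas `layer_digest` below hashes a plain dict of path to bytes, and `new_blobs` and the `registry` set are hypothetical stand-ins for a push to a registry.

```python
import hashlib


def layer_digest(files):
    """Digest a layer from its own contents only (no parent digest involved).

    files is a dict of path -> bytes, a simplified stand-in for a layer tarball.
    """
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(files[path])
        h.update(b"\0")
    return "sha256:" + h.hexdigest()


def new_blobs(image_layers, registry):
    """Return the digests that would need uploading; registry is a set of known digests."""
    uploaded = []
    for layer in image_layers:
        digest = layer_digest(layer)
        if digest not in registry:
            registry.add(digest)
            uploaded.append(digest)
    return uploaded


# Two builds that differ only in application files.
runtime = {"/workspace/.heroku/python/bin/python": b"<runtime bytes>"}
deps = {"/workspace/.heroku/python/lib/site-packages/pkg.py": b"<package bytes>"}
build_1 = [runtime, deps, {"/workspace/app.py": b"print('v1')"}]
build_2 = [runtime, deps, {"/workspace/app.py": b"print('v2')"}]

registry = set()
first_push = new_blobs(build_1, registry)
second_push = new_blobs(build_2, registry)
```

Here the first push uploads three layers and the second uploads only the changed application layer, because the runtime and dependency layers produce the same digests in both builds.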

Back to your original comment: if there's anything I can do, or if you have any suggestions for how I could implement my own proposal, please say so, because that could help development quite a lot (both for me personally and more widely).

ShadowJonathan avatar Jun 12 '20 17:06 ShadowJonathan

The Python CNB is in progress (in-between other projects unfortunately, otherwise it would already be complete) - to track development, watch this repo: https://github.com/heroku/buildpacks-python

Closing since the repo this issue is filed in is for the classic buildpack, not the CNB.

edmorley avatar Sep 21 '22 21:09 edmorley