JIB needs option to honour source file timestamps

Open hugolumsdon opened this issue 6 years ago • 19 comments

You talk about reproducible builds with the option to either use a 1970 timestamp for all files added to the container image - or use a specific timestamp.

However what WE really want is the option to maintain all the original timestamps from the source WAR / JAR. This surely makes far more sense when we're talking reproducibility.

The docker image build time is kind of irrelevant - all we're doing is overlaying the contents of a WAR/JAR - it's the timestamps of those files in the WAR/JAR archives that we care about as those reflect the version/build dates.

Can you please discuss and consider adding this as a feature. Thanks.

hugolumsdon avatar Sep 25 '19 13:09 hugolumsdon

Hi @hugolumsdon,

We once implemented preserving the timestamps of the files Jib takes to put in the image, but due to some lack of clarity we decided to punt on it until we hear more specific use cases.

Could you elaborate on what exactly you mean by "those files in the WAR/JAR" in your use case? For example, when you compile your source code, the generated .class file will have a timestamp of the build time that is different from that of the source file. So, for example, when different developers build the same repo on different machines, the images will be considered different even if the source is identical. And there are many resources generated at build time, for example, minimized JavaScript files in a WAR. Considering these, we were a bit skeptical about the usefulness of this strategy.

chanseokoh avatar Sep 25 '19 13:09 chanseokoh

And even when we preserve the timestamps of the source files, when different developers check out the same repo (git clone) on different machines, the timestamps of the source files will be different even if the files are identical. Have you considered this aspect too?

chanseokoh avatar Sep 25 '19 13:09 chanseokoh

You have a very good point around 'git clone' timestamps.

However, I think it's a very useful feature to use the source timestamps because it's an easy point of reference if someone wants to compare a file version deployed across hosts - or compare the Docker image with what came out of the build system.

Yes of course we have meta-data / labels with all the build info in - but in the heat of battle our prod-ops teams might want to go check a specific file version on a running host / Docker image instance - or across hosts.

We are incrementally migrating to the cloud - we are currently deploying the same exact applications on bare-metal and wrapped with Docker images on the cloud. Having consistent time/date stamps is useful from the prod-ops perspective.

We typically have /include folders in our WAR files with static assets (css/images/js/fonts) - and we write the file timestamp into the URL and employ far-future cache headers (1 year). Having the same exact cache URL for the same assets on both bare-metal and cloud is a good thing.
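To make the cache-URL concern concrete, here is a minimal Java sketch. The `?v=` query-parameter form, the `versionedUrl` helper, and the sample path are assumptions for illustration only; the thread does not show how the timestamp is actually written into the URL. The sketch shows that a fixed modification time yields the same URL for every release, while a per-release time yields a new URL.

```java
import java.time.Instant;

final class CacheBustingUrl {

    // Hypothetical helper: appends the file's modification time
    // (epoch seconds) to the asset path as a version parameter.
    static String versionedUrl(String assetPath, Instant lastModified) {
        return assetPath + "?v=" + lastModified.getEpochSecond();
    }

    public static void main(String[] args) {
        String asset = "/include/css/site.css"; // illustrative path

        // Fixed timestamp (epoch + 1 second): identical URL for every release,
        // so a far-future cache entry is never invalidated.
        Instant fixed = Instant.ofEpochSecond(1);
        System.out.println(versionedUrl(asset, fixed));

        // Per-release timestamps (for example a commit time): the URL changes
        // with each release, so browsers fetch the new asset.
        Instant releaseA = Instant.parse("2019-09-25T13:00:00Z");
        Instant releaseB = Instant.parse("2019-09-27T21:00:00Z");
        System.out.println(versionedUrl(asset, releaseA));
        System.out.println(versionedUrl(asset, releaseB));
    }
}
```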

hugolumsdon avatar Sep 25 '19 14:09 hugolumsdon

Also currently from the manual it doesn't seem like you can use USE_CURRENT_TIMESTAMP for the filesModificationTime setting.

As per my comments for the /include far-future-caching - we need a timestamp that changes per release. Otherwise obviously, if images change (which happens) it won't be picked up.

So I think having the source file timestamp option would solve this one.

To add - not everyone uses git of course - SVN would keep the original source timestamp.

hugolumsdon avatar Sep 25 '19 14:09 hugolumsdon

USE_CURRENT_TIMESTAMP is not supported for filesModificationTime. If this were supported, would you still need the option to preserve the filesystem timestamps? However, using the current timestamp is extremely bad, since you lose the entire caching benefit. That is, if your application layer is 1GB, the entire 1GB of data will be uploaded every time you make a new build, even if the files are all the same.

And I get the impression that the main reason you want this option is for the static files in /include inside a WAR that are never generated at build time. For example, if you do mvn/gradle clean ..., their timestamps will not change in the build directory (target/ or build/). And what is really required by your current infra is that the file timestamps for these static files must change whenever you make a new release, even if the files themselves don't change. Did I understand it right?

So I think having the source file timestamp option would solve this one. To add - not everyone uses git of course - SVN would keep the original source timestamp.

So I think I'm still unclear about what "source files" means here. For Java source files (.java), do you want to preserve the timestamps of the source files or the compiled .class files? Also, what about other resource files under the build directory, like generated/optimized JavaScript files or resources transformed/copied at build time by various build plugins?

Lastly, what do you think of using a git/SVN commit timestamp for the files? I think that would clearly and robustly mark from which source of truth the image was built upon and it would be easy to identify a release. In that case, whenever you make a new commit for a new release, all the file timestamps including the /include static files will advance per release.

chanseokoh avatar Sep 25 '19 14:09 chanseokoh

The timestamps for the static files either need to reflect the original timestamps from source code control - or they can change per release. What we can't have, is them having a fixed timestamp of 1970 because of the caching.

"source files" - I'm using this term consistently to mean the actual file in the WAR archive. I am not referring to Java source or .class - just what has been exploded from the WAR file.

Your commit timestamp comment is valid. We can live with being able to specify a fixed timestamp for modificationTime (even if it's just ${maven.build.timestamp}). However I still think having the option to reflect the WAR file timestamps is a nice feature. It's what you expect when you explode a WAR file after all. However - I'll concede it's more of a nice-to-have.

Thanks.

hugolumsdon avatar Sep 25 '19 15:09 hugolumsdon

Thanks for the feedback, @hugolumsdon.

(even if it's just ${maven.build.timestamp})

I strongly discourage using a build-time timestamp, since this loses the entire cache benefit. For example, if the size of your application is 1GB, the entire 1GB of layers will be rebuilt, re-cached, and re-uploaded over and over as "new" layers whenever you make a new build, even if you don't change any contents, because of the different timestamps. For this reason, we intentionally do not support USE_CURRENT_TIMESTAMP for files, and we currently don't have any intention to support it in the future either. (USE_CURRENT_TIMESTAMP for the image creation time is far less disastrous, because only a small JSON has to be reconstructed and re-pushed.)

I'll leave this issue open as a low priority, but to be frank, I cannot guarantee that we will ever work on it, for the same reason I said just above. Our philosophy is not just about implementing every nice-to-have feature request or becoming a mere fall-through abstraction on top of Dockerfile/Docker. One of our goals (among many) is to be a good steward in the community of containerization, guide people to build images in the right way, and prevent accidental misuse or inefficiency by blocking potential pitfalls in an opinionated way. If this feature were to be implemented, we would most likely go through lengthy internal debates. I hope you will understand. And as such, I should say again I discourage using the original file timestamps on your filesystem–most likely there should be a better and right way.

chanseokoh avatar Sep 25 '19 18:09 chanseokoh

Hi - ok that's fine, I understand. Your suggestion around using the git commit timestamp was a good one and not something we'd considered - so we're going to go with that.

I also hadn't appreciated the way the caching worked. So thanks for your suggestions and comments, very helpful.

Perhaps others will vote for this feature and come up with other use cases, I don't know.

Regards, Hugo.

hugolumsdon avatar Sep 25 '19 20:09 hugolumsdon

I just want to respond to this point actually:

And as such, I should say again I discourage using the original file timestamps on your filesystem–most likely there should be a better and right way.

That's one way of looking at this - but the converse view is that all JIB is doing is exploding a given WAR file into a Docker image. You could argue it should not be applying its own opinionated approach.

Basically if JIB honoured the original WAR file timestamps, then time stamping is something to be considered when building the WAR file. This would be consistent with the approach when not using Docker - i.e. bare metal - where timestamps are usually preserved when exploding a WAR.

I'm fine with your suggestion to use the git commit timestamp as I already said, but unless I'm misreading this, I think your answer is less about 'best practice' and more about the JIB caching implementation. I'm struggling to see what is better about resetting the timestamps on files exploded from a WAR file than maintaining the originals. Anyway this is now more of a hypothetical discussion admittedly.

hugolumsdon avatar Sep 25 '19 21:09 hugolumsdon

I think your answer is less about 'best practice' and more about the JIB caching implementation

Just wanted to quickly point out that the image caching is universal and not just specific to Jib. If you create an image with a different timestamp, that is a different image by itself from the perspective of Docker and all container registries. If you change a file timestamp, you have to create a new layer (tarball actually) and re-upload the tarballs to, e.g., Docker Hub. Registries won't be able to reuse the previous tarballs and end up duplicating them.
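The point about timestamps producing a new layer can be illustrated with a small Java sketch. This is a deliberately simplified model, not how Jib or a registry computes digests: a real layer is a tar archive whose entry headers carry the modification time, whereas the hypothetical `digest` helper below just hashes path, mtime, and content together. The effect is the same in spirit: identical content with a different timestamp gives a different digest.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class LayerDigestSketch {

    // Simplified stand-in for a layer tarball digest: hashes the path,
    // the modification time, and the content of a single file entry.
    static String digest(String path, long mtimeEpochSeconds, byte[] content) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            sha256.update(path.getBytes(StandardCharsets.UTF_8));
            sha256.update(ByteBuffer.allocate(Long.BYTES).putLong(mtimeEpochSeconds).array());
            sha256.update(content);
            StringBuilder hex = new StringBuilder("sha256:");
            for (byte b : sha256.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is expected to be available on standard Java runtimes.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "body { color: black; }".getBytes(StandardCharsets.UTF_8);

        // Same content, same timestamp: same digest, so the layer can be reused.
        System.out.println(digest("include/site.css", 1L, content));
        System.out.println(digest("include/site.css", 1L, content));

        // Same content, different timestamp: a different digest, so the layer
        // is treated as new and has to be pushed again.
        System.out.println(digest("include/site.css", 1569416940L, content));
    }
}
```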

Moreover, this is also about a fast inner development loop and saving time and resources. For example, if you modify only the Java files, only the layer containing .class files needs to be rebuilt and re-pushed, so this will be fast and efficient in many ways compared to poorly layered images. But using a build-time timestamp will outright invalidate this optimized layering.

chanseokoh avatar Sep 25 '19 21:09 chanseokoh

Ok good point. A change to a file timestamp means a new layer. However if you honoured the WAR timestamps and the same WAR was re-used for multiple builds - there would be no change to that layer, would there? What's the difference?

if you modify only the Java files, only the layer containing .class files needs to be rebuilt and re-pushed

But we're talking exploding a WAR file for our use case. Surely that's just 1 layer for the entire explosion of the WAR onto the container file system - there is no layer just containing .class files.

[and by the way - this is just supposed to be a friendly conceptual discussion - nothing more!]

hugolumsdon avatar Sep 25 '19 22:09 hugolumsdon

No worries for more input. Happy to discuss. We'd like to understand users' use cases and better support them.

However if you honoured the WAR timestamps and the same WAR was re-used for multiple builds - there would be no change to that layer, would there? What's the difference?

Correct. As long as you build and push from the same machine without doing a clean build, Jib would ensure reproducibility. But if you do a clean build or you build on a different machine, you would generate .class files with more recent timestamps and hence different tarballs. We want to discourage this.

But we're talking exploding a WAR file for our use case. Surely that's just 1 layer for the entire explosion of the WAR onto the container file system - there is no layer just containing .class files.

In fact, Jib creates multiple layers for a WAR. WEB-INF/**/*.class files go into the .class layer, other non-.class files under WEB-INF go into the resources layer, WEB-INF/lib/*.jar go into dependency layers, etc. All those layers will be overlaid to form the entire WAR structure. That's how the Docker images work. So if a developer just updates a JAR dependency, only that layer needs to be rebuilt and pushed to a registry. Nothing else is duplicated in Jib's cache or in the registry.

chanseokoh avatar Sep 25 '19 22:09 chanseokoh

In fact, Jib creates multiple layers for a WAR. WEB-INF/**/*.class files go into the .class layer, other non-.class files under WEB-INF go into the resources layer, WEB-INF/lib/*.jar go into dependency layers, etc. All those layers will be overlaid to form the entire WAR structure. That's how the Docker images work. So if a developer just updates a JAR dependency, only that layer needs to be rebuilt and pushed to a registry. Nothing else is duplicated in Jib's cache or in the registry.

Ah - that's what I was missing - thanks! That explains a lot. Is this documented and I just missed it? How many layers are there total for a WAR file - 4?

Anyway - this is a pretty neat idea - although pretty fine grained - and much of the time our pull requests could span many/all of those layers. Still nice idea. Apologies for not exploring the JIB generated Docker file structure.

We have dozens of CI build hosts - so there won't be much advantage taken of the local JIB cache for release builds. However it's still very useful for devs who want to use JIB locally - so I applaud your fine-grained approach.

hugolumsdon avatar Sep 25 '19 22:09 hugolumsdon

Thanks!

The Jib layering is more of an implementation detail and is not documented. This is mostly intended, and I'll explain why. People who are versed in Docker know what they are doing and do care about layering carefully because they know it is important. And some people take reproducibility seriously for security too. Basically, everyone who builds images should have some degree of Docker knowledge because otherwise, they easily tumble and end up creating a poorly composed image that unnecessarily takes a lot of time to rebuild when they touch just one file. It is especially easy to ruin reproducibility. But obviously, not every Java application developer is a Docker expert, and one of the goals of Jib is to address this problem–help Java developers who have never used or heard of Docker easily create an optimized image so that they can just focus on developing their apps. No Dockerfile needed, no Docker installation required, and no Docker expertise. In that sense, we don't really want you to have to explore the Docker image structure generated by Jib. No need to apologize :) That is, we take care of this nonsense automatically, and you do what you do best.

And yes, from your perspective, the Jib layering may sound too fine-grained. Probably true for a pull request. But developers make frequent small changes and rebuild to test their changes locally. And building an image is a lot heavier than just compiling Java source files in an IDE. Every lost second is a loss in productivity. So with Jib, if they change a single Java class, when rebuilding the image locally, they don't have to wait to tar up all the JAR dependencies, which often exceed hundreds of MB.

But to answer your question anyway, the number of layers depends on the WAR contents, but basically,

  • .class layer
  • resources layer
  • third-party dependency layer
  • project-level dependency layer
  • SNAPSHOT dependency layer
  • and any number of extra layers for custom files, if configured
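As a rough illustration of how the path patterns described earlier in the thread map onto the layers listed above, here is a Java sketch. It is not Jib's actual implementation. The `layerFor` helper is hypothetical, detecting SNAPSHOT dependencies by file name is an assumption, files outside WEB-INF are lumped into the resources layer as an assumption, and the project-level dependency layer is omitted because it cannot be told apart from a path alone.

```java
final class WarLayerSketch {

    private static final String LIB_PREFIX = "WEB-INF/lib/";

    // Hypothetical classifier: maps a path inside an exploded WAR to a layer name.
    static String layerFor(String pathInWar) {
        // WEB-INF/lib/*.jar: JARs directly under WEB-INF/lib are dependencies.
        boolean directlyUnderLib = pathInWar.startsWith(LIB_PREFIX)
                && pathInWar.indexOf('/', LIB_PREFIX.length()) < 0;
        if (directlyUnderLib && pathInWar.endsWith(".jar")) {
            // Assumption: a SNAPSHOT dependency is recognizable by its file name.
            return pathInWar.contains("SNAPSHOT") ? "snapshot dependencies" : "dependencies";
        }
        // WEB-INF/**/*.class: compiled classes anywhere under WEB-INF.
        if (pathInWar.startsWith("WEB-INF/") && pathInWar.endsWith(".class")) {
            return "classes";
        }
        // Everything else is treated as a resource in this sketch.
        return "resources";
    }

    public static void main(String[] args) {
        String[] samples = {
            "WEB-INF/classes/com/example/App.class",
            "WEB-INF/lib/guava-28.1-jre.jar",
            "WEB-INF/lib/shared-1.0-SNAPSHOT.jar",
            "WEB-INF/web.xml",
            "include/css/site.css"
        };
        for (String path : samples) {
            System.out.println(path + " -> " + layerFor(path));
        }
    }
}
```

With a split like this, touching one class changes only the classes layer, which is the behaviour the fine-grained layering is meant to give.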

If your WAR is small, then yes, probably it is a non-issue to duplicate new images everywhere (whether it's Jib's cache, local Docker daemon, remote registries, etc.) in your case. But I've seen many cases of huge WAR files. And there is really no cost for going fine-grained. It is almost all the time beneficial, if not always.

chanseokoh avatar Sep 25 '19 23:09 chanseokoh

Thanks.

People who are versed in Docker know what they are doing and do care about layering carefully because they know it is important. And some people take reproducibility seriously for security too. Basically, everyone who builds images should have some degree of Docker knowledge because otherwise, they easily tumble and end up creating a poorly composed image that unnecessarily takes a lot of time to rebuild when they touch just one file. It is especially easy to ruin reproducibility.

Trust me, I care. It's just that we've only recently moved to using JIB for the WAR explosion step rather than doing this in a Docker build, and, juggling dozens of things as usual, we hadn't picked up on the fine-grained JIB layering approach.

By the way - being able to integrate JIB into existing Java/Maven project CI builds to do the WAR overlay onto a base image - and basically squirt out Docker images automatically from our existing CI jobs, is great.

And yes, from your perspective, the Jib layering may sound too fine-grained. Probably true for a pull request. But developers make frequent small changes and rebuild to test their changes locally. And building an image is a lot heavier than just compiling Java source files in an IDE. Every lost second is a loss in productivity. So with Jib, if they change a single Java class, when rebuilding the image locally, they don't have to wait to tar up all the JAR dependencies, which often exceed hundreds of MB.

Please re-read what I wrote - I'm agreeing with you 100%. I never said "too fine-grained".

"Anyway - this is a pretty neat idea - although pretty fine grained .... However it's still very useful for devs who want to use JIB locally - so I applaud your fine-grained approach."

hugolumsdon avatar Sep 25 '19 23:09 hugolumsdon

@hugolumsdon there are plugins for recording build numbers or git commits as part of your build. We have an example using the similar git-commit-id-plugin. Although I haven't used it, the buildnumber-maven-plugin will apparently record the SVN revision number.

briandealwis avatar Sep 26 '19 19:09 briandealwis

@hugolumsdon there are plugins for recording build numbers or git commits as part of your build. We have an example using the similar git-commit-id-plugin. Although I haven't used it, the buildnumber-maven-plugin will apparently record the SVN revision number.

Hey - yeah thanks - actually we already had "git-commit-id-plugin" integrated - so referencing the git commit timestamp in JIB was a very easy change we already made.
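One practical detail when feeding a commit timestamp into filesModificationTime: assuming, as the Jib plugin documentation describes it, that the setting accepts an ISO 8601 date-time parsable by `DateTimeFormatter.ISO_DATE_TIME`, the property value has to be in that shape. The Java sketch below is only a sanity check under that assumption; `isIsoDateTime` is a hypothetical helper and the sample values are made up.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

final class FilesModificationTimeCheck {

    // Returns true if the value parses as an ISO 8601 date-time with the
    // standard Java formatter, false otherwise.
    static boolean isIsoDateTime(String value) {
        try {
            DateTimeFormatter.ISO_DATE_TIME.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Offset written with a colon, or "Z" for UTC: accepted.
        System.out.println(isIsoDateTime("2019-09-25T14:09:00+01:00"));
        System.out.println(isIsoDateTime("2011-12-03T22:42:05Z"));

        // Offset written without a colon: rejected by this formatter, so a
        // date format that emits "+0100" would need adjusting.
        System.out.println(isIsoDateTime("2019-09-25T14:09:00+0100"));
    }
}
```

If the commit-time property is produced by a build plugin, it is worth checking which date format that plugin emits before wiring it into the Jib configuration.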

hugolumsdon avatar Sep 27 '19 21:09 hugolumsdon

Here is a related suggestion that might be a solution to this issue as well: https://github.com/GoogleContainerTools/jib/issues/2462#issuecomment-628064979

ST-DDT avatar May 13 '20 15:05 ST-DDT

Hello, since this discussion is still open, I'd like to present another potential use case (and a solution).

We have a service which hosts static web resources (html, css, js). These resources are either a part of the project itself or they are bundled in other jars (similar to how webjars work). The default Spring MVC behavior while hosting these files is to read their modification time and present it to the client as a Last-Modified header. That information is then used by the browser to cache resources and sent back to the server as an If-Modified-Since header, so that the server can decide whether to respond with status 200 (including body) or to respond with status 304 (and tell the browser to use the cached response).

Since JIB sets all modification times to epoch + 1 second, this behavior does not work as expected. When we release a new version, the server sees the files as unmodified and tells the browser to use the cached version. Since some of the links may have changed at this point, this usually results in errors until the browser cache is reset.
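The failure mode can be sketched in a few lines of Java. This is a simplified model of a conditional GET decision, not Spring MVC's actual code; `statusFor` is a hypothetical helper that compares the file's modification time with the `If-Modified-Since` value sent by the browser.

```java
import java.time.Instant;

final class ConditionalGetSketch {

    // Simplified decision: answer 304 when the file is not newer than the
    // time the browser sent in If-Modified-Since, otherwise 200.
    static int statusFor(Instant fileLastModified, Instant ifModifiedSince) {
        if (ifModifiedSince != null && !fileLastModified.isAfter(ifModifiedSince)) {
            return 304;
        }
        return 200;
    }

    public static void main(String[] args) {
        Instant epochPlusSecond = Instant.ofEpochSecond(1);

        // Release 1 was served with Last-Modified = epoch + 1 second, so the
        // browser later sends that same value back as If-Modified-Since.
        Instant browserCached = epochPlusSecond;

        // Release 2 changed the file content, but its mtime is still
        // epoch + 1 second, so the server answers 304 and the stale copy is kept.
        System.out.println(statusFor(epochPlusSecond, browserCached));

        // With a per-release modification time the server answers 200.
        Instant releaseTime = Instant.parse("2021-02-19T07:00:00Z");
        System.out.println(statusFor(releaseTime, browserCached));
    }
}
```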

I have developed a small JIB extension that goes through the build plan layers and moves all files according to defined filters to a separate new layer which has modification time set to build time. I believe that when configured properly (to only pick up resources which change with every release anyway) we get the best of both worlds - upon new releases only the layer containing the 'modified resources' will be pulled, and we will still have the benefit of nice server behaviour without much tinkering.
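The filtering idea can be sketched as follows. This is not the extension's code and does not use the Jib extension API; the `partition` helper, the glob syntax for filters, and the sample paths are assumptions chosen for illustration. It only shows the split of one file list into "files that get the build-time modification time" and "files that keep the default".

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class LayerFilterSketch {

    // Splits paths into those matching the glob (key true) and the rest (key false).
    static Map<Boolean, List<String>> partition(List<String> paths, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return paths.stream()
                .collect(Collectors.partitioningBy(p -> matcher.matches(Paths.get(p))));
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "static/index.html",
                "static/app.js",
                "classes/com/example/App.class");

        Map<Boolean, List<String>> split = partition(files, "**.{html,css,js}");

        // Matching files would move to a separate layer stamped with the build time.
        System.out.println("build-time layer: " + split.get(true));
        // Everything else stays in its original layer with the default timestamp.
        System.out.println("unchanged layers: " + split.get(false));
    }
}
```

Keeping the filter narrow matters: only the files in the build-time layer lose the caching benefit discussed earlier in the thread.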

Please let me know if this makes sense and if I missed something. Will link the extension once I publish it.

EDIT: https://github.com/infobip/jib-layer-with-modification-time-extension-maven

tstavinoha avatar Feb 19 '21 07:02 tstavinoha

@tstavinoha any hope you could contribute the extension to the main jib-extensions repository?

rmannibucau avatar Jul 10 '23 18:07 rmannibucau

I'm open to doing that, but from the other discussions I'm getting the feeling that other project members would rather have this extension be external - as it is now - and I understand their points. So basically, in order not to waste my time, I can invest some time into moving this into jib-extensions if @chanseokoh agrees.

tstavinoha avatar Jul 11 '23 08:07 tstavinoha

I am no longer one of the maintainers of Jib, just occasionally helping and advising. As I said in https://github.com/GoogleContainerTools/jib/issues/4071#issuecomment-1629111628, I am punting the decision to @GoogleContainerTools/cloud-java-team-teamsync.

chanseokoh avatar Jul 11 '23 14:07 chanseokoh

Thanks again for implementing the extension @tstavinoha! We have decided to keep it as a third-party extension at this time. Folks who are interested in using it can also reference it from this README page.

mpeddada1 avatar Oct 24 '23 19:10 mpeddada1