Image generation should provide hash

Open vsoch opened this issue 8 years ago • 36 comments

Any time an image is created, bootstrapped, or otherwise written, we should provide the user with a hash for it when the operation finishes. It should also be set in some environment variable so other processes (e.g. the image builder) can access it. Should it be stored with the image and updated appropriately, perhaps somewhere in .env?

vsoch avatar Dec 09 '16 16:12 vsoch

Great idea, and I think this should be part of the image header itself.

gmkurtzer avatar Dec 09 '16 20:12 gmkurtzer

Greg and I have discussed this and it seems to be best that we implement this in this way:

  1. Hash is stored in the image header, along with a bit describing if the image is static, and a bit describing if the image was run with -w at any point after the last hashing.
  2. singularity hash image.img - Generates the SHA256 hash of the image, not including the header. Resets the bit describing if the image was run with -w to 0
  3. Bootstrap flag that creates the image in read only mode, by setting the bit that marks if the image is static in the header. This bit will be checked at runtime, and will never allow an image with this bit set to be run with -w
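
The scheme in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the header size, the flag names, and the function names are hypothetical and do not reflect Singularity's actual image format.

```python
import hashlib
from enum import IntFlag

# Hypothetical layout: a fixed-size header followed by the image payload.
HEADER_SIZE = 64  # assumption, not the real Singularity header size


class ImageFlags(IntFlag):
    STATIC = 0x1              # image may never be run with -w
    WRITTEN_SINCE_HASH = 0x2  # image was run with -w after the last hashing


def payload_sha256(image_path, header_size=HEADER_SIZE, chunk_size=1 << 20):
    """SHA256 of everything after the header, read in chunks."""
    hasher = hashlib.sha256()
    with open(image_path, "rb") as f:
        f.seek(header_size)  # skip the header so flag changes do not alter the hash
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def clear_written_flag(flags):
    """What step 2 would do to the flag bits after rehashing."""
    return flags & ~ImageFlags.WRITTEN_SINCE_HASH
```

Excluding the header from the digest is what lets the stored hash and the flag bits live inside the same file without invalidating the hash every time a bit flips.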

bauerm97 avatar Dec 12 '16 10:12 bauerm97

@gmkurtzer so right now you are generating a random uuid (uuid4?) are you happy with this over a content hash? I think they are actually different! uuid is for a particular instance of an image, and content hash reflects what is inside. For example, this is what we currently have:

Adding label: 'SINGULARITY_CONTAINER_UUID' = '98e7add0-268f-4463-b3fc-8bee21ddde88'
Adding label: 'SINGULARITY_DEFFILE' = 'Singularity'
Adding label: 'SINGULARITY_BOOTSTRAP_VERSION' = '2.2.99'
Adding label: 'SINGULARITY_DEFFILE_BOOTSTRAP' = 'docker'
Adding label: 'SINGULARITY_DEFFILE_FROM' = 'vanessa/pefinder'
Adding base Singularity environment to container

and I could see having the following:

Adding label: 'SINGULARITY_CONTAINER_UUID' = '98e7add0-268f-4463-b3fc-8bee21ddde88'
Adding label: 'SINGULARITY_DEFFILE' = 'Singularity'
Adding label: 'SINGULARITY_BOOTSTRAP_VERSION' = '2.2.99'
Adding label: 'SINGULARITY_DEFFILE_BOOTSTRAP' = 'docker'
Adding label: 'SINGULARITY_DEFFILE_FROM' = 'vanessa/pefinder'
Adding base Singularity environment to container

bla bla bla bootstrap
Adding label: 'SINGULARITY_CONTAINER_HASH' = '98e7268f4463b3fc8bee21ddde88'

where the hash is based on the contents. Are we planning to wait to add this to the header proper? Is there any reason it can't be supported as a label for now? It's relatively easy to do with python at least, and I'm guessing python just uses system utils. The hash is of course different depending on timestamps, but given a dump of equivalent layers, and if we implemented some kind of strategy that hashed the core container contents taking timestamp into account, and the /.singularity folder not taking it into account, we might be able to get the functionality that we want. Thoughts?

vsoch avatar Mar 20 '17 07:03 vsoch

I like the idea of a content hash, but not sure the best model for creating one that doesn't have massive IO overhead. Thoughts?

gmkurtzer avatar Mar 20 '17 16:03 gmkurtzer

hmm, I can think of ways to have a "quasi hash" (accounting for the layers that get imported if given docker, and the contents of .singularity) but that doesn't help for imports that don't come with those numbers already. Let's keep this open and rethink a bit later, too much to do today and this week, lol.

vsoch avatar Mar 20 '17 17:03 vsoch

We could also run through the container's files, and do some hashing based on what is found. While it is IO heavy, that would guarantee two containers with the same hash are equal.

Let's put it on rocks, and come back to it. :)

gmkurtzer avatar Mar 20 '17 17:03 gmkurtzer

I already have a function in my package library for singularity-python that uses hashlib to do something like this for singularity hub, e.g.:

import hashlib

def get_image_hash(image_path):
    '''get_image_hash will return an md5 hash of the file. Since we don't have git commits
    this seems like a reasonable option to "version" an image, since we can easily say yay or nay
    if the image matches the spec file
    :param image_path: full path to the singularity image
    '''
    hash_md5 = hashlib.md5()
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

It runs fairly quickly, maybe 2-3 seconds per image? I'm running it en-masse now for the shub analysis and I barely notice it, lol.

vsoch avatar Mar 20 '17 17:03 vsoch

Oh, interesting... Pardon my lack of Pythonese, does it run recursively through all the files within image_path?

gmkurtzer avatar Mar 20 '17 23:03 gmkurtzer

I think it reads it in binary chunks (of size 4096) and updates the hashlib md5 sum thing with the new chunk. Then at the end, you get the hexdigest! The open() loop thing just means that it does it while reading the file, versus opening, reading, and then manually closing. Here this might be more informative:

https://docs.python.org/2/library/hashlib.html

Thoughts? It works pretty quickly!

vsoch avatar Mar 20 '17 23:03 vsoch

Ohhhh, this is doing a checksum of the image file itself... No, I was thinking that to do this in a reproducible manner, we need to select certain paths inside the container to hash together and checksum the total. That way we can identify the inside contents. The image itself won't do any good to checksum, we might as well use a random UUID then.

gmkurtzer avatar Mar 21 '17 00:03 gmkurtzer

yeah I learned this when I used the function on 100 identical images and got different results, and then :scream:

vsoch avatar Mar 21 '17 00:03 vsoch

could you expand on "select certain paths?" Which ones wouldn't be selected?

vsoch avatar Mar 21 '17 00:03 vsoch

Well.. once the container is mounted, we could traverse recursively some key directories (perhaps, /etc, /bin, /usr, /lib*) and run these files within these paths through a checksumming algorithm, so at the end we get a single checksum that represents the content of image. Gross, but that would work.
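
A minimal sketch of that traversal, assuming the container is already mounted at some directory `root` (a hypothetical argument) and using a fixed tuple of directories in place of the `/lib*` glob. The function name and the choice of SHA256 are illustrative, not anything Singularity provides.

```python
import hashlib
import os


def content_hash(root, subdirs=("etc", "bin", "usr", "lib")):
    """Hash relative paths and file contents under selected directories.

    Timestamps, ownership and permissions are deliberately ignored.
    """
    hasher = hashlib.sha256()
    for subdir in subdirs:
        top = os.path.join(root, subdir)
        for dirpath, dirnames, filenames in os.walk(top):
            dirnames.sort()  # fix the traversal order so the digest is stable
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue  # skip symlinks and special files in this sketch
                hasher.update(os.path.relpath(path, root).encode())
                hasher.update(b"\0")
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        hasher.update(chunk)
    return hasher.hexdigest()
```

Sorting directory and file names matters here: without a fixed order, two identical trees could be read in different sequences and produce different digests.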

gmkurtzer avatar Mar 21 '17 00:03 gmkurtzer

hmm, but I would argue the opposite - custom user software is probably not going to be where it's supposed to be, because, you know, we're terrible at that :) For example, I put everything in /code or /data.

I guess it comes down to what are the goals of the content hash?

vsoch avatar Mar 21 '17 00:03 vsoch

That's a good point, what are our goals of the content hash? I think I'll defer to @vsoch on that one. lol

gmkurtzer avatar Mar 21 '17 00:03 gmkurtzer

I think the content hash is to see if "the guts of my thing are the same as your thing" - the use case I have now is generating and downloading 100 containers that were generated from the same spec file. The md5 sum is different (different timestamps) but I'd want to be able to show that they really are the same! This is relevant for the shub paper which is waiting on me to figure this out, lol.

vsoch avatar Mar 23 '17 02:03 vsoch

ok I just came up with a really crude way of doing this, but I think it's a reasonable approach to start with - basically I create a tar archive of an image, and then I can select a subset of files to include in the sum. For example, here I am using the same set of images (that before had different hashes when doing the entire image file) for JUST files in /bin

for image_file in image_files:
   os.system('sudo singularity export %s > tmp.tar' %(image_file))
   summy = tarsum('tmp.tar')
   print(summy)

## -- End pasted text --
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709

voilà! same thing! So - this is super cool because we can actually give the user different levels of reproducibility, depending on how we filter what files/folder paths should be included in the sum. I think this alone, or doing a simple analysis to look across an image and "sniff" which folders are consistent and which not (eg, I'd imagine the package lists are NOT, and /tmp is NOT) etc would be super cool for the shub paper, and very useful. (Arguably if you download an image that should be exactly the same file, it should have the strictest level.) In my example, I'm generating the same image from the same spec but have different image files, so I'd want to be more lenient. I'm going back to sleep now, but this I think is going to be a fun algorithm to think about and implement, and then we can define some optimized set of folders/files for the different cases. Let me know if you have thoughts on this!
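
One caveat worth checking before relying on the output above: da39a3ee5e6b4b0d3255bfef95601890afd80709 is the SHA-1 digest of empty input, so identical values may mean that the filter matched no files at all (for example, if the tar member names are stored as ./bin/... rather than /bin/...). A small sanity check, as a sketch with a hypothetical helper name:

```python
import hashlib

# SHA-1 of zero bytes of input; a hasher that was never updated returns this.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def hashed_nothing(digest):
    """True if a SHA-1 hex digest corresponds to empty input."""
    return digest == EMPTY_SHA1


print(hashed_nothing("da39a3ee5e6b4b0d3255bfef95601890afd80709"))  # prints True
```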

vsoch avatar Mar 23 '17 09:03 vsoch

oh and here is my rough scratch code for the function, don't worry will clean this up and optimize to probably not iterate over ALL files :)

import hashlib
import tarfile
import os
import re

def include_file(member):
    # Member names may be stored as "./bin/x", "bin/x" or "/bin/x";
    # normalize to a single leading slash before filtering
    member_path = re.sub(r'^(\./|/)?', '/', member.name)
    return member_path.startswith("/bin/")

def tarsum(input_file):
    '''tarsum returns a sha1 digest over the contents of the selected files
    :param input_file: path to a tar archive, e.g. from singularity export
    '''
    chunk_size = 100*1024
    hasher = hashlib.sha1()
    # Open in binary mode; "r|*" reads the archive as a sequential stream
    with open(input_file, "rb") as fileobj:
        tar = tarfile.open(mode="r|*", fileobj=fileobj)
        for member in tar:
            if member.isfile() and include_file(member):
                filey = tar.extractfile(member)
                buf = filey.read(chunk_size)
                while buf:
                    hasher.update(buf)
                    buf = filey.read(chunk_size)
    return hasher.hexdigest()

I also need to think about using sha1 vs md5 vs sha256. Probably the first two are imperfect but would work, the latter is better but slower. Ahh... tradeoffs!
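
To put numbers on that tradeoff, a rough timing sketch like the one below can help. The helper name and the dummy buffer are made up for illustration, and relative speed depends heavily on the CPU and the OpenSSL build behind hashlib, so it is better to measure than to assume which algorithm is slower.

```python
import hashlib
import time


def time_digest(algorithm, data, repeats=5):
    """Return (hexdigest, best wall-clock seconds) for one algorithm."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        digest = hashlib.new(algorithm, data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return digest, best


if __name__ == "__main__":
    data = b"\0" * (16 * 1024 * 1024)  # 16 MiB of dummy data
    for algorithm in ("md5", "sha1", "sha256"):
        digest, seconds = time_digest(algorithm, data)
        print(f"{algorithm:7s} {len(digest) * 4:4d} bits  {seconds:.3f} s")
```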

vsoch avatar Mar 23 '17 09:03 vsoch

Hi,

So I've been creating a checksum for my image (e.g. shasum -a 256 test.img), and today I noticed that a small change that I made in the %runscript section didn't actually change the checksum. 😱 I'm using version 2.2.1 and would like to know the best way to make a hash for my image. Or is the hash already being provided somewhere?

hisplan avatar May 17 '17 17:05 hisplan

hey @hisplan ! We had originally intended this to be part of Singularity proper, and at one point @gmkurtzer added a random uuid generation (not a hash) but I don't think this was developed further. I did more work on these hash sums for a paper that I am working on, but no decisions / finality ever was integrated into the software proper. I came up with the above hoping to add, but it looks like the issue got lost in the nethers. @gmkurtzer, did you intend to add for 2.3?

vsoch avatar May 17 '17 17:05 vsoch

The work I was doing wound up here and for that specific function, I was using md5. Likely for the image we should use sha256. Container comparison is interesting because the "level of reproducibility" someone might be interested in isn't a black and white thing. Is it the same image, exactly down to the file? Was it built from the same spec? Is the runscript the same? These are all important questions that I started work on - see examples of the different levels here.

vsoch avatar May 17 '17 18:05 vsoch

Same image, same layers, everything exactly the same except one thing: I added 9 additional characters in the %runscript section, e.g. from Rscript xyz.R to Rscript --vanilla xyz.R. I naturally (naively) thought that this would generate a different hash, but it didn't. How come?

hisplan avatar May 17 '17 18:05 hisplan

To be more clear, this is not implemented. I don't know what @gmkurtzer is using to generate a unique id, or other, but it isn't related to any of the methods I discussed or showed above.

vsoch avatar May 17 '17 19:05 vsoch

Right, as a matter of fact, I didn't actually expect Singularity to do anything special. I was just using Linux sha1sum/md5sum to get the checksum of the final Singularity image. I'm wondering why the checksum didn't change even though I did make some changes to the image...

hisplan avatar May 17 '17 19:05 hisplan

oh I see! Are you making changes to the actual runscript or the link to it? For example, the runscript is actually at /.singularity.d/runscript and the /singularity is just a link. Can you give me something to reproduce your error? What I did is:

sha1sum coffee.img 
3f6a212a0fc6d018457be2e6f3c8e8197abbbb43  coffee.img

# Here I am making a trivial change to runscript
sudo singularity shell --writable coffee.img 
Singularity: Invoking an interactive shell within container...

Singularity coffee.img:~> vim /.singularity.d/runscript 
Singularity coffee.img:~> exit

exit

# And now sha1sum is different
vanessa@vanessa-ThinkPad-T460s:~/Desktop$ sha1sum coffee.img
b4cf583b5f3a6bf357d095137e7b5a7c2aa27244  coffee.img

vsoch avatar May 17 '17 19:05 vsoch

It seems to be working for me too:

gmk@gmkdev2:~/git/singularity$ shasum /tmp/centos.img
dc73d6a9aad0b21c1a5972d0a49b7392213934f6  /tmp/centos.img
gmk@gmkdev2:~/git/singularity$ sudo singularity exec -w /tmp/centos.img sh -c "echo '' >> /singularity"
gmk@gmkdev2:~/git/singularity$ shasum /tmp/centos.img
65eccc149e6f565ba85509fb35c1e0737cb7bb09  /tmp/centos.img

Greg

gmkurtzer avatar May 17 '17 21:05 gmkurtzer

Okay, so I did some more experiments, and my bad, the checksum does change, but it also changes when I build the same image (with the same content) again. I guess either way shasum against an image file is not suitable for image comparison. I guess this is the reason why you said the container comparison is an interesting topic.

hisplan avatar May 17 '17 23:05 hisplan

Even if a single file's timestamp changes inside a container, the container itself will have an entirely different hash/checksum. And things like timestamps will indeed change from one bootstrap to another, even with the same bootstrap recipe!
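
The effect is easy to reproduce outside Singularity. The sketch below uses in-memory tar archives as a stand-in for image files (an assumption made for illustration; real images are not tar archives) and hypothetical helper names: two archives hold the same file but with different modification times.

```python
import hashlib
import io
import tarfile


def make_archive(files, mtime):
    """Build an in-memory tar archive; every member gets the given mtime."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def file_hash(archive_bytes):
    """Digest of the raw archive, metadata included."""
    return hashlib.sha256(archive_bytes).hexdigest()


def archive_content_hash(archive_bytes):
    """Digest of member names and file contents only."""
    hasher = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isfile():
                hasher.update(member.name.encode() + b"\0")
                hasher.update(tar.extractfile(member).read())
    return hasher.hexdigest()


files = {"bin/hello": b"#!/bin/sh\necho hello\n"}
first = make_archive(files, mtime=1000)
second = make_archive(files, mtime=2000)
print(file_hash(first) == file_hash(second))                        # False: headers differ
print(archive_content_hash(first) == archive_content_hash(second))  # True: same contents
```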

gmkurtzer avatar May 17 '17 23:05 gmkurtzer

yes! This is exactly why I made these different levels of reproducibility- you can generate the "same" image twice and it will be determined to be equal on the level REPLICATE but not IDENTICAL.

vsoch avatar May 18 '17 00:05 vsoch

Yep, exactly @vsoch! We need to still discuss how best to determine image equality. It is a good question and it needs answering!

gmkurtzer avatar May 18 '17 00:05 gmkurtzer