Image generation should provide hash
Any time an image is created, bootstrapped, or otherwise written, upon finish we should provide the user with a hash for it. It should also be set in an environment variable so that other processes (e.g. the image builder) can access it. Should it be stored alongside the image and updated appropriately, perhaps somewhere in .env?
Great idea, and I think this should be part of the image header itself.
Greg and I have discussed this and it seems to be best that we implement this in this way:
- The hash is stored in the image header, along with a bit describing whether the image is static, and a bit describing whether the image was run with `-w` at any point after the last hashing.
- `singularity hash image.img` generates the SHA256 hash of the image, not including the header, and resets the "run with `-w`" bit to 0.
- A bootstrap flag creates the image in read-only mode by setting the bit that marks the image as static in the header. This bit is checked at runtime, and an image with this bit set will never be allowed to run with `-w`.
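As a rough illustration of the proposed bookkeeping, here is a hypothetical sketch in Python (this is not the actual header format; the field names, flag values, and the `rehash`/`can_run_writable` helpers are all made up for illustration):

```python
# Hypothetical sketch of the proposed header bits (not the real image format)
STATIC_BIT   = 0b01  # image was bootstrapped read-only
WRITABLE_BIT = 0b10  # image has been run with -w since the last hashing

def rehash(header, new_sum):
    # `singularity hash` would store the fresh sum and clear the -w bit
    header["sha256"] = new_sum
    header["flags"] &= ~WRITABLE_BIT
    return header

def can_run_writable(header):
    # At runtime, a static image is never allowed to run with -w
    return not header["flags"] & STATIC_BIT

header = {"sha256": None, "flags": WRITABLE_BIT}
header = rehash(header, "0123abcd")  # placeholder sum
print(header["flags"] & WRITABLE_BIT)  # prints 0
```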
@gmkurtzer so right now you are generating a random uuid (uuid4?). Are you happy with this over a content hash? I think they are actually different: a uuid identifies a particular instance of an image, while a content hash reflects what is inside. For example, this is what we currently have:
Adding label: 'SINGULARITY_CONTAINER_UUID' = '98e7add0-268f-4463-b3fc-8bee21ddde88'
Adding label: 'SINGULARITY_DEFFILE' = 'Singularity'
Adding label: 'SINGULARITY_BOOTSTRAP_VERSION' = '2.2.99'
Adding label: 'SINGULARITY_DEFFILE_BOOTSTRAP' = 'docker'
Adding label: 'SINGULARITY_DEFFILE_FROM' = 'vanessa/pefinder'
Adding base Singularity environment to container
and I could see having the following:
Adding label: 'SINGULARITY_CONTAINER_UUID' = '98e7add0-268f-4463-b3fc-8bee21ddde88'
Adding label: 'SINGULARITY_DEFFILE' = 'Singularity'
Adding label: 'SINGULARITY_BOOTSTRAP_VERSION' = '2.2.99'
Adding label: 'SINGULARITY_DEFFILE_BOOTSTRAP' = 'docker'
Adding label: 'SINGULARITY_DEFFILE_FROM' = 'vanessa/pefinder'
Adding base Singularity environment to container
bla bla bla bootstrap
Adding label: 'SINGULARITY_CONTAINER_HASH' = '98e7268f4463b3fc8bee21ddde88'
where the hash is based on the contents. Are we planning to wait to add this to the header proper? Is there any reason it can't be supported as a label for now? It's relatively easy to do with Python at least, and I'm guessing Python just uses system utils. The hash is of course different depending on timestamps, but given a dump of equivalent layers, if we implemented some strategy that hashed the core container contents taking timestamps into account, and the /.singularity folder not taking them into account, we might be able to get the functionality that we want. Thoughts?
I like the idea of a content hash, but not sure the best model for creating one that doesn't have massive IO overhead. Thoughts?
hmm, I can think of ways to have a "quasi hash" (accounting for the layers that get imported if given docker, and the contents of /.singularity), but that doesn't help for imports that don't come with those numbers already. Let's keep this open and rethink a bit later; too much to do today and this week, lol.
We could also run through the container's files, and do some hashing based on what is found. While it is IO heavy, that would guarantee two containers with the same hash are equal.
Let's put it on rocks, and come back to it. :)
I already have a function in my package library for singularity-python that uses hashlib to do something like this for Singularity Hub, e.g.:
```python
import hashlib

def get_image_hash(image_path):
    '''get_image_hash will return an md5 hash of the file. Since we don't have git commits,
    this seems like a reasonable option to "version" an image, since we can easily say yay or nay
    if the image matches the spec file.
    :param image_path: full path to the singularity image
    '''
    hash_md5 = hashlib.md5()
    with open(image_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
```
It runs fairly quickly, maybe 2-3 seconds per image? I'm running it en-masse now for the shub analysis and I barely notice it, lol.
Oh, interesting... Pardon my lack of Pythonese, does it run recursively through all the files within image_path?
I think it reads the file in binary chunks (of size 4096) and updates the hashlib md5 object with each new chunk. Then at the end, you get the hexdigest! The with open() context just means the file is closed automatically while reading, versus opening, reading, and then manually closing. This might be more informative:
https://docs.python.org/2/library/hashlib.html
Thoughts? It works pretty quickly!
Ohhhh, this is doing a checksum of the image file itself... No, I was thinking that to do this in a reproducible manner, we need to select certain paths inside the container to hash together and checksum the total. That way we can identify the inside contents. The image itself won't do any good to checksum, we might as well use a random UUID then.
yeah I learned this when I used the function on 100 identical images and got different results, and then :scream:
could you expand on "select certain paths?" Which ones wouldn't be selected?
Well.. once the container is mounted, we could traverse recursively some key directories (perhaps /etc, /bin, /usr, /lib*) and run the files within these paths through a checksumming algorithm, so at the end we get a single checksum that represents the content of the image. Gross, but that would work.
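A hedged sketch of that traversal idea, assuming the container is already mounted at `root` (the `dirsum` name and the default directory list are just for illustration):

```python
import hashlib
import os

def dirsum(root, include=("etc", "bin", "usr")):
    # Walk selected directories under a mounted container root and fold
    # each file's relative path and contents into one running checksum.
    # Timestamps are never read, so rebuilding with identical contents
    # yields the same sum.
    hasher = hashlib.sha256()
    for top in include:
        for dirpath, dirnames, filenames in os.walk(os.path.join(root, top)):
            dirnames.sort()  # deterministic traversal order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    hasher.update(os.path.relpath(path, root).encode())
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(4096), b""):
                            hasher.update(chunk)
    return hasher.hexdigest()
```

This is IO heavy, as noted, but deterministic: any change to a file's path or bytes under the included directories changes the sum.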
hmm, but I would argue the opposite - custom user software is probably not going to be where it's supposed to be, because, you know, we're terrible at that :) For example, I put everything in /code or /data.
I guess it comes down to what are the goals of the content hash?
That's a good point, what are our goals of the content hash? I think I'll defer to @vsoch on that one. lol
I think the content hash is to see if "the guts of my thing are the same as your thing" - the use case I have now is generating and downloading 100 containers that were generated from the same spec file. The md5 sum is different (different timestamps) but I'd want to be able to show that they really are the same! This is relevant for the shub paper which is waiting on me to figure this out, lol.
ok I just came up with a really crude way of doing this, but I think it's a reasonable approach to start with: basically I create a tar archive of an image, and then I can select a subset of files to include in the sum. For example, here I am using the same set of images (that before had different hashes when hashing the entire image file) for JUST files in /bin:
```python
for image_file in image_files:
    os.system('sudo singularity export %s > tmp.tar' %(image_file))
    summy = tarsum('tmp.tar')
    print(summy)
```

```
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
da39a3ee5e6b4b0d3255bfef95601890afd80709
```
voila! Same thing! So, this is super cool because we can actually give the user different levels of reproducibility, depending on how we filter what file/folder paths should be included in the sum. I think this alone, or doing a simple analysis to look across an image and "sniff" which folders are consistent and which are not (e.g., I'd imagine the package lists are NOT, and /tmp is NOT), would be super cool for the shub paper, and very useful.

Arguably, if you download an image that should be exactly the same file, it should have the strictest level. In my example, I'm generating the same image from the same spec but have different image files, so I'd want to be more lenient. I'm going back to sleep now, but I think this is going to be a fun algorithm to think about and implement, and then we can define some optimized set of folders/files for the different cases. Let me know if you have thoughts on this!
oh and here is my rough scratch code for the function; don't worry, I will clean this up and optimize it to probably not iterate over ALL files :)
```python
import hashlib
import tarfile

def include_file(member):
    # tar member names may be prefixed with "." (e.g. "./bin/ls")
    name = "/" + member.name.lstrip("./")
    if name.startswith("/bin"):
        return True
    return False

def tarsum(input_file):
    input_file = open(input_file, "rb")
    tar = tarfile.open(mode="r|*", fileobj=input_file)
    chunk_size = 100*1024
    hasher = hashlib.sha1()
    for member in tar:
        if member.isfile() and include_file(member):
            filey = tar.extractfile(member)
            buf = filey.read(chunk_size)
            while buf:
                hasher.update(buf)
                buf = filey.read(chunk_size)
    return hasher.hexdigest()
```
I also need to think about using sha1 vs md5 vs sha256. The first two are probably imperfect but would work; the latter is better but slower. Ahh... tradeoffs!
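The key property the tar approach relies on is that hashing member contents ignores metadata such as mtime. That can be checked in isolation with a minimal, self-contained sketch (the `make_tar` helper and the file name inside it are hypothetical, purely for the demo):

```python
import hashlib
import io
import tarfile

def content_sum(fileobj):
    # SHA1 over member file contents only; metadata (mtime, uid, ...) is ignored
    hasher = hashlib.sha1()
    tar = tarfile.open(mode="r|*", fileobj=fileobj)
    for member in tar:
        if member.isfile():
            filey = tar.extractfile(member)
            for buf in iter(lambda: filey.read(100 * 1024), b""):
                hasher.update(buf)
    return hasher.hexdigest()

def make_tar(mtime):
    # Build an in-memory tar with one file and a chosen timestamp (demo helper)
    buf = io.BytesIO()
    with tarfile.open(mode="w", fileobj=buf) as tar:
        data = b"echo hello\n"
        info = tarfile.TarInfo(name="./bin/run.sh")
        info.size = len(data)
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

a = content_sum(make_tar(0))
b = content_sum(make_tar(999999))
print(a == b)  # prints True: different timestamps, same content sum
```

A whole-file checksum of the two archives would differ (the mtime lives in the tar headers), which is exactly the failure mode seen with hashing the image file directly.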
Hi,
So I've been creating a checksum for my image (e.g. `shasum -a 256 test.img`), and today I noticed that a small change that I made in the `%runscript` section didn't actually change the checksum. 😱
I'm using version 2.2.1 and would like to know the best way to make a hash for my image. Or is the hash already being provided somewhere?
hey @hisplan! We had originally intended this to be part of Singularity proper, and at one point @gmkurtzer added random uuid generation (not a hash), but I don't think this was developed further. I did more work on these hash sums for a paper that I am working on, but no decision was ever finalized or integrated into the software proper. I came up with the above hoping to add it, but it looks like the issue got lost in the nethers. @gmkurtzer, did you intend to add this for 2.3?
The work I was doing wound up here and for that specific function, I was using md5. Likely for the image we should use sha256. Container comparison is interesting because the "level of reproducibility" someone might be interested in isn't a black and white thing. Is it the same image, exactly down to the file? Was it built from the same spec? Is the runscript the same? These are all important questions that I started work on - see examples of the different levels here.
Same image, same layers, everything exactly the same except one thing: I added 9 additional characters in the `%runscript` section, e.g. from `Rscript xyz.R` to `Rscript --vanilla xyz.R`.
I naturally (naively) thought that this would generate a different hash, but it didn't. How come?
To be more clear, this is not implemented. I don't know what @gmkurtzer is using to generate a unique id, or other, but it isn't related to any of the methods I discussed or showed above.
Right, as a matter of fact, I didn't actually expect Singularity to do anything special. I was just using Linux sha1sum/md5sum to get the checksum of the final Singularity image. I'm wondering why the checksum didn't change even though I did make some changes to the image...
oh I see! Are you making changes to the actual runscript or the link to it? For example, the runscript is actually at `/.singularity.d/runscript` and `/singularity` is just a link. Can you give me something to reproduce your error? What I did is:
```
sha1sum coffee.img
3f6a212a0fc6d018457be2e6f3c8e8197abbbb43  coffee.img

# Here I am making a trivial change to runscript
sudo singularity shell --writable coffee.img
Singularity: Invoking an interactive shell within container...
Singularity coffee.img:~> vim /.singularity.d/runscript
Singularity coffee.img:~> exit
exit

# And now sha1sum is different
vanessa@vanessa-ThinkPad-T460s:~/Desktop$ sha1sum coffee.img
b4cf583b5f3a6bf357d095137e7b5a7c2aa27244  coffee.img
```
It seems to be working for me too:
```
gmk@gmkdev2:~/git/singularity$ shasum /tmp/centos.img
dc73d6a9aad0b21c1a5972d0a49b7392213934f6  /tmp/centos.img
gmk@gmkdev2:~/git/singularity$ sudo singularity exec -w /tmp/centos.img sh -c "echo '' >> /singularity"
gmk@gmkdev2:~/git/singularity$ shasum /tmp/centos.img
65eccc149e6f565ba85509fb35c1e0737cb7bb09  /tmp/centos.img
```
Greg
Okay, so I did some more experiments, and my bad: the checksum does change, but it also changes when I build the same image (with the same content) again. I guess either way, `shasum` against an image file is not suitable for image comparison. I guess this is the reason why you said container comparison is an interesting topic.
Even if a single file's timestamp changes inside a container, the container itself will have an entirely different hash/checksum. And things like timestamps will indeed change from one bootstrap to another, even with the same bootstrap recipe!
yes! This is exactly why I made these different levels of reproducibility: you can generate the "same" image twice and it will be determined to be equal on the level REPLICATE but not IDENTICAL.
Yep, exactly @vsoch! We need to still discuss how best to determine image equality. It is a good question and it needs answering!