Avoid multiple downloads for same image
Description
Currently when starting multiple clusters using the same image before the image is cached, we download the same image in parallel. The first download stores the file in the cache and the other downloaded files are dropped.
Possible flow
- if data file exists use it
- create cache directory
- take the cache directory lock
- if data file exists, release the lock and use it
- write the metadata files
- download to temporary file
- rename temporary file to data file
- at this point other downloads can use the data file
- release the cache directory lock
Possible issues
- stale locks - should not be possible since flock (and the Windows locks) are released when the process terminates
- bug when you forget to release the lock after download - should not be possible with lockutil.WithDirLock.
- download takes too much time, maybe hanging on an inaccessible network - should not happen because of socket timeouts. Terminating limactl will abort the download or the wait on the cache directory lock.
@jandubois you were concerned about issues, anything to add?
Somewhat related: https://github.com/lima-vm/lima/issues/1354 I took a quick look at the sources, but my Go-fu is not good enough :(
> @jandubois you were concerned about issues, anything to add?
No, just that we should not time out waiting for the lock; we need to wait indefinitely and rely on the process holding the lock to eventually release it.
Otherwise you will have to start checking for this in scripts running limactl create, and then retry at the script level, and I really would like to avoid that.
It seems buggy now, starting to download both the snapshot and the release at the same time?
INFO[0001] Attempting to download the image arch=x86_64 digest="sha256:0e25ca6ee9f08ec5d4f9910054b66ae7163c6152e81a3e67689d89bd6e4dfa69" location="https://cloud-images.ubuntu.com/releases/24.04/release-20240821/ubuntu-24.04-server-cloudimg-amd64.img"
INFO[0001] Attempting to download the image arch=x86_64 digest= location="https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-amd64.img"
- https://github.com/lima-vm/lima/issues/1722
IIRC the behavior above (attempting to download the snapshot) happens because another process has already created the release file, but the checksum is wrong (because it's still downloading). That's why the second lima process tries the snapshot instead. Not 100% sure though; I only roughly remember something like that.
> another process has already created the release file, but the checksum is wrong (because it's still downloading).
This should not be possible with current code. The data file is created only when the download is completed.
But the checksum file is created only after the data file, so there is a tiny window when the checksum file is missing.