Needs test: Context deadline during Pull causes an already exists error from govmomi on a subsequent pull
During a pull of busybox, it looks like a context deadline occurred somewhere on the portlayer side during a WriteImage, leaving the image store in an inconsistent state. On repeating the pull, the layer is found to already exist after a new write is attempted. On a subsequent create, the busybox image is still not found locally and an attempt to pull it is made. The appropriate response on the second pull should have been that the layer already exists.
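To illustrate the suspected sequence, here is a minimal Go sketch (all names are hypothetical, not the actual vic code): the layer's VMDK is created on the datastore, the context deadline fires before the image store records the layer, and the retry then trips over the leftover file.

package main

import (
	"context"
	"fmt"
	"time"
)

var (
	vmdkOnDatastore = map[string]bool{} // stands in for files under [Datastore] VIC/.../images
	layerCommitted  = map[string]bool{} // stands in for the image store's record of completed layers
)

func writeImage(ctx context.Context, layerID string) error {
	if layerCommitted[layerID] {
		return nil // already pulled; this is the answer the second pull should have gotten
	}
	if vmdkOnDatastore[layerID] {
		// inconsistent state: the disk exists but the layer was never committed
		return fmt.Errorf("WriteImage: %s.vmdk already exists", layerID)
	}
	vmdkOnDatastore[layerID] = true // the disk is created first...

	select {
	case <-time.After(10 * time.Millisecond): // ...then the slower extract/commit step
		layerCommitted[layerID] = true
		return nil
	case <-ctx.Done():
		// deadline fires before the commit: the VMDK is left behind
		return ctx.Err()
	}
}

func main() {
	short, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	fmt.Println(writeImage(short, "b05baf071fd5"))                // context deadline exceeded
	fmt.Println(writeImage(context.Background(), "b05baf071fd5")) // already exists (instead of nil)
}

The reproduction transcript follows.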
% sudo ./vic-machine-linux create --target skinner.eng.vmware.com --user root --image-datastore Datastore --password ca\$hc0w --bridge-network vch-test --name mavery-vch-1 --compute-resource /Datacenter/host/Austin/Resources/mavery --volume-store=test-store:Datastore/testStore --generate-cert=false
INFO[2016-07-07T14:26:22-05:00] ### Installing VCH ####
WARN[2016-07-07T14:26:22-05:00] Configuring without TLS - to enable use -generate-cert or -key/-cert parameters
INFO[2016-07-07T14:26:22-05:00] Validating supplied configuration
INFO[2016-07-07T14:26:22-05:00] Firewall status: DISABLED on /Datacenter/host/Austin/patty.eng.vmware.com
INFO[2016-07-07T14:26:22-05:00] Firewall configuration OK on hosts:
INFO[2016-07-07T14:26:22-05:00] /Datacenter/host/Austin/patty.eng.vmware.com
INFO[2016-07-07T14:26:23-05:00] License check OK on hosts:
INFO[2016-07-07T14:26:23-05:00] /Datacenter/host/Austin/patty.eng.vmware.com
INFO[2016-07-07T14:26:23-05:00] DRS check OK on:
INFO[2016-07-07T14:26:23-05:00] /Datacenter/host/Austin/Resources/mavery
INFO[2016-07-07T14:26:27-05:00] Creating Resource Pool mavery-vch-1
INFO[2016-07-07T14:26:27-05:00] Creating directory [Datastore] testStore
INFO[2016-07-07T14:26:27-05:00] Datastore path is [Datastore] testStore
INFO[2016-07-07T14:26:27-05:00] Creating appliance on target
INFO[2016-07-07T14:26:27-05:00] Network role client is sharing NIC with external
INFO[2016-07-07T14:26:27-05:00] Network role management is sharing NIC with external
INFO[2016-07-07T14:26:30-05:00] Uploading images for container
INFO[2016-07-07T14:26:30-05:00] bootstrap.iso
INFO[2016-07-07T14:26:30-05:00] appliance.iso
INFO[2016-07-07T14:26:42-05:00] Registering VCH as a vSphere extension
INFO[2016-07-07T14:26:51-05:00] Waiting for IP information
INFO[2016-07-07T14:28:51-05:00] Waiting for major appliance components to launch
INFO[2016-07-07T14:28:51-05:00] Initialization of appliance successful
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] SSH to appliance (default=root:password)
INFO[2016-07-07T14:28:51-05:00] ssh [email protected]
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] Log server:
INFO[2016-07-07T14:28:51-05:00] http://10.17.109.94:2378
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] DOCKER_HOST=10.17.109.94:2375
INFO[2016-07-07T14:28:51-05:00] DOCKER_OPTS="-H 10.17.109.94:2375"
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] Connect to docker:
INFO[2016-07-07T14:28:51-05:00] docker -H 10.17.109.94:2375 info
INFO[2016-07-07T14:28:51-05:00] Installer completed successfully
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox
a3ed95caeb02: Extracting [==================================================>] 32 B/32 B
8ddc19f16526: Pull complete
Failed to write to image store: context deadline exceeded
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox
8ddc19f16526: Already exists
a3ed95caeb02: Extracting [==================================================>] 32 B/32 B
Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default &{Code:0xc82039f368 Message:Cannot complete the operation because the file or folder [Datastore] VIC/4223eb68-0def-b8f7-2f5c-4d5da6568e9d/images/b05baf071fd542c3146f08e5f20ad63e76fa4b4bd91f274c4838ddc41f3409f8/b05baf071fd542c3146f08e5f20ad63e76fa4b4bd91f274c4838ddc41f3409f8.vmdk already exists}
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
%
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 create -v /testVol busybox /bin/sh
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
8ddc19f16526: Already exists
a3ed95caeb02: Already exists
Digest: sha256:65ce39ce3eb0997074a460adfb568d0b9f0f6a4392d97b6035630c9d7bf92402
Status: Image is up to date for library/busybox:latest
08ac166ac3736a481bc32627e465a8140ce126617edb6d43185b05b83b4a9c3a
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 create -v /testVol busybox /bin/sh
8baba67575c9c26d4b90eb83e61f465629a498214494092730a34354ab3f7a7d
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox
8ddc19f16526: Already exists
a3ed95caeb02: Already exists
Digest: sha256:65ce39ce3eb0997074a460adfb568d0b9f0f6a4392d97b6035630c9d7bf92402
Status: Image is up to date for library/busybox:latest
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
assigning to @hickeng for triage
Assigning @jzt to look into the context timeout - pulls could take a very long time with a slow network connection; this is not an operation that we should time out (unless docker has a timeout here, in which case it should be on the personality side). Putting in an initial estimate for investigation and a fix of the timeout.
#933 is the issue that will address the already exist/re-pull behaviour.
There are currently no context timeouts from the persona to the portlayer, and imagec uses a 3600 second timeout for pulls, so I don't think that's it. There is one in the disk package (5 seconds) that may be where this came from:
https://github.com/vmware/vic/blob/9609f26b2b09861b5f7432a3e68770405af8799b/pkg/vsphere/disk/util.go#L40
This is exercised during the WriteImage path. If you are still seeing these timeouts, this is probably the culprit.
Tagging @fdawg4l in case he has something to add.
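For illustration, a minimal sketch of the pattern that link points at (names are mine, not the actual disk package code): a short, fixed timeout wrapped around the disk operation, which overrides whatever deadline the caller supplied.

package disksketch

import (
	"context"
	"time"
)

// Illustrative only: a short package-level timeout like this surfaces as
// "context deadline exceeded" whenever the vSphere create/attach takes
// longer than the constant allows, no matter how generous the caller's
// own deadline is.
const attachTimeout = 5 * time.Second

// attachDisk stands in for whatever govmomi-backed call creates/attaches the VMDK.
func attachDisk(ctx context.Context, path string, attach func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(ctx, attachTimeout) // inner timeout overrides the caller's
	defer cancel()
	return attach(ctx, path)
}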
Sounds reasonable. We can modify the timeout in the VCH; the time it takes to create the vmdk and attach it to the VCH is completely up to vsphere. I don't know what the upper bound of that operation is on a busy system. @hickeng?
@fdawg4l Almost unbounded I think. We're generally moving to a cancellation based approach in the PL, and the caller (in this case the personality) can furnish timeouts if that's the required semantics.
Sounds good. We'll move to cancellation and nuke the timeout in the disk pkg.
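A rough sketch of what that could look like (hypothetical names, not a patch): the disk layer honours only the caller's context, and any deadline lives with the personality.

package cancelsketch

import (
	"context"
	"time"
)

// attachDisk imposes no timeout of its own; it runs until done or until
// the caller cancels.
func attachDisk(ctx context.Context, path string, attach func(context.Context, string) error) error {
	return attach(ctx, path)
}

// The personality can still bound the whole pull if docker semantics
// require it, e.g. with a deadline matching imagec's 3600-second pull timeout.
func pullLayer(parent context.Context, path string, attach func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(parent, 3600*time.Second)
	defer cancel()
	return attachDisk(ctx, path, attach)
}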
As far as this issue goes, we need two things: 1) prevent more than one pull of the same image at the same time, and 2) clean up when a pull fails.
2 is done. We merged this a little over a month ago. We just need 1 now and to remove the timeout. That is not to say we aren't going to see a timeout somewhere else in the stack - does the docker client have a timeout? Do the handlers in the docker daemon have a default timeout per handler?
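For reference, a minimal sketch of item 1 (hypothetical names, not the imagec implementation), using singleflight to ensure only one write per layer ID is in flight at a time; concurrent callers simply wait for the first one's result.

package pullsketch

import "golang.org/x/sync/singleflight"

var pulls singleflight.Group

// pullLayer coalesces concurrent requests for the same layer ID into a
// single in-flight write; later callers share its result.
func pullLayer(layerID string, writeImage func(string) error) error {
	_, err, _ := pulls.Do(layerID, func() (interface{}, error) {
		return nil, writeImage(layerID)
	})
	return err
}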
We can solve the things mentioned, but largely close the issue since subsequent pulls already work. Timeouts / cancellation in general need their own epics.
#2385
Workaround: Issue the command again and it should work.
This may involve a fair amount of code change, and since there is a workaround this may not be a big issue for 0.8.0. It also occurs rarely, so it might be best if we table this one for now. @mdubya66 @hickeng
@matthewavery, @jzt's recent imagec changes prevent multiple pulls of the same layer id from happening. It's likely this is now fixed.
Already in the release notes, so removing the kind/note tag.
Marking this as a low-priority, estimate-2 issue now, with the assumption that this is already fixed, and making it a test item to ensure we have a test that covers this scenario.
@jzt can you confirm that this is fixed?
@jzt ^^^ let's close if this is fixed
@corrieb @mhagen-vmware I'm down to close. I haven't seen/heard of this happening for quite a long time. @mhagen-vmware do you agree?
#5372 is another instance of this - as per https://github.com/vmware/vic/issues/5372#issuecomment-307493861 we should be deleting incomplete layers so they can be re-pulled. We shouldn't close this without a test that confirms correctness.
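A hedged sketch of that cleanup behaviour (hypothetical names): if the write fails partway through, delete the incomplete layer before returning the error, so a later pull starts clean instead of hitting "already exists".

package cleanupsketch

import "context"

// writeImageWithCleanup removes the partially written layer when the write
// fails, making the pull safely retryable.
func writeImageWithCleanup(ctx context.Context, layerID string,
	write func(context.Context, string) error,
	deleteLayer func(context.Context, string) error) error {

	if err := write(ctx, layerID); err != nil {
		// Best-effort cleanup with a fresh context, since the original one
		// may already be cancelled or past its deadline.
		_ = deleteLayer(context.Background(), layerID)
		return err
	}
	return nil
}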
This looks to be resolved by #5372. Moving from bug to test issue. A test needs to be written to cover this situation.