Needs test: Context deadline during Pull causes an already exists error from govmomi on a subsequent pull
During a pull of busybox, it looks like a context deadline occurred somewhere on the portlayer side during a WriteImage, leaving the image store in an inconsistent state. On repeating the pull, the layer is found to already exist after a new write is attempted. On a subsequent create, the busybox image is still not found locally and an attempt to pull it is made. The appropriate response on the second pull should have been that the layer already exists.
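To illustrate the suspected sequence, here is a minimal Go sketch (all names are hypothetical, not the actual vic code): the layer's VMDK is created on the datastore, the context deadline fires before the image store records the layer, and the retry then trips over the leftover file.

package main

import (
	"context"
	"fmt"
	"time"
)

var (
	vmdkOnDatastore = map[string]bool{} // stands in for files under [Datastore] VIC/.../images
	layerCommitted  = map[string]bool{} // stands in for the image store's record of completed layers
)

func writeImage(ctx context.Context, layerID string) error {
	if layerCommitted[layerID] {
		return nil // already pulled; this is the answer the second pull should have gotten
	}
	if vmdkOnDatastore[layerID] {
		// inconsistent state: the disk exists but the layer was never committed
		return fmt.Errorf("WriteImage: %s.vmdk already exists", layerID)
	}
	vmdkOnDatastore[layerID] = true // the disk is created first...

	select {
	case <-time.After(10 * time.Millisecond): // ...then the slower extract/commit step
		layerCommitted[layerID] = true
		return nil
	case <-ctx.Done():
		// deadline fires before the commit: the VMDK is left behind
		return ctx.Err()
	}
}

func main() {
	short, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	fmt.Println(writeImage(short, "b05baf071fd5"))                // context deadline exceeded
	fmt.Println(writeImage(context.Background(), "b05baf071fd5")) // already exists (instead of nil)
}

The reproduction transcript follows.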
% sudo ./vic-machine-linux create --target skinner.eng.vmware.com --user root --image-datastore Datastore --password ca\$hc0w --bridge-network vch-test --name mavery-vch-1 --compute-resource /Datacenter/host/Austin/Resources/mavery --volume-store=test-store:Datastore/testStore --generate-cert=false
INFO[2016-07-07T14:26:22-05:00] ### Installing VCH ####
WARN[2016-07-07T14:26:22-05:00] Configuring without TLS - to enable use -generate-cert or -key/-cert parameters
INFO[2016-07-07T14:26:22-05:00] Validating supplied configuration
INFO[2016-07-07T14:26:22-05:00] Firewall status: DISABLED on /Datacenter/host/Austin/patty.eng.vmware.com
INFO[2016-07-07T14:26:22-05:00] Firewall configuration OK on hosts:
INFO[2016-07-07T14:26:22-05:00] /Datacenter/host/Austin/patty.eng.vmware.com
INFO[2016-07-07T14:26:23-05:00] License check OK on hosts:
INFO[2016-07-07T14:26:23-05:00] /Datacenter/host/Austin/patty.eng.vmware.com
INFO[2016-07-07T14:26:23-05:00] DRS check OK on:
INFO[2016-07-07T14:26:23-05:00] /Datacenter/host/Austin/Resources/mavery
INFO[2016-07-07T14:26:27-05:00] Creating Resource Pool mavery-vch-1
INFO[2016-07-07T14:26:27-05:00] Creating directory [Datastore] testStore
INFO[2016-07-07T14:26:27-05:00] Datastore path is [Datastore] testStore
INFO[2016-07-07T14:26:27-05:00] Creating appliance on target
INFO[2016-07-07T14:26:27-05:00] Network role client is sharing NIC with external
INFO[2016-07-07T14:26:27-05:00] Network role management is sharing NIC with external
INFO[2016-07-07T14:26:30-05:00] Uploading images for container
INFO[2016-07-07T14:26:30-05:00] bootstrap.iso
INFO[2016-07-07T14:26:30-05:00] appliance.iso
INFO[2016-07-07T14:26:42-05:00] Registering VCH as a vSphere extension
INFO[2016-07-07T14:26:51-05:00] Waiting for IP information
INFO[2016-07-07T14:28:51-05:00] Waiting for major appliance components to launch
INFO[2016-07-07T14:28:51-05:00] Initialization of appliance successful
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] SSH to appliance (default=root:password)
INFO[2016-07-07T14:28:51-05:00] ssh [email protected]
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] Log server:
INFO[2016-07-07T14:28:51-05:00] http://10.17.109.94:2378
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] DOCKER_HOST=10.17.109.94:2375
INFO[2016-07-07T14:28:51-05:00] DOCKER_OPTS="-H 10.17.109.94:2375"
INFO[2016-07-07T14:28:51-05:00]
INFO[2016-07-07T14:28:51-05:00] Connect to docker:
INFO[2016-07-07T14:28:51-05:00] docker -H 10.17.109.94:2375 info
INFO[2016-07-07T14:28:51-05:00] Installer completed successfully
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox
a3ed95caeb02: Extracting [==================================================>] 32 B/32 B
8ddc19f16526: Pull complete
Failed to write to image store: context deadline exceeded
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox
8ddc19f16526: Already exists
a3ed95caeb02: Extracting [==================================================>] 32 B/32 B
Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default &{Code:0xc82039f368 Message:Cannot complete the operation because the file or folder [Datastore] VIC/4223eb68-0def-b8f7-2f5c-4d5da6568e9d/images/b05baf071fd542c3146f08e5f20ad63e76fa4b4bd91f274c4838ddc41f3409f8/b05baf071fd542c3146f08e5f20ad63e76fa4b4bd91f274c4838ddc41f3409f8.vmdk already exists}
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
%
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 create -v /testVol busybox /bin/sh
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
8ddc19f16526: Already exists
a3ed95caeb02: Already exists
Digest: sha256:65ce39ce3eb0997074a460adfb568d0b9f0f6a4392d97b6035630c9d7bf92402
Status: Image is up to date for library/busybox:latest
08ac166ac3736a481bc32627e465a8140ce126617edb6d43185b05b83b4a9c3a
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 create -v /testVol busybox /bin/sh
8baba67575c9c26d4b90eb83e61f465629a498214494092730a34354ab3f7a7d
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
% docker -H 10.17.109.94:2375 pull busybox
Using default tag: latest
latest: Pulling from library/busybox
8ddc19f16526: Already exists
a3ed95caeb02: Already exists
Digest: sha256:65ce39ce3eb0997074a460adfb568d0b9f0f6a4392d97b6035630c9d7bf92402
Status: Image is up to date for library/busybox:latest
matthewavery@ubuntu ~/go/src/github.com/vmware/vic/bin
assigning to @hickeng for triage
Assigning @jzt to look into the context timeout - pulls could take a very long time with a slow network connection; this is not an operation that we should time out (unless docker has a timeout here, in which case it should be on the personality side). Putting in an initial estimate for investigation and a fix of the timeout.
#933 is the issue that will address the already exist/re-pull behaviour.
There are currently no context timeouts from the persona to the portlayer, and imagec uses a 3600 second timeout for pulls, so I don't think that's it. There is one in the disk package (5 seconds) that may be where this came from:
https://github.com/vmware/vic/blob/9609f26b2b09861b5f7432a3e68770405af8799b/pkg/vsphere/disk/util.go#L40
This is exercised during the WriteImage path. If you are still seeing these timeouts, this is probably the culprit.
Tagging @fdawg4l in case he has something to add.
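For illustration, a minimal sketch of the pattern that link points at (names are mine, not the actual disk package code): a short, fixed timeout wrapped around the disk operation, which overrides whatever deadline the caller supplied.

package disksketch

import (
	"context"
	"time"
)

// Illustrative only: a short package-level timeout like this surfaces as
// "context deadline exceeded" whenever the vSphere create/attach takes
// longer than the constant allows, no matter how generous the caller's
// own deadline is.
const attachTimeout = 5 * time.Second

// attachDisk stands in for whatever govmomi-backed call creates/attaches the VMDK.
func attachDisk(ctx context.Context, path string, attach func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(ctx, attachTimeout) // inner timeout overrides the caller's
	defer cancel()
	return attach(ctx, path)
}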
Sounds reasonable. We can modify the timeout in the VCH; the time it takes to create the vmdk and attach it to the VCH is completely up to vsphere. I don't know what the upper bound of that operation is on a busy system. @hickeng?
@fdawg4l Almost unbounded I think. We're generally moving to a cancellation based approach in the PL, and the caller (in this case the personality) can furnish timeouts if that's the required semantics.
Sounds good. We'll move to cancellation and nuke the timeout in the disk pkg.
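A rough sketch of what that could look like (hypothetical names, not a patch): the disk layer honours only the caller's context, and any deadline lives with the personality.

package cancelsketch

import (
	"context"
	"time"
)

// attachDisk imposes no timeout of its own; it runs until done or until
// the caller cancels.
func attachDisk(ctx context.Context, path string, attach func(context.Context, string) error) error {
	return attach(ctx, path)
}

// The personality can still bound the whole pull if docker semantics
// require it, e.g. with a deadline matching imagec's 3600-second pull timeout.
func pullLayer(parent context.Context, path string, attach func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(parent, 3600*time.Second)
	defer cancel()
	return attachDisk(ctx, path, attach)
}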
As far as this issue goes, we need two things: 1) prevent more than one pull of the same image at the same time, and 2) clean up when a pull fails.
2 is done. We merged this a little over a month ago. We just need 1 now and to remove the timeout. That is not to say we aren't going to see a timeout somewhere else in the stack - does the docker client have a timeout? Do the handlers in the docker daemon have a default timeout per handler?
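For reference, a minimal sketch of item 1 (hypothetical names, not the imagec implementation), using singleflight to ensure only one write per layer ID is in flight at a time; concurrent callers simply wait for the first one's result.

package pullsketch

import "golang.org/x/sync/singleflight"

var pulls singleflight.Group

// pullLayer coalesces concurrent requests for the same layer ID into a
// single in-flight write; later callers share its result.
func pullLayer(layerID string, writeImage func(string) error) error {
	_, err, _ := pulls.Do(layerID, func() (interface{}, error) {
		return nil, writeImage(layerID)
	})
	return err
}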
We can solve the things mentioned, but largely close the issue since subsequent pulls already work. Timeouts / cancellation in general need their own epics.
#2385
Workaround: Issue the command again and it should work.
This may involve a fair amount of code change, and since there is a workaround this may not be a big issue for 0.8.0. It also occurs rarely, so it might be best if we table this one for now. @mdubya66 @hickeng
@matthewavery, @jzt's recent imagec changes prevent multiple pulls of the same layer id from happening. It's likely this is now fixed.
Already in the release notes, so removing the kind/note tag.
Marking this as a low-priority, estimate-2 issue now, with the assumption that this is already fixed, and making it a test item to ensure we have a test that covers this scenario.
@jzt can you confirm that this is fixed?
@jzt ^^^ let's close if this is fixed
@corrieb @mhagen-vmware I'm down to close. I haven't seen/heard of this happening for quite a long time. @mhagen-vmware do you agree?
#5372 is another instance of this - as per https://github.com/vmware/vic/issues/5372#issuecomment-307493861 we should be deleting incomplete layers so they can be re-pulled. We shouldn't close this without a test that confirms correctness.
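A hedged sketch of that cleanup behaviour (hypothetical names): if the write fails partway through, delete the incomplete layer before returning the error, so a later pull starts clean instead of hitting "already exists".

package cleanupsketch

import "context"

// writeImageWithCleanup removes the partially written layer when the write
// fails, making the pull safely retryable.
func writeImageWithCleanup(ctx context.Context, layerID string,
	write func(context.Context, string) error,
	deleteLayer func(context.Context, string) error) error {

	if err := write(ctx, layerID); err != nil {
		// Best-effort cleanup with a fresh context, since the original one
		// may already be cancelled or past its deadline.
		_ = deleteLayer(context.Background(), layerID)
		return err
	}
	return nil
}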
This looks to be resolved by #5372. Moving from bug to test issue. A test needs to be written to cover this situation.