docker exec consistently fails after a docker cp failure.
@hickeng attaching log snippets as zip, please delete after download.
@hickeng is it possible that guest tools is crashing due to a very large cp to/from the container? Since an exec needs tether to reload part of the config to pick up the exec session, a crash there could cause exec to always fail. With limited information that is one of my first thoughts as a possibility; I do not have error messages or logs at the moment.
snipped the log file - stashed in onedrive, ask me for access.
`tether.debug` for `b0a8c8c57552` shows that we had a name collision when copying over the payload:
```
May 2 2018 12:07:15.257+08:00 DEBUG op=1.1: [NewOperation] op=1.1 [vic/lib/tether.toolboxOverrideArchiveRead:279]
May 2 2018 12:07:15.258+08:00 DEBUG op=1.1: Reading from tar archive to path : archive://?disk-label=containerfs&filter-spec=eyJJbmNsdXNpb25zIjp7fSwiRXhjbHVzaW9ucyI6e30sIlJlYmFzZVBhdGgiOiIiLCJTdHJpcFBhdGgiOiIifQ%3D%3D&skip-data=false&skip-recurse=false
May 2 2018 12:07:15.258+08:00 DEBUG op=1.1: decoded spec: {"Inclusions":{},"Exclusions":{},"RebasePath":"","StripPath":""}
May 2 2018 12:07:15.258+08:00 DEBUG op=1.1: unpacking archive to root: /, filter: &{Inclusions:map[] Exclusions:map[] RebasePath: StripPath:}
May 2 2018 12:07:15.258+08:00 DEBUG op=1.1: using FilterSpec : (archive.FilterSpec{Inclusions:map[string]struct {}{}, Exclusions:map[string]struct {}{}, RebasePath:"", StripPath:""})
May 2 2018 12:07:15.258+08:00 DEBUG op=1.1: processing tar header: asset(tmp), size(270632960)
May 2 2018 12:07:15.258+08:00 ERROR op=1.1: Failed to open file /tmp: open /tmp: is a directory
May 2 2018 12:07:15.258+08:00 ERROR op=1.1: open /tmp: is a directory
May 2 2018 12:07:15.258+08:00 DEBUG op=1.1: Finished reading from tar archive to path : archive://?disk-label=containerfs&filter-spec=eyJJbmNsdXNpb25zIjp7fSwiRXhjbHVzaW9ucyI6e30sIlJlYmFzZVBhdGgiOiIiLCJTdHJpcFBhdGgiOiIifQ%3D%3D&skip-data=false&skip-recurse=false
2018/05/02 12:07:15 [hgfs] op=26 error: open /tmp: is a directory
2018/05/02 12:07:15 [hgfs] op=3 error: open /tmp: is a directory
```
This was a JVM tar archive that did not contain a `tmp` entry, so I'm working on the assumption that this is an outer tar wrapping the payload.
Some copies to the cVM may have been attempted prior to adding the Virtual Machine->Guest->Query permission, so this is perhaps related to an inability to get a directory listing for the target path, or similar (pure speculation).
No solid hypothesis about why this would inhibit future execs from succeeding. We would expect log lines such as the following when an exec is issued:
```
May 2 2018 12:08:15.966+08:00 DEBUG [BEGIN] [vic/lib/tether.ReloadConfig:50]
May 2 2018 12:08:15.966+08:00 DEBUG [ END ] [vic/lib/tether.ReloadConfig:50] [77.501µs]
May 2 2018 12:08:15.966+08:00 INFO Reloading tether configuration
May 2 2018 12:08:15.967+08:00 DEBUG [BEGIN] [vic/lib/tether.(*tether).Reload:614]
May 2 2018 12:08:15.967+08:00 DEBUG [ END ] [vic/lib/tether.(*tether).Reload:614] [145.441µs]
May 2 2018 12:08:15.967+08:00 INFO Loading main configuration
May 2 2018 12:08:15.997+08:00 WARN Session b0a8c8c57552d24d08d84db50032794910db324d3403632a40eb385cad47dd5d already initialized
```
However, after this last exec (which occurs after the failed copy, so the cause/effect correlation is uncertain) we never see the "Loading main configuration" log line. The fragment below shows the trigger for the reload after exec, but without the config actually being processed (note the time gap between the `[ END ] ...Reload:614` line and the "renewing IP" message):
```
May 2 2018 14:55:11.220+08:00 DEBUG [BEGIN] [vic/lib/tether.ReloadConfig:50]
May 2 2018 14:55:11.220+08:00 DEBUG [ END ] [vic/lib/tether.ReloadConfig:50] [113.072µs]
May 2 2018 14:55:11.220+08:00 INFO Reloading tether configuration
May 2 2018 14:55:11.221+08:00 DEBUG [BEGIN] [vic/lib/tether.(*tether).Reload:614]
May 2 2018 14:55:11.222+08:00 DEBUG [ END ] [vic/lib/tether.(*tether).Reload:614] [61.082µs]
May 2 2018 14:56:22.051+08:00 INFO renewing IP address for network
```
Possible failure to release lock after copy failure? (pure speculation)
Another possible path for investigation is that we have the following session waiting for interactive attach that never happens and therefore the exec session never unblocks:
```
May 2 2018 12:08:16.156+08:00 INFO Waiting clear signal to launch 28ee85b14c86941361e9ac4b5638a8392371da24c925d9390cf4dff0cf279f72
```
Can we perform a reload while waiting for a clear to launch signal? Confirm expectation by code inspection initially.
Confirmed via code inspection that the issue preventing additional execs is that we're `Waiting clear signal to launch` for session `28ee85...`.
See https://github.com/vmware/vic/blob/320a12a44c205866b5500a610124fc42724fa4f7/lib/tether/tether.go#L477 - we're blocking waiting for the launch to complete.
It should be possible to confirm this very simply with an additional tether unit test (https://github.com/vmware/vic/blob/master/cmd/tether/attach_test.go).
@hickeng Did I get what you mean? Is it `t.launch` failing to get the `ClearToLaunch` signal that prevents `docker exec` from finishing? How does `docker cp` affect this signal?
@malikkal Can you confirm whether this is still occurring with 1.5.0?
@YanzhaoLi The basic issue is that a lock is held and not released. In the specific example we had logs for, that lock was held by an exec session waiting for launch. It's possible that #8098 addressed this if it was this causing the wait - basically the portlayer would never have known it was ok to attach the interactive session.
The other possibility, as noted in https://github.com/vmware/vic/issues/7878#issuecomment-386227801, is that the `docker cp` failure left a lock held that blocked the reload.
@malikkal any updates for George's comment?
@renmaosheng @hickeng Thank you. I didn't have an opportunity to test and reproduce this with 1.5.0 yet. I will update shortly.