vic icon indicating copy to clipboard operation
vic copied to clipboard

consistent docker exec failure after docker cp failures.

Open malikkal opened this issue 6 years ago • 7 comments

docker exec consistently fails after a docker cp failure.

@hickeng attaching log snippets as zip, please delete after download.

malikkal avatar May 02 '18 08:05 malikkal

@hickeng is it possible that guest tools is crashing due to a very large cp to/from the container? since an exec will need tether to reload part of the config to pick up the exec this could cause exec to always fail. With limited information that is one of my first thoughts as a possibility. I do not have error messages or logs at the moment.

matthewavery avatar May 02 '18 20:05 matthewavery

snipped the log file - stashed in onedrive, ask me for access.

tether.debug b0a8c8c57552 shows that we had a name collision when copying over the payload:

May  2 2018 12:07:15.257+08:00 DEBUG op=1.1: [NewOperation] op=1.1 [vic/lib/tether.toolboxOverrideArchiveRead:279]
May  2 2018 12:07:15.258+08:00 DEBUG op=1.1: Reading from tar archive to path : archive://?disk-label=containerfs&filter-spec=eyJJbmNsdXNpb25zIjp7fSwiRXhjbHVzaW9ucyI6e30sIlJlYmFzZVBhdGgiOiIiLCJTdHJpcFBhdGgiOiIifQ%3D%3D&skip-data=false&skip-recurse=false
May  2 2018 12:07:15.258+08:00 DEBUG op=1.1: decoded spec: {"Inclusions":{},"Exclusions":{},"RebasePath":"","StripPath":""}
May  2 2018 12:07:15.258+08:00 DEBUG op=1.1: unpacking archive to root: /, filter: &{Inclusions:map[] Exclusions:map[] RebasePath: StripPath:}
May  2 2018 12:07:15.258+08:00 DEBUG op=1.1: using FilterSpec : (archive.FilterSpec{Inclusions:map[string]struct {}{}, Exclusions:map[string]struct {}{}, RebasePath:"", StripPath:""})
May  2 2018 12:07:15.258+08:00 DEBUG op=1.1: processing tar header: asset(tmp), size(270632960)
May  2 2018 12:07:15.258+08:00 ERROR op=1.1: Failed to open file /tmp: open /tmp: is a directory
May  2 2018 12:07:15.258+08:00 ERROR op=1.1: open /tmp: is a directory
May  2 2018 12:07:15.258+08:00 DEBUG op=1.1: Finished reading from tar archive to path : archive://?disk-label=containerfs&filter-spec=eyJJbmNsdXNpb25zIjp7fSwiRXhjbHVzaW9ucyI6e30sIlJlYmFzZVBhdGgiOiIiLCJTdHJpcFBhdGgiOiIifQ%3D%3D&skip-data=false&skip-recurse=false
2018/05/02 12:07:15 [hgfs] op=26 error: open /tmp: is a directory
2018/05/02 12:07:15 [hgfs] op=3 error: open /tmp: is a directory

This was a JVM tar archive that did not contain a tmp entry, so working on the assumption that this is an outer tar wrapping the payload. This may or may not have had some copies to the cVM attempted prior to having the Virtual Machine->Guest->Query permission added so perhaps related to inability to get directly listing for target path or similar (pure speculation).

No solid hypothesis about why this would inhibit future execs from succeeding. We would expect log lines such as the following when an exec is issued:

May  2 2018 12:08:15.966+08:00 DEBUG [BEGIN]  [vic/lib/tether.ReloadConfig:50]
May  2 2018 12:08:15.966+08:00 DEBUG [ END ]  [vic/lib/tether.ReloadConfig:50] [77.501µs] 
May  2 2018 12:08:15.966+08:00 INFO  Reloading tether configuration
May  2 2018 12:08:15.967+08:00 DEBUG [BEGIN]  [vic/lib/tether.(*tether).Reload:614]
May  2 2018 12:08:15.967+08:00 DEBUG [ END ]  [vic/lib/tether.(*tether).Reload:614] [145.441µs] 
May  2 2018 12:08:15.967+08:00 INFO  Loading main configuration
May  2 2018 12:08:15.997+08:00 WARN  Session b0a8c8c57552d24d08d84db50032794910db324d3403632a40eb385cad47dd5d already initialized

however after this last exec (which occurs after the failed copy so uncertain cause/effect correlation) we never see the log about "Loading main configuration". The fragment below shows the trigger for the reload after exec, but without actually processing the config (see the time gap between [END]...Reload:614 and the renewing IP message):

May  2 2018 14:55:11.220+08:00 DEBUG [BEGIN]  [vic/lib/tether.ReloadConfig:50]
May  2 2018 14:55:11.220+08:00 DEBUG [ END ]  [vic/lib/tether.ReloadConfig:50] [113.072µs] 
May  2 2018 14:55:11.220+08:00 INFO  Reloading tether configuration
May  2 2018 14:55:11.221+08:00 DEBUG [BEGIN]  [vic/lib/tether.(*tether).Reload:614]
May  2 2018 14:55:11.222+08:00 DEBUG [ END ]  [vic/lib/tether.(*tether).Reload:614] [61.082µs] 
May  2 2018 14:56:22.051+08:00 INFO  renewing IP address for network 

Possible failure to release lock after copy failure? (pure speculation)

Another possible path for investigation is that we have the following session waiting for interactive attach that never happens and therefore the exec session never unblocks:

May  2 2018 12:08:16.156+08:00 INFO  Waiting clear signal to launch 28ee85b14c86941361e9ac4b5638a8392371da24c925d9390cf4dff0cf279f72

Can we perform a reload while waiting for a clear to launch signal? Confirm expectation by code inspection initially.

hickeng avatar May 03 '18 08:05 hickeng

Confirmed via code inspection that the issue here preventing additional execs is that we're Waiting clear signal to launch for session 28ee85.... See https://github.com/vmware/vic/blob/320a12a44c205866b5500a610124fc42724fa4f7/lib/tether/tether.go#L477 - we're blocking waiting for the launch to complete.

It should be possible to confirm this very simply with an additional tether unit test (https://github.com/vmware/vic/blob/master/cmd/tether/attach_test.go).

hickeng avatar May 03 '18 08:05 hickeng

@hickeng , Did I get what you mean? Is t.launch failing to get ClearToLaunch signal that prevents docker exec finish ? How does docker cp affect this signal?

YanzhaoLi avatar Feb 21 '19 10:02 YanzhaoLi

@malikkal Can you confirm if this is still occurring with 1.5.0.

@YanzhaoLi The basic issue is that a lock is held and not released. In the specific example we had logs for, that lock was held by an exec session waiting for launch. It's possible that #8098 addressed this if it was this causing the wait - basically the portlayer would never have known it was ok to attach the interactive session.

The other possibility, as noted in https://github.com/vmware/vic/issues/7878#issuecomment-386227801 is that the docker cp failure left a lock held that blocked the reload.

hickeng avatar Feb 21 '19 20:02 hickeng

@malikkal any updates for George's comment?

renmaosheng avatar Mar 11 '19 09:03 renmaosheng

@renmaosheng @hickeng Thank you. I didnt have an opportunity to test and reproduce this in 1.5.0 yet. I will update shortly.

malikkal avatar Mar 11 '19 15:03 malikkal