vic icon indicating copy to clipboard operation
vic copied to clipboard

ci test 6-15-SysLog failed due to inconsistent pid for docker-engine-server

Open chengwang86 opened this issue 8 years ago • 11 comments

Seen in ci builds 11668, 11672 and 11673.

Test-Cases.Group6-VIC-Machine.6-15-Syslog-VCH-11668-3847-container-logs.zip

In that test, this line https://github.com/vmware/vic/blob/master/tests/test-cases/Group6-VIC-Machine/6-15-Syslog.robot#L57 obtained the pid of docker-engine-server as 286. However, later on when checking the syslog here , the syslog showed the following:

Jun 18 05:49:01 192.168.31.181 docker-engine-server[288]: Unable to load CAs for registry access in config
Jun 18 05:49:01 192.168.31.181 docker-engine-server[288]: Waiting for portlayer to come up
Jun 18 05:49:03 192.168.31.181 docker-engine-server[288]: Portlayer is up and responding to pings
Jun 18 05:49:03 192.168.31.181 docker-engine-server[288]: Refreshing repository cache

So the pids obtained from the syslog and from ps -C docker-engine-server -o pid= (L57) are inconsistent.

chengwang86 avatar Jun 18 '17 17:06 chengwang86

@hmahmood Any thoughts?

chengwang86 avatar Jun 18 '17 17:06 chengwang86

Also seen in https://ci.vcna.io/vmware/vic/11668.

anchal-agrawal avatar Jun 18 '17 19:06 anchal-agrawal

I think this is just a race. All the tests use the same syslog server, so the mismatch is because the logs are from another VCH. I think just adding retries to this test should fix this.

hmahmood avatar Jun 18 '17 20:06 hmahmood

Many tests are now failing this test. I don't know that it's the same because there's no syslog output retrieved rather than not finding the expected lines: https://ci.vcna.io/vmware/vic/12976

We do get a message that we connect to the syslog server: time="2017-08-04T22:43:01Z" level=info msg="successfully connected to syslog server" and we do not get overflow messages which I'd expect if they weren't being sent.

Last entry in the syslog log is from: Aug 2 02:05:37 Photon vic-init[185]: and that log has rolled over to syslog.1 which explains why we're getting an empty response back.

It does imply that the log message may need to include VCH name and the test should check for it - we do restrict by IP but that's going to be frequently reused in CI.

Other items of note:

  1. vic-init is horrifically verbose when run with debug levels - we need to revisit at what log levels it duplicates the log output of the sessions, and when that duplication goes to anywhere except the serial logging.
  2. vic-init never updates the IP address for logging with so always logs with Photon for all clients. This is going to cause significant confusion.

hickeng avatar Aug 06 '17 04:08 hickeng

Using a direct command to log to the syslog service also failed. No evidence of anything wrong in the journalctl unit logs, systemd thinks everything is OK.

Restart of service and retried the logging and it shows up in the syslog as expected. Command used to verify:

logger -n 192.168.31.25 -t ghicken-test -p user.notice "Direct manual test of status"

I do not believe this is due to rate limiting as all remote logging failed to show after the Aug 2nd log mentioned in the previous comment.

hickeng avatar Aug 06 '17 18:08 hickeng

https://ci.vcna.io/vmware/vic/13830 https://ci.vcna.io/vmware/vic/13833

anchal-agrawal avatar Sep 27 '17 03:09 anchal-agrawal

PID mismatch seen in:

https://ci.vcna.io/vmware/vic/16123 https://ci.vcna.io/vmware/vic/16126 https://ci.vcna.io/vmware/vic/16128

anchal-agrawal avatar Feb 07 '18 23:02 anchal-agrawal

I recommend updating the test to filter by IP address (or we update syslog support to log with VCH name). Can also include a date range (see here for use of system time from the VCH).

Just as bad as what we're seeing here would be the fact that there's likely to be similar PIDs between VCHs because there's so little that runs in the endpointVMs. We do not want to see old data and think it's good.

hickeng avatar Feb 08 '18 15:02 hickeng

Hasnt been seen since move to HaaS. Lowering priority to p2, check for pid collision.

sgairo avatar Mar 28 '18 18:03 sgairo

Seen in https://ci-vic.vmware.com/vmware/vic/19918/7

We logged into the endpointVM successfully but for some reason the remote execution of the ps command failed to provide the expected output. We do not check that the command succeeds and therefore look for a pattern in the syslog output that does not include the pid - unsurprisingly we do not find a match.

Logging into '10.158.214.11:22' as 'root'.
Read output: Warning: your password will expire in 0 days [1;31mroot@ [ [0m~[1;31m ]# [0m
...
Executing command 'ps -C docker-engine-server -o pid='. 
Command exited with return code 1.
${pid} = 

hickeng avatar Aug 30 '18 14:08 hickeng

Most recent comment https://github.com/vmware/vic/issues/5522#issuecomment-417350365 has been resolved with #8251 (#8250)

hickeng avatar Sep 05 '18 18:09 hickeng