TeamcityDockerCloudPlugin

Agents stuck

Open sharov opened this issue 8 years ago • 15 comments

Sometimes agents get stuck. I can't pin down the exact scenario to reproduce this issue, but I think it might happen after changing agent settings (Terminate instance idle time, Docker image, Maximum instance count, etc.). It looks like this: [screenshot: 2017-02-09 9:42:08]

Restarting TeamCity fixes the issue, but that isn't a proper solution.

P.S. I've tried different versions of the plugin (0.2.3, 0.3.3).

sharov avatar Feb 10 '17 10:02 sharov

By "stuck" you mean that agents are no longer started to process build jobs?

This commonly happens when you run out of agent licenses. The next time it happens, could you check the number of licenses left in the "Agents" -> "Unauthorized" tab? In some situations it may take a while for licenses to be reclaimed from stopped agents (it's not entirely clear why).

Are containers configured to be removed on exit?

Your teamcity-clouds.log file may be helpful as well to diagnose the issue further.

JeanRev avatar Feb 10 '17 10:02 JeanRev

By "stuck" you mean that agents are no longer started to process build jobs?

yes

The next time it happens, could you check the number of licenses left in the "Agents" -> "Unauthorized" tab?

As you can see in the screenshot, I don't have any unauthorized agents.

Are containers configured to be removed on exit?

only one of the three

sharov avatar Feb 10 '17 10:02 sharov

Hmm, yes, you don't have any unauthorized agents, but this tab should also display the number of licenses left, for example: [screenshot: licenses]

JeanRev avatar Feb 10 '17 10:02 JeanRev

But, "The disconnected agent will be removed from the authorized agents list and deleted from the system to free up TeamCity build agent licenses."

from here https://confluence.jetbrains.com/display/TCD10/TeamCity+Integration+with+Cloud+Solutions

So it should not be a licence issue.

I hope logs can help.

sharov avatar Feb 10 '17 11:02 sharov

In theory yes, agent licenses should be freed as soon as cloud agents are discarded.

What I personally experienced, however, is that under some conditions agent licenses are not immediately returned to the license pool. If the number of agent licenses left is down to 0, that would definitely explain why no more agents are started...

JeanRev avatar Feb 10 '17 11:02 JeanRev

The issue happened again. teamcity-clouds.log is empty (I had removed it before). In teamcity-server.log I found these lines:

[2017-02-10 12:42:28,699]   INFO - tbrains.buildServer.ACTIVITIES - Build BUILD_PROMOTION{id=19764} was added to queue by "'oleg' {id=3}" 
[2017-02-10 12:42:28,716]   INFO - tbrains.buildServer.ACTIVITIES - Build added to queue; Queued build {Build promotion {promotion id=19764, configuration={id=Deploy_ExperimentDeploy, internal id=bt11}, queued}, triggered by "Oleg"} 
[2017-02-10 12:42:55,297]   INFO -    jetbrains.buildServer.AGENT - Removing authorization token '' from agent with no name: "" {id=0, protocol=unidirectional, host=172.17.0.2:9090, agentTypeId=0, registered since 2017-02-10 12:42:55.295} 
[2017-02-10 12:42:55,297]   INFO -    jetbrains.buildServer.AGENT - Did not find agent by its authorization token '' and will search by name, agent details: "" {id=0}. 
[2017-02-10 12:42:55,302]   INFO -    jetbrains.buildServer.AGENT - Did not find agent by name, agent details: "" {id=0} 
[2017-02-10 12:42:55,303]   INFO -    jetbrains.buildServer.AGENT - Setting generated name "ip_172.17.0.2" to agent "" {id=2481} as it had no name defined. 
[2017-02-10 12:42:55,304]   INFO -    jetbrains.buildServer.AGENT - New agent authorization token "f757cd7094394823812b0f451999b7a2" was generated for agent "ip_172.17.0.2" {id=2481} 
[2017-02-10 12:42:55,305]   INFO -    jetbrains.buildServer.AGENT - Upgrade of agent "ip_172.17.0.2" {id=2481} has been scheduled, agent version: 42434, plugins signature: NA. Current version: 42434, plugins signature: 42434-md5-c49e4f3d5b45f9b4af7bbf1f058d3140 
[2017-02-10 12:42:55,306]   INFO -    jetbrains.buildServer.AGENT - Agent has been registered: "ip_172.17.0.2" {id=2481, protocol=unidirectional, host=172.17.0.2:9090, agentTypeId=31, pool=Auto Deploy, registered since 2017-02-10 12:42:55.295}, not running a build 
[2017-02-10 12:42:55,310]   INFO - tbrains.buildServer.ACTIVITIES - Agent ""ip_172.17.0.2" {id=2481}" was authorized with comment "Virtual agent is authorized automatically." 
[2017-02-10 12:42:55,311]   INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'Ur2vw84...' was created, agent: "romantic_payne/ip_172.17.0.2" {id=2481} 
[2017-02-10 12:42:55,856]   INFO -    jetbrains.buildServer.AGENT - Agent "romantic_payne/ip_172.17.0.2" {id=2481} has confirmed upgrade. 
[2017-02-10 12:42:55,887]   INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'Ur2vw84...' was closed, agent: "romantic_payne/ip_172.17.0.2" {id=2481} 
[2017-02-10 12:42:55,888]   INFO -    jetbrains.buildServer.AGENT - Agent "romantic_payne/ip_172.17.0.2" {id=2481} has been unregistered. Reason: Agent has unregistered (will upgrade) 
[2017-02-10 12:44:17,958]   INFO -    jetbrains.buildServer.AGENT - Removing authorization token '' from agent with no name: "" {id=0, protocol=unidirectional, host=172.17.0.2:9090, agentTypeId=0, registered since 2017-02-10 12:44:17.957} 
[2017-02-10 12:44:17,958]   INFO -    jetbrains.buildServer.AGENT - Did not find agent by its authorization token '' and will search by name, agent details: "" {id=0}. 
[2017-02-10 12:44:17,959]   INFO -    jetbrains.buildServer.AGENT - Did not find agent by name, agent details: "" {id=0} 
[2017-02-10 12:44:17,959]   INFO -    jetbrains.buildServer.AGENT - Setting generated name "ip_172.17.0.2-1" to agent "" {id=2482} as it had no name defined. 
[2017-02-10 12:44:17,961]   INFO -    jetbrains.buildServer.AGENT - New agent authorization token "881e08dbf48240dadeb0a9fd8784ebaf" was generated for agent "ip_172.17.0.2-1" {id=2482} 
[2017-02-10 12:44:17,962]   INFO -    jetbrains.buildServer.AGENT - Upgrade of agent "ip_172.17.0.2-1" {id=2482} has been scheduled, agent version: 42434, plugins signature: NA. Current version: 42434, plugins signature: 42434-md5-c49e4f3d5b45f9b4af7bbf1f058d3140 
[2017-02-10 12:44:17,966]   INFO -    jetbrains.buildServer.AGENT - Agent has been registered: "ip_172.17.0.2-1" {id=2482, protocol=unidirectional, host=172.17.0.2:9090, agentTypeId=31, pool=Auto Deploy, registered since 2017-02-10 12:44:17.957}, not running a build 
[2017-02-10 12:44:17,976]   INFO - tbrains.buildServer.ACTIVITIES - Agent ""ip_172.17.0.2-1" {id=2482}" was authorized with comment "Virtual agent is authorized automatically." 
[2017-02-10 12:44:17,977]   INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'mKk4I5m...' was created, agent: "condescending_wilson/ip_172.17.0.2-1" {id=2482} 
[2017-02-10 12:44:18,530]   INFO -    jetbrains.buildServer.AGENT - Agent "condescending_wilson/ip_172.17.0.2-1" {id=2482} has confirmed upgrade. 
[2017-02-10 12:44:18,555]   INFO - ldServer.AGENT.PollingProtocol - Polling Agent session 'mKk4I5m...' was closed, agent: "condescending_wilson/ip_172.17.0.2-1" {id=2482} 
[2017-02-10 12:44:18,556]   INFO -    jetbrains.buildServer.AGENT - Agent "condescending_wilson/ip_172.17.0.2-1" {id=2482} has been unregistered. Reason: Agent has unregistered (will upgrade) 
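The pattern above, an agent registering, confirming an upgrade, and then immediately unregistering, repeats for each container. A rough way to spot-check whether registrations are balanced by unregistrations is to count both events in the log. This sketch is not part of the plugin; the patterns are simply keyed to the log line formats quoted above:

```python
import re

def count_agent_events(log_text):
    """Count agent registration/unregistration events in a
    teamcity-server.log excerpt (line format as quoted above)."""
    registered = len(re.findall(r"Agent has been registered", log_text))
    unregistered = len(re.findall(r"has been unregistered\. Reason", log_text))
    return registered, unregistered

# Two lines taken from the excerpt above, abbreviated:
sample = """\
[2017-02-10 12:42:55,306] INFO - jetbrains.buildServer.AGENT - Agent has been registered: "ip_172.17.0.2" {id=2481}
[2017-02-10 12:42:55,888] INFO - jetbrains.buildServer.AGENT - Agent "romantic_payne/ip_172.17.0.2" {id=2481} has been unregistered. Reason: Agent has unregistered (will upgrade)
"""
print(count_agent_events(sample))  # -> (1, 1): each registration has a matching unregistration
```

If the two counts diverge over a longer log window, agents are being left registered without ever disconnecting, which would be consistent with licenses not being reclaimed.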

[screenshots: 2017-02-10 16:54:36, 16:54:28, 16:54:17]

sharov avatar Feb 10 '17 12:02 sharov

Maybe it's related to https://youtrack.jetbrains.com/issue/TW-47747. I'll try to upgrade TeamCity.

sharov avatar Feb 10 '17 13:02 sharov

Thank you for the log file.

I see nothing suspicious in it... The screenshot of the "Unauthorized" agents tab, however, tends to confirm the initial assumption that you do not have any agent licenses left; the remaining number of licenses would otherwise be shown (I double-checked this directly in the JSP fragment used to render this page). For some reason, the used licenses are not being reclaimed.

Upgrading your TC installation can't hurt. I'll see in the meantime if I can reproduce the issue.

JeanRev avatar Feb 11 '17 12:02 JeanRev

I've upgraded TeamCity to version 10.0.4 and installed the latest version of the plugin. Agents got stuck again :(

teamcity-clouds.log:

[2017-02-14 07:58:59,268]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud agent "sad_ride/ip_172.17.0.2" {id=2835} is no longer recognized as cloud agent because instance has gone: profileId=VRDC-2, imageId=teamcity_deploy, instanceId=5047aabb-a7d9-4e3a-a30d-708c0571a855 
[2017-02-14 07:58:59,268]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance has gone (is not reported by cloud profile): profileId=VRDC-2, imageId=teamcity_deploy, instanceId=5047aabb-a7d9-4e3a-a30d-708c0571a855 
[2017-02-14 08:00:21,910]   INFO [Ping Executor 1] - ver.instances.AgentRemoverImpl - Removing cloud agent "sad_ride/ip_172.17.0.2" {id=2835} because: Cloud instance (id=5047aabb-a7d9-4e3a-a30d-708c0571a855, imageId=teamcity_deploy, profileId=VRDC-2) has gone (is not reported by cloud profile) 
[2017-02-14 08:00:23,099]   INFO [3 Flush Queue 1] - .instances.StartInstanceAction - Starting cloud instance: profile 'cloud_docker_agent_1'{id=VRDC-1}, teamcity_agent_python_35 (sharov/teamcity-agent:t2) / b644aefc-7eda-44e3-8501-7565acfb1b90, hash=Be1n2A5wClrDpFH2eWLoeqxkn54y9ppn, reason=Automatic start: Compatible with queued build Tests :: Production Integration Tests(promotionId=20548) 
[2017-02-14 08:00:23,099]   INFO [3 Flush Queue 1] - ocker.DefaultDockerCloudClient - Creating new instance from image: teamcity_agent_python_35 (sharov/teamcity-agent:t2) / b644aefc-7eda-44e3-8501-7565acfb1b90 
[2017-02-14 08:00:23,099]   INFO [3 Flush Queue 1] - ocker.DefaultDockerCloudClient - Reusing cloud instance c37b1581-439a-49ea-9582-302cd2eda2ea. 
[2017-02-14 08:00:23,099]   INFO [erTaskScheduler] - ocker.DefaultDockerCloudClient - Starting container c37b1581-439a-49ea-9582-302cd2eda2ea 
[2017-02-14 08:00:23,099]   INFO [erTaskScheduler] - ocker.DefaultDockerCloudClient - Reusing existing container: 99e17a214f71945a07d8cea0c3e86019b39f0d02c70b0663444c944c95d2da5e 
[2017-02-14 08:00:23,100]   WARN [3 Flush Queue 1] - .instances.StoppedInstanceTask - Instance has changed status from stopped to Scheduled to start: hopeful_rosalind, profile 'cloud_docker_agent_1'{id=VRDC-1} 
[2017-02-14 08:00:23,100]   INFO [3 Flush Queue 1] - .server.impl.CloudEventsLogger - Cloud instance start succeeded: profile 'cloud_docker_agent_1'{id=VRDC-1}, hopeful_rosalind 
[2017-02-14 08:00:24,043]   INFO [erTaskScheduler] - ocker.DefaultDockerCloudClient - Container 99e17a214f71945a07d8cea0c3e86019b39f0d02c70b0663444c944c95d2da5e started. 
[2017-02-14 08:00:24,043]   INFO [erTaskScheduler] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35, Instance: c37b1581-439a-49ea-9582-302cd2eda2ea, profile=VRDC-1 is marked with state: running. 
[2017-02-14 08:00:24,289]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'running' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, hopeful_rosalind 
[2017-02-14 08:00:43,291]   INFO [nio-8111-exec-8] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35 (sharov/teamcity-agent:t2) / b644aefc-7eda-44e3-8501-7565acfb1b90, profile: profile 'cloud_docker_agent_1'{id=VRDC-1} was marked to CONTAIN agent 
[2017-02-14 08:00:43,292]   INFO [nio-8111-exec-8] - .server.impl.CloudEventsLogger - Detected cloud agent "ip_172.17.0.2" {id=2838}, profile 'cloud_docker_agent_1'{id=VRDC-1}, hopeful_rosalind 
[2017-02-14 08:00:44,290]   WARN [uled executor 2] - .instances.RunningInstanceTask - Failed to find agent by id=2838 for profile:profile 'cloud_docker_agent_1'{id=VRDC-1}, hopeful_rosalind 
[2017-02-14 08:01:54,238]   INFO [uled executor 3] - ances.KillIdleRunningInstances - Terminating instance hopeful_rosalind, profile 'cloud_docker_agent_1'{id=VRDC-1} with idle time 1 minutes (idle time is 1 minutes) 
[2017-02-14 08:01:54,240]   INFO [uled executor 3] - l.instances.StopInstanceAction - Terminating instance: hopeful_rosalind, profile 'cloud_docker_agent_1'{id=VRDC-1}, reason: idle timeout of 1 minutes reached 
[2017-02-14 08:01:54,240]   INFO [uled executor 3] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35 (sharov/teamcity-agent:t2) / b644aefc-7eda-44e3-8501-7565acfb1b90, profile: profile 'cloud_docker_agent_1'{id=VRDC-1} was marked to CONTAIN agent 
[2017-02-14 08:01:54,243]   INFO [uled executor 3] - ocker.DefaultDockerCloudClient - Request for terminating instance: hopeful_rosalind 
[2017-02-14 08:01:54,243]   INFO [uled executor 3] - ocker.DefaultDockerCloudClient - Scheduling cloud instance termination: hopeful_rosalind (client disposed: false). 
[2017-02-14 08:01:54,293]   INFO [uled executor 4] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, hopeful_rosalind 
[2017-02-14 08:01:56,944]   INFO [erTaskScheduler] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35, Instance: c37b1581-439a-49ea-9582-302cd2eda2ea, profile=VRDC-1 is marked with state: stopped. 
[2017-02-14 08:01:59,294]   INFO [uled executor 3] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, hopeful_rosalind

sharov avatar Feb 14 '17 08:02 sharov

After removing the image & container:

[2017-02-14 08:24:25,258]   INFO [erTaskScheduler] - ocker.DefaultDockerCloudClient - New container 868d599756803c4bbeb4b8b2ffd0586b10c10604f7c9ec2512069f9c7b123d14 created. 
[2017-02-14 08:24:26,787]   INFO [erTaskScheduler] - ocker.DefaultDockerCloudClient - Container 868d599756803c4bbeb4b8b2ffd0586b10c10604f7c9ec2512069f9c7b123d14 started. 
[2017-02-14 08:24:26,787]   INFO [erTaskScheduler] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35, Instance: 13a0ffd8-17b5-44e8-978f-1940281a2302, profile=VRDC-1 is marked with state: running. 
[2017-02-14 08:24:29,774]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'running' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby 
[2017-02-14 08:24:41,699]   INFO [nio-8111-exec-7] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35 (sharov/teamcity-agent:t2) / b3620bfa-2c87-416e-9dc9-1b5664256df4, profile: profile 'cloud_docker_agent_1'{id=VRDC-1} was marked to CONTAIN agent 
[2017-02-14 08:24:41,702]   INFO [nio-8111-exec-7] - .server.impl.CloudEventsLogger - Detected cloud agent "ip_172.17.0.2-1" {id=2836}, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby 
[2017-02-14 08:24:44,775]   WARN [uled executor 2] - .instances.RunningInstanceTask - Failed to find agent by id=2836 for profile:profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby 
[2017-02-14 08:25:14,689]   INFO [uled executor 5] - ances.KillIdleRunningInstances - Terminating instance grave_kilby, profile 'cloud_docker_agent_1'{id=VRDC-1} with idle time 1 minutes (idle time is 1 minutes) 
[2017-02-14 08:25:14,690]   INFO [uled executor 5] - l.instances.StopInstanceAction - Terminating instance: grave_kilby, profile 'cloud_docker_agent_1'{id=VRDC-1}, reason: idle timeout of 1 minutes reached 
[2017-02-14 08:25:14,691]   INFO [uled executor 5] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35 (sharov/teamcity-agent:t2) / b3620bfa-2c87-416e-9dc9-1b5664256df4, profile: profile 'cloud_docker_agent_1'{id=VRDC-1} was marked to CONTAIN agent 
[2017-02-14 08:25:14,691]   INFO [uled executor 5] - ocker.DefaultDockerCloudClient - Request for terminating instance: grave_kilby 
[2017-02-14 08:25:14,691]   INFO [uled executor 5] - ocker.DefaultDockerCloudClient - Scheduling cloud instance termination: grave_kilby (client disposed: false). 
[2017-02-14 08:25:14,777]   INFO [uled executor 2] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby 
[2017-02-14 08:25:15,465]   INFO [erTaskScheduler] - r.impl.DBCloudStateManagerImpl - Image: teamcity_agent_python_35, Instance: 13a0ffd8-17b5-44e8-978f-1940281a2302, profile=VRDC-1 is marked with state: stopped. 
[2017-02-14 08:25:19,777]   INFO [uled executor 3] - .server.impl.CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby 
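Both excerpts show the same lifecycle: the container starts, the agent is detected, a `RunningInstanceTask` warning fails to find the agent by id, and one minute later the idle timeout terminates the instance. A small sketch (not part of the plugin; the regex is keyed to the `CloudEventsLogger` line format quoted above) can reconstruct the state transitions per instance from teamcity-clouds.log:

```python
import re

def instance_states(log_text):
    """Map each cloud instance name to the ordered list of states it
    entered, based on "Cloud instance entered '...' state" log lines."""
    states = {}
    pattern = re.compile(
        r"Cloud instance entered '(\w+)' state, .*?, (\w+)\s*$", re.M)
    for state, name in pattern.findall(log_text):
        states.setdefault(name, []).append(state)
    return states

# Three lines taken from the excerpt above, abbreviated:
sample = """\
[2017-02-14 08:24:29,774] INFO - CloudEventsLogger - Cloud instance entered 'running' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby
[2017-02-14 08:25:14,777] INFO - CloudEventsLogger - Cloud instance entered 'stopping' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby
[2017-02-14 08:25:19,777] INFO - CloudEventsLogger - Cloud instance entered 'stopped' state, profile 'cloud_docker_agent_1'{id=VRDC-1}, grave_kilby
"""
print(instance_states(sample))  # -> {'grave_kilby': ['running', 'stopping', 'stopped']}
```

An instance that goes running -> stopping -> stopped without any build ever being assigned to it, as here, points at the scheduler refusing to dispatch work rather than at the container itself.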

sharov avatar Feb 14 '17 08:02 sharov

A third restart of TeamCity helped. It seems there is a race condition somewhere.

sharov avatar Feb 14 '17 08:02 sharov

The log files do not show anything wrong. This is consistent with the TC server unilaterally refusing to create new cloud instances (in all likelihood because of the license problem). This might be either a bug in the cloud API or some kind of interaction problem between the plugin and the server.

I haven't been able to reproduce the issue yet. To do so, it would be useful to have a setup as close as possible to yours. Could you please specify:

  • For each profile you defined:
    • The configured idle time.
    • If the instance is configured to be stopped after the first build.
    • For each image in the profile:
      • If the container is configured to be removed on exit.
      • If the image is bound to an agent pool.

Alternatively, you can attach to this issue the XML configuration files generated in the .BuildServer/config/_cloud-profiles/VRDC-X directory of the TeamCity data folder (there is one for each profile you defined), or send them to the following email address: revertera AT puzzle.ch

One last note: if you have any other plugins interacting with TeamCity agent authorization, I would suggest disabling them.

JeanRev avatar Feb 16 '17 06:02 JeanRev

Hi Oleg,

Thank you for the files (and sorry for the delayed response). I took some time to reproduce your setup as closely as possible, matching your cloud profiles, cloud images and agent pools.

I then ran a stress test in which a large number of jobs were submitted to projects associated with each of the agent pools, leaving enough time between batches for the agent containers to be disposed of or stopped. After several hours of running, however, I'm still unable to reproduce the issue: licenses are consistently and reliably returned to the license pool as soon as the corresponding agents are disconnected (it may take about a minute for an agent to be considered disconnected once its container is stopped, but eventually the licenses are returned).

At the same time, the screenshot you provided shows a situation where a single agent is connected but no additional licenses are left. This state is somewhat inconsistent (independently of your licensing model, you should be able to activate at least 3 agents at any given time): [screenshot: no_license_left]

Licensing issues are especially difficult to debug because the related libraries are undocumented and their bytecode is obfuscated. At this point, I think this issue deserves to be submitted to the TC issue tracker. I'll prepare something shortly and keep you informed.

JeanRev avatar Mar 07 '17 07:03 JeanRev

The problem finally came up on my test server, mostly as you describe it: all agents are stopped and none can be started anymore.

Logs with debug details indeed show that the TC server incorrectly believes that no licenses are left. By looking at the log patterns, I realized that the problem had already occurred on the same server but had probably gone unnoticed.

A new issue was raised in the TC issue tracker along with all the logs I have and a dump of the TeamCity database from the time the problem was spotted: https://youtrack.jetbrains.com/issue/TW-48990

I will wait for their feedback before analyzing this further.

JeanRev avatar Mar 09 '17 05:03 JeanRev

Good news! Thank you, Jean!

sharov avatar Mar 09 '17 07:03 sharov