manageiq-providers-amazon
manageiq-providers-amazon copied to clipboard
Amazon SSA not working
Hello,
Customer is trying to run SmartState Analysis on an instance of AWS running CentOS.
In the user tasks this analysis fails with: "job timed out after 3060.770851409 seconds of inactivity. Inactivity threshold [3000 seconds]"
This is Infrastructure Management running as virtual appliance on VMWare.
In the provided logs (TS004217864/2020-09-27/ dir on ecurep) I can see:
{"@timestamp":"2020-10-07T14:04:52.176002 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb296b6e48","level":"debug","message":"[Aws::SQS::Client 400 0.057859 0 retries] get_queue_url(queue_name:\"smartstate_extract_reply-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::SQS::Errors::NonExistentQueue The specified queue does not exist for this wsdl version."}
{"@timestamp":"2020-10-07T14:05:02.577138 ","hostname":"1-amazon-agent-coordinator-0-55fd49dc87-hsgwd","pid":7,"tid":"2acb257d795c","level":"debug","message":"[Aws::S3::Client 404 0.20764 0 retries] head_bucket(bucket:\"smartstate-b0d2ab4b-40ed-4025-a5e2-788ba23ba35d\") Aws::S3::Errors::NotFound "}
--> process agent coordinator running
12190 root 21 1 395768 204608 14740 R 15.4 0.8 0:09.24 MIQ: Amazon::CloudManager::EventCatcher id: 76, queue: ems_3
12188 root 30 10 361160 202108 13100 R 15.3 0.8 0:09.19 MIQ: Amazon::AgentCoordinatorWorker id: 75, queue: ems_agent_coordinator
12194 root 23 3 361408 195428 13128 R 14.7 0.8 0:08.85 MIQ: Amazon::CloudManager::MetricsCollectorWorker id: 78, queue: amazon
--> The related task info with GUID
[----] I, [2020-09-27T15:03:16.993192 #2228:2abfbf325970] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker#start) Worker started: ID [75], PID [12188], GUID [5b493b88-02ff-4570-8697-6e393122260b]
--> I have looked at ews.log for GUID but, the only timeout error I found is the following:
[----] I, [2020-09-27T15:59:18.098238 #12188:2aeeb017595c] INFO -- : MIQ(ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner#do_work) Alive agents in EMS(guid=1c9078f4-1231-4f90-9b63-29fbc205ce72): [].
.....
and then:
[----] I, [2020-09-27T15:59:23.405619 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5879], Delivering...
[----] W, [2020-09-27T15:59:23.409676 #2684:2ad3ca51b968] WARN -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#timeout!) Job: guid: [17167f79-8f9d-428f-a382-4cb29ee8e498], job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds], aborting
[----] I, [2020-09-27T15:59:23.416496 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5882], id: [], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"]
[----] I, [2020-09-27T15:59:23.416693 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5879], State: [ok], Delivered in [0.01107655] seconds
[----] I, [2020-09-27T15:59:23.422359 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5881], MiqWorker id: [64], Zone: [], Role: [], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqQueue.check_for_timeout], Timeout: [600], Priority: [90], State: [dequeue], Deliver On: [], Data: [], Args: [], Dequeued in: [3.302457898] seconds
[----] I, [2020-09-27T15:59:23.422617 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5881], Delivering...
[----] I, [2020-09-27T15:59:23.423784 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#delivered) Message id: [5881], State: [ok], Delivered in [0.001177885] seconds
[----] I, [2020-09-27T15:59:28.470684 #2228:2abfbf325970] INFO -- : MIQ(MiqServer#populate_queue_messages) Fetched 1 miq_queue rows for queue_name=generic, wcount=4, priority=200
[----] I, [2020-09-27T15:59:29.506630 #2684:2ad3ca51b968] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [5882], MiqWorker id: [64], Zone: [default], Role: [smartstate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [1], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]", "error"], Dequeued in: [6.094388538] seconds
[----] I, [2020-09-27T15:59:29.506939 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue#deliver) Message id: [5882], Delivering...
[----] I, [2020-09-27T15:59:29.656524 #2684:2ad3ca51b968] INFO -- : MIQ(MiqQueue.put) Message id: [5883], id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [20], State: [ready], Deliver On: [], Data: [], Args: [{:object_type=>"ManageIQ::Providers::Amazon::CloudManager::Vm", :object_id=>34, :attrs=>{:event_type=>"vm_scan_abort", "VmOrTemplate::vm"=>34, :vm_id=>34, :host=>nil, "MiqEvent::miq_event"=>308, :miq_event_id=>308, "EventStream::event_stream"=>308, :event_stream_id=>308}, :instance_name=>"Event", :user_id=>1, :miq_group_id=>1, :tenant_id=>1, :automate_message=>nil}]
[----] E, [2020-09-27T15:59:29.656639 #2684:2ad3ca51b968] ERROR -- : MIQ(ManageIQ::Providers::Amazon::CloudManager::Scanning::Job#process_abort) job aborting, job timed out after 3052.271594948 seconds of inactivity. Inactivity threshold [3000 seconds]
[----] E, [2020-09-27T15:59:29.662790 #2684:2ad3ca51b968] ERROR -- : MIQ(MiqQueue#deliver) Message id: [5882], Error: [finish is not permitted at state aborting]
[----] E, [2020-09-27T15:59:29.662909 #2684:2ad3ca51b968] ERROR -- : [RuntimeError]: finish is not permitted at state aborting Method:[block (2 levels) in class:LogProxy]
[----] E, [2020-09-27T15:59:29.662994 #2684:2ad3ca51b968] ERROR -- : /var/www/miq/vmdb/app/models/job/state_machine.rb:50:in `signal'
Not sure if this is related to the issue.
cc @jerryk55
@agrare was this ever addressed? I recall you making some fixes for SmartState Analysis, but cannot figure out if this is still an issue
There was an issue with the agent_coordinator worker that we fixed, not sure if that fixes this or not though
This issue has been automatically marked as stale because it has not been updated for at least 3 months.
If you can still reproduce this issue on the current release or on master
, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.