
Custom Task API Provider raises exception in Compute, but Golem fails to send TaskFailed

Open · ktomala opened this issue 4 years ago · 1 comment

Description

Golem Version: b0.22

Golem-Messages version (leave empty if unsure):

Electron version (if used):

OS [e.g. Windows 10 Pro]: ubuntu 16.04

Branch (if launched from source): b0.22

Mainnet/Testnet: testnet

Priority label is set to the lowest by default. To set a higher priority, please change the label:

  - P0: Severity-Critical / Effort-easy
  - P1: Severity-Critical / Effort-hard
  - P2: Severity-Low / Effort-easy
  - P3: Severity-Low / Effort-hard

Description of the issue:

When a custom Task API Provider fails by raising an exception during Compute, the following happens:

Traceback (most recent call last):
File "/home/wolverine/.virtualenvs/golem-cloud/lib/python3.6/site-packages/golem_task_api/client.py", line 305, in compute
reply = await self._golem_app.Compute(request)
File "/home/wolverine/.virtualenvs/golem-cloud/lib/python3.6/site-packages/grpclib/client.py", line 598, in _call_
return await stream.recv_message()
File "/home/wolverine/.virtualenvs/golem-cloud/lib/python3.6/site-packages/grpclib/client.py", line 329, in recv_message
await self.recv_initial_metadata()
File "/home/wolverine/.virtualenvs/golem-cloud/lib/python3.6/site-packages/grpclib/client.py", line 271, in recv_initial_metadata
self._raise_for_grpc_status(headers_map, optional=True)
File "/home/wolverine/.virtualenvs/golem-cloud/lib/python3.6/site-packages/grpclib/client.py", line 240, in _raise_for_grpc_status
raise GRPCError(grpc_status_enum, status_message)
grpclib.exceptions.GRPCError: (<Status.INTERNAL: 13>, "Pipeline process exited with code 1. out:b'UID: 0 GID: 0\\n--- Scanning interfaces, routes and network configuration\\n--- Network settings:\\nIPv4: 172.17.0.4/16 Interface: eth0 Gateway: 172.17.0.1\\nERROR: OpenVPN client private key not found.\\n' err:b''")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/wolverine/src/golem.network/golem-cloud/golem/task/taskcomputer.py", line 389, in _wait_until_computation_ends
output_file = yield self._computation
File "/home/wolverine/src/golem.network/golem-cloud/golem/core/deferred.py", line 75, in adapt
extracted = result.result()
File "/home/wolverine/src/golem.network/golem-cloud/golem/task/taskcomputer.py", line 378, in _create_client_and_compute
subtask_params=assigned_task.subtask_params
File "/home/wolverine/.virtualenvs/golem-cloud/lib/python3.6/site-packages/golem_task_api/client.py", line 310, in compute
await self._service.wait_until_shutdown_complete()
File "/home/wolverine/src/golem.network/golem-cloud/golem/task/task_api/__init__.py", line 89, in wait_until_shutdown_complete
await self._runtime.clean_up().asFuture(loop)
ValueError: Invalid status: TORN_DOWN. Expected: FAILURE or STOPPED

It seems there is a bug in Golem: either the task failure is not reported properly, or clean-up does not catch all errors, so the original GRPCError is masked by the ValueError raised during shutdown.
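As a minimal sketch of the suspected problem (all names hypothetical, not actual Golem code): if the clean-up path itself raises, the original `GRPCError` never reaches the failure handler. Shielding the original error would look roughly like this:

```python
import asyncio

class GRPCError(Exception):
    """Stand-in for grpclib.exceptions.GRPCError."""

async def compute(run, shutdown):
    """Run a task; on failure, attempt clean-up, but never let a
    clean-up error replace the original compute error."""
    try:
        return await run()
    except GRPCError:
        try:
            await shutdown()
        except ValueError:
            # e.g. "Invalid status: TORN_DOWN" -- log and swallow, so the
            # original error can still trigger a TaskFailed message
            pass
        raise

async def failing_run():
    raise GRPCError("Pipeline process exited with code 1")

async def broken_shutdown():
    raise ValueError("Invalid status: TORN_DOWN. Expected: FAILURE or STOPPED")

def surviving_error():
    """Return the type name of the exception that escapes compute()."""
    try:
        asyncio.run(compute(failing_run, broken_shutdown))
    except Exception as exc:
        return type(exc).__name__
```

With a pattern like this, the GRPCError (rather than the shutdown ValueError) would be the error reported upstream, so the requestor could be sent a proper TaskFailed.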

After this, the requestor starts logging the following in a loop:

INFO     [golem.task.tasksession             ] Offer confirmed, assigning subtask(s) 
INFO     [golem.task.tasksession             ] Subtask assigned. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8', subtask_id='1a88bd8a-2263-11ea-9e44-7bff439d5ce5' 
INFO     [golem.task.taskserver              ] Computation for task '1a88bd8a-2263-11ea-9e44-7bff439d5ce5' failed: 'Invalid status: TORN_DOWN. Expected: FAILURE or STOPPED'. 
WARNING  [golem.task.taskmanager             ] This is not my subtask '1a88bd8a-2263-11ea-9e44-7bff439d5ce5' 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=2, task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5' 
INFO     [golem.client                       ] Updating mask. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', new_mask_size=0 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=3, task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5' 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=4, task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5' 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=5, task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5' 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=6, task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5' 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 
INFO     [golem.task.taskrequestorstats      ] Received work offers. offer_count=7, task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5' 
INFO     [golem.client                       ] Updating mask. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', new_mask_size=0 
INFO     [golem.task.tasksession             ] Received offer to compute. task_id='0d46d322-2263-11ea-84b2-7bff439d5ce5', node='a80a2272..a7510fd8' 

On the provider side:

INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 
ERROR    [asyncio                            ] Unclosed connection: Channel('172.17.0.4', 50006, ..., path=None) 
INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 
INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 
INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 
INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 
INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 
INFO     [golem.network.transport.tcpserver  ] Connecting to peer. node=''(7bff439d..0b4a52ab), adresses=['83.17.198.65:40105', '10.128.64.147:40105', '172.17.0.1:40105'] 
INFO     [golem.task.tasksession             ] Task request rejected. task_id: '0d46d322-2263-11ea-84b2-7bff439d5ce5', reason: <REASON.NoMoreSubtasks: 'NoMoreSubtasks'> 

As far as I can tell, neither the provider nor the requestor bans the other node for this task in this situation; there is no information about it in the logs.

For clarity, the node is also not banned:

Blocked nodes
┌────────┬────────────┐
│  node  │  valid to  │
├────────┼────────────┤
│        │            │
└────────┴────────────┘

When trying to query the requestor about the subtasks, the CLI fails as well:

ERROR [golem_rpc_api::rpc] on comp.task.subtasks unable to parse: [Array([Object({"task_id": String("0d46d322-2263-11ea-84b2-7bff439d5ce5"), "subtask_id": String("1a88bd8a-2263-11ea-9e44-7bff439d5ce5"), "status": String("Starting"), "time_started": Number(1576761963.07706), "node_id": String("a80a2272e07cf2cb606e38efa47dafe5ee38ce166bc637eb13b206cffb0a9e22b294cf9f889ccdf94c5df62ef0ad8988ceda82cc8c322ff859f12d13a7510fd8"), "node_name": String("")})])]: missing field `results`
ERROR [golemcli] : missing field `results`
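This secondary CLI failure looks independent: the subtask is stuck in `Starting`, so the RPC response carries no `results` field, and the CLI treats that field as mandatory. Sketched in Python for consistency with the rest of this report (field names taken from the error message above; the parser shape is hypothetical):

```python
def parse_subtask(raw: dict) -> dict:
    """Tolerant subtask parser: `results` is absent until a subtask
    finishes, so default it instead of failing with
    "missing field `results`"."""
    return {
        "task_id": raw["task_id"],
        "subtask_id": raw["subtask_id"],
        "status": raw["status"],
        # treat `results` as optional while the subtask is still running
        "results": raw.get("results", []),
    }
```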

Actual result:

The subtask is never marked as failed: the requestor keeps re-offering the task, while the provider's subsequent requests are rejected with NoMoreSubtasks.


Steps To Reproduce

Short description of steps to reproduce the behavior:

  1. Start a provider with a custom Task API app.
  2. Make the app raise an exception during Compute (here: the pipeline aborts on a missing OpenVPN client private key).
  3. Observe the ValueError on the provider and the offer/rejection loop on the requestor.

Expected behavior

The subtask should be noted as failed (a TaskFailed sent to the requestor) and the exception properly handled.

Logs and any additional context


Proposed Solution?

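One possible direction, assuming the ValueError comes from a status assertion during runtime clean-up: accept TORN_DOWN as a terminal state alongside FAILURE and STOPPED. This is a hypothetical sketch (enum and function names invented for illustration), not actual Golem code:

```python
from enum import Enum, auto

class RuntimeStatus(Enum):
    RUNNING = auto()
    STOPPED = auto()
    FAILURE = auto()
    TORN_DOWN = auto()

# TORN_DOWN added: the runtime is already gone, so clean-up should be a no-op
TERMINAL = {RuntimeStatus.STOPPED, RuntimeStatus.FAILURE, RuntimeStatus.TORN_DOWN}

def assert_cleanable(status: RuntimeStatus) -> None:
    """Raise only for genuinely unexpected states, instead of rejecting
    TORN_DOWN as the current check appears to do."""
    if status not in TERMINAL:
        raise ValueError(
            f"Invalid status: {status.name}. "
            "Expected: FAILURE, STOPPED or TORN_DOWN")
```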

— ktomala, Dec 19 '19 13:12

@ktomala, looking at the code, this should be handled properly.

Could you please upload a debug log from this error so I can have a closer look?

— maaktweluit, Jan 02 '20 10:01