
Workers not able to be created error message - but machines exist

Open RichardScottOZ opened this issue 1 year ago • 36 comments


Traceback (most recent call last):
  File "lithopstest.py", line 44, in <module>
    fexec.map(my_map_function, args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/executors.py", line 279, in map
    futures = self.invoker.run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 280, in run_job
    futures = self._run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 222, in _run_job
    raise e
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 219, in _run_job
    self._invoke_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 267, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/standalone.py", line 286, in invoke
    raise Exception('It was not possible to create any worker')
Exception: It was not possible to create any worker

[screenshot]

Possibly a public/private check issue?

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

This is just running one of the example files in the repo doing some simple calcs.

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

timed out
2024-11-23 06:49:50,337 [DEBUG] standalone.py:277 -- Found 0 free workers connected to VM instance lithops-master-dadab3
2024-11-23 06:49:50,337 [DEBUG] standalone.py:281 -- Going to create 2 new workers
{'username': 'ubuntu', 'password': 'c0d72593-5e1f-458b-aaad-74ab52be154d', 'key_filename': 'lithops-key-b597095a.aws_ec2.id_rsa'}
{'username': 'ubuntu', 'password': 'c0d72593-5e1f-458b-aaad-74ab52be154d', 'key_filename': 'lithops-key-b597095a.aws_ec2.id_rsa'}
2024-11-23 06:49:50,339 [DEBUG] aws_ec2.py:1223 -- Creating new VM instance lithops-worker-10c28285 (Spot)
2024-11-23 06:49:50,341 [DEBUG] aws_ec2.py:1223 -- Creating new VM instance lithops-worker-74d89509 (Spot)
2024-11-23 06:49:57,315 [DEBUG] standalone.py:256 -- Total worker VM instances created: 0/2
Traceback (most recent call last):
  File "lithopstest.py", line 44, in <module>
    fexec.map(my_map_function, args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/executors.py", line 279, in map
    futures = self.invoker.run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 280, in run_job
    futures = self._run_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 222, in _run_job
    raise e
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 219, in _run_job
    self._invoke_job(job)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/invokers.py", line 267, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/standalone.py", line 286, in invoke
    raise Exception('It was not possible to create any worker')
Exception: It was not possible to create any worker

I checked the SSH connection error ('timed out'), so possibly the problem is there, or an interaction between the Spot API and something else - two machines were created again. I have several still alive from last night's tests too, so this error is presumably preventing auto shutdown.

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

In some testing yesterday I did this, which enabled the master VM to start:

def get_ssh_client(self):
    """
    Creates an ssh client against the VM
    """
    # if self.public:
    if self.public and 1 == 2:  # Richard 20241122: force the private-IP branch
        if not self.ssh_client or self.ssh_client.ip_address != self.public_ip:
            self.ssh_client = SSHClient(self.public_ip, self.ssh_credentials)
    else:
        if not self.ssh_client or self.ssh_client.ip_address != self.private_ip:
            self.ssh_client = SSHClient(self.private_ip, self.ssh_credentials)

    return self.ssh_client
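
For reference, the same effect could probably be expressed as an explicit switch rather than the 1 == 2 short-circuit - rough sketch only, where use_private_ip is a hypothetical attribute, not (as far as I can tell) an existing Lithops option:

def get_ssh_client(self):
    """
    Creates an ssh client against the VM, preferring the private IP
    when the (hypothetical) use_private_ip flag is set
    """
    # use_private_ip is illustrative only; Lithops has no such config key as far as I know
    use_public = self.public and not getattr(self, 'use_private_ip', False)
    target_ip = self.public_ip if use_public else self.private_ip

    if not self.ssh_client or self.ssh_client.ip_address != target_ip:
        self.ssh_client = SSHClient(target_ip, self.ssh_credentials)

    return self.ssh_client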

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

Here, it's looking at the public IP, I think?

2024-11-23 07:10:32,657 [DEBUG] ssh_client.py:59 -- 10.1.111.11 ssh client created
2024-11-23 07:10:34,094 [DEBUG] standalone.py:277 -- Found 0 free workers connected to VM instance lithops-master-dadab3 (53.92.198.72)

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

When I changed it from the default Spot instances, it got further:

2024-11-23 07:22:02,997 [DEBUG] aws_ec2.py:1275 -- VM instance lithops-worker-f12ca4e5 created successfully
2024-11-23 07:22:03,973 [DEBUG] aws_ec2.py:1275 -- VM instance lithops-worker-d10f6430 created successfully
2024-11-23 07:22:03,973 [DEBUG] standalone.py:259 -- Total worker VM instances created: 2/2
2024-11-23 07:22:03,973 [DEBUG] standalone.py:291 -- ExecutorID 661f55-0 | JobID M000 - Going to run 3 activations in 2 workers
2024-11-23 07:22:09,590 [DEBUG] invokers.py:271 -- ExecutorID 661f55-0 | JobID M000 - Job invoked (11.244s) - Activation ID: 661f55-0-M000
2024-11-23 07:22:09,590 [INFO] invokers.py:225 -- ExecutorID 661f55-0 | JobID M000 - View execution logs at /tmp/lithops-richard/logs/661f55-0-M000.log
2024-11-23 07:22:09,590 [DEBUG] monitor.py:429 -- ExecutorID 661f55-0 - Starting Storage job monitor
2024-11-23 07:22:09,590 [INFO] executors.py:494 -- ExecutorID 661f55-0 - Getting results from 3 function activations
2024-11-23 07:22:09,590 [INFO] wait.py:101 -- ExecutorID 661f55-0 - Waiting for 3 function activations to complete
2024-11-23 07:22:13,440 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:22:53,439 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:23:34,012 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

I wouldn't think that example takes very long to run though

2024-11-23 07:22:09,590 [INFO] wait.py:101 -- ExecutorID 661f55-0 - Waiting for 3 function activations to complete
2024-11-23 07:22:13,440 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:22:53,439 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:23:34,012 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:24:14,118 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:24:54,269 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 07:25:36,625 [DEBUG] monitor.py:144 -- ExecutorID 661f55-0 - Pending: 3 - Running: 0 - Done: 0

RichardScottOZ avatar Nov 22 '24 20:11 RichardScottOZ

and with non-Spot instances,

lithops worker list tells me

Worker Name              Created                  Instance Type      Processes  Runtime    Mode    Status      TTD
-----------------------  -----------------------  ---------------  -----------  ---------  ------  ----------  -------
lithops-worker-f12ca4e5  2024-11-22 20:52:09 UTC  t3.medium                  2  python3    reuse   installing  Unknown
lithops-worker-d10f6430  2024-11-22 20:52:09 UTC  t3.medium                  2  python3    reuse   installing  Unknown

lithops job list has

Job ID         Function Name      Submitted                Worker Type    Runtime    Tasks Done    Job Status
-------------  -----------------  -----------------------  -------------  ---------  ------------  ------------
661f55-0-M000  my_map_function()  2024-11-22 20:51:58 UTC  t3.medium      python3    0/3           submitted

RichardScottOZ avatar Nov 22 '24 21:11 RichardScottOZ

I did not change any defaults, so the runtime is the default python3 - does the 'installing' status mean it is expecting a container called python3?

I definitely do not have a runtime set in the config.

In invokers.py:

        if self.backend not in STANDALONE_BACKENDS:
            logger.debug(
                f'ExecutorID {job.executor_id} | JobID {job.job_id} - Worker processes: '
                f'{job.worker_processes} - Chunksize: {job.chunksize}'
            )

        try:
            print("TRYING RUNTIME:",self.runtime_name)
            job.runtime_name = self.runtime_name
            self._invoke_job(job)
        except (KeyboardInterrupt, Exception) as e:
            self.stop()
            raise e

gives

TRYING RUNTIME: python3

so should that actually be set to anything? For this default case of 'just do some calcs in the OS python3'?

e.g. I left this going for some time. The docs say:

Option 1: By default, Lithops uses an Ubuntu 22.04 image. In this case, no further action is required and you can continue to the next step. Lithops will install all required dependencies in the VM by itself. Notice this can consume about 3 min to complete all installations.

Probably 15 minutes plus here - I could have left it going for an extended period, but it didn't seem like it was going to work.

RichardScottOZ avatar Nov 22 '24 21:11 RichardScottOZ

Next test for the timeout problems, then, I guess, is to make a simple image or something along those lines.

RichardScottOZ avatar Nov 22 '24 21:11 RichardScottOZ

Try rebuilding the default image and see what happens?

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

When I try that

2024-11-23 08:39:18,586 [DEBUG] aws_ec2.py:1276 -- VM instance building-image-lithops-ubuntu-jammy-22.04-amd64-server created successfully
2024-11-23 08:39:18,586 [DEBUG] aws_ec2.py:1132 -- Waiting VM instance building-image-lithops-ubuntu-jammy-22.04-amd64-server to become ready
but the ssh connection creation timed out

and

2024-11-23 08:39:27,666 [DEBUG] aws_ec2.py:1123 -- SSH to 0.0.0.0 failed (publickey): timed out

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

so how to get this

def is_ready(self):
    """
    Checks if the VM instance is ready to receive ssh connections
    """
    login_type = 'password' if 'password' in self.ssh_credentials and \
        not self.public else 'publickey'
    try:
        self.get_ssh_client().run_remote_command('id')
    except LithopsValidationError as err:
        raise err
    except Exception as err:
        logger.debug(f'SSH to {self.public_ip if self.public else self.private_ip} failed ({login_type}): {err}')
        self.del_ssh_client()
        return False
    return True

in aws_ec2 to handle a private case?

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

Again, though, it looks like it has been created - in the console:

[screenshot]

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

and if I disable the error handling in the above

timed out
Traceback (most recent call last):
  File "/home/richard/miniconda3/envs/lithops/bin/lithops", line 8, in <module>
    sys.exit(lithops_cli())
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/scripts/cli.py", line 851, in build_image
    compute_handler.build_image(name, file, overwrite, include, ctx.args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/standalone.py", line 88, in build_image
    self.backend.build_image(image_name, script_file, overwrite, include, extra_args)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/standalone/backends/aws_ec2/aws_ec2.py", line 659, in build_image
    build_vm.get_ssh_client().upload_data_to_file(script, remote_script)
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/util/ssh_client.py", line 145, in upload_data_to_file
    self.ssh_client = self.create_client()
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/util/ssh_client.py", line 62, in create_client
    raise e
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/lithops/util/ssh_client.py", line 52, in create_client
    self.ssh_client.connect(
  File "/home/richard/miniconda3/envs/lithops/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
    sock.connect(addr)
TimeoutError: timed out

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

which could be an unintended consequence of my tinkering to get some other private_ip things to work?

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

So it's failing to create an ssh_client on the run_remote_command part above.

Not sure why it's failing to connect to these build machines when others are OK for SSH connection via private IP.

RichardScottOZ avatar Nov 22 '24 22:11 RichardScottOZ

I can definitely connect manually to a build-* machine via the private IP anyway, so not sure why the code is failing here.
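
For what it's worth, a minimal paramiko check along these lines reproduces roughly what ssh_client.create_client does, so it might help separate a key problem from a network problem - the IP and key path are placeholders, not the real values from this run:

import paramiko

# Placeholders - substitute the build VM's private IP and the key file Lithops generated
host = '10.0.0.10'
key_file = '/path/to/lithops-key-b597095a.aws_ec2.id_rsa'

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    ssh.connect(host, username='ubuntu', key_filename=key_file, timeout=10)
    _, stdout, _ = ssh.exec_command('id')
    print(stdout.read().decode())
finally:
    ssh.close()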

RichardScottOZ avatar Nov 23 '24 00:11 RichardScottOZ

This run it has got to building an image - taking a while to do, but that is a bit further. Don't know what the difference was; possibly a badly flaky VPN and internet earlier, I guess, or the master having just shut down.

RichardScottOZ avatar Nov 23 '24 01:11 RichardScottOZ

2024-11-23 12:01:09,849 [DEBUG] aws_ec2.py:696 -- VM Image is being created. Current status: pending
2024-11-23 12:01:31,280 [DEBUG] aws_ec2.py:696 -- VM Image is being created. Current status: available
2024-11-23 12:01:31,280 [DEBUG] aws_ec2.py:1405 -- Deleting VM instance building-image-lithops-ubuntu-jammy-22.04-amd64-server (i-003042a7a1602ea80)
2024-11-23 12:01:31,753 [INFO] aws_ec2.py:707 -- VM Image created. Image ID: ami-06ea522e39999999
2024-11-23 12:01:31,753 [INFO] cli.py:853 -- VM Image built

RichardScottOZ avatar Nov 23 '24 01:11 RichardScottOZ

but trying that AMI ID in target_ami, I'm back to solving this problem:

2024-11-23 12:28:29,116 [DEBUG] monitor.py:429 -- ExecutorID 017dd6-0 - Starting Storage job monitor
2024-11-23 12:28:29,116 [INFO] executors.py:494 -- ExecutorID 017dd6-0 - Getting results from 3 function activations
2024-11-23 12:28:29,116 [INFO] wait.py:101 -- ExecutorID 017dd6-0 - Waiting for 3 function activations to complete
2024-11-23 12:28:32,845 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:29:14,093 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:29:55,072 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:30:36,118 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:31:16,790 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:31:56,778 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:32:38,720 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:33:19,958 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:34:01,740 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:34:43,345 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0
2024-11-23 12:35:25,381 [DEBUG] monitor.py:144 -- ExecutorID 017dd6-0 - Pending: 3 - Running: 0 - Done: 0

e.g. this happily runs in an infinite loop

  • I actually left another one running overnight - it never timed out - so something is up there?

RichardScottOZ avatar Nov 23 '24 02:11 RichardScottOZ

I let it run for an hour then ctrl-C'ed it.

2024-11-23 13:30:40,473 [DEBUG] ssh_client.py:65 -- 10.0.0.0  ssh client created
2024-11-23 13:30:40,683 [DEBUG] monitor.py:457 -- ExecutorID 017dd6-0 - Storage job monitor finished
2024-11-23 13:30:41,571 [INFO] executors.py:618 -- ExecutorID 017dd6-0 - Cleaning temporary data

RichardScottOZ avatar Nov 23 '24 03:11 RichardScottOZ

job payload has this at the start

JOB PAYLOAD: {'config': {'lithops': {'backend': 'aws_ec2', 'log_level': 'DEBUG', 'mode': 'standalone', 'chunksize': 0, 'storage': 'aws_s3', 'monitoring': 'storage', 'monitoring_interval': 2, 'execution_timeout': 1800, 'backend_type': 'batch'},

RichardScottOZ avatar Nov 23 '24 03:11 RichardScottOZ

Experiments are here: https://github.com/RichardScottOZ/lithops/tree/private_ip

RichardScottOZ avatar Nov 23 '24 21:11 RichardScottOZ

These exist - and are using BatchInvoker

FUTURES: [<lithops.future.ResponseFuture object at 0x7fc5b3446fb0>, <lithops.future.ResponseFuture object at 0x7fc5b00cc7f0>, <lithops.future.ResponseFuture object at 0x7fc5b00cc850>]

RichardScottOZ avatar Nov 23 '24 22:11 RichardScottOZ

Will simplify very slightly and just use this: https://github.com/lithops-cloud/lithops/blob/master/examples/call_async.py - so one task, not 3.
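
For reference, that example boils down to roughly this (written from memory, so treat it as approximate rather than a verbatim copy of call_async.py):

import lithops

def hello(name):
    return f'Hello {name}!'

fexec = lithops.FunctionExecutor()
fexec.call_async(hello, 'World')
print(fexec.get_result())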

RichardScottOZ avatar Nov 23 '24 22:11 RichardScottOZ

From the latest test of that, the worker-service log:

cat /tmp/lithops-root/worker-service.log
2024-11-23 23:19:26,575 [ERROR] lithops.standalone.worker:107 -- Timeout connecting to server
2024-11-23 23:19:27,361 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:60 -- Creating AWS EC2 client
2024-11-23 23:19:29,680 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:103 -- AWS EC2 client created - Region: us-east-1
2024-11-23 23:19:29,681 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-11-23 23:19:29,681 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for lithops-worker-260b9387 (10.6.131.122), instance ID: i-075b34f8d528f726e
2024-11-23 23:19:29,681 [DEBUG] lithops.standalone.keeper:55 -- Delete lithops-worker-260b9387 on dismantle: True
2024-11-23 23:19:29,682 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-11-23 23:19:29,682 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 300s, Hard Timeout: 3600s
2024-11-23 23:19:29,682 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3596 seconds
2024-11-23 23:19:29,683 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: t3.medium - Runtime name: python3 - Worker processes: 2
2024-11-23 23:19:29,683 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-11-23 23:19:29,684 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-11-23 23:20:29,743 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 3536 seconds

RichardScottOZ avatar Nov 23 '24 23:11 RichardScottOZ

So does that mean that is the problem - workers can't get to the master - or is that just on the first try, and then they do?

/tmp/lithops-root/jobs and /logs are empty anyway

RichardScottOZ avatar Nov 23 '24 23:11 RichardScottOZ

on the worker

curl -X GET http://lithops-master:8080/ping
{"response":"3.5.1"}

and ports 8080 and 8081 need to be reachable between the master and workers
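
Not sure offhand which of 8080/8081 is served by the master vs. the workers, but a plain socket probe from either side would confirm reachability - the hostname below assumes the same lithops-master entry the curl above resolved:

import socket

host = 'lithops-master'  # or the master's private IP
for port in (8080, 8081):
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f'{host}:{port} reachable')
    except OSError as err:
        print(f'{host}:{port} NOT reachable: {err}')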

RichardScottOZ avatar Nov 23 '24 23:11 RichardScottOZ

When I logged into a worker at startup I did see a python3 task appear via top - so something is running, but results are not being collected?

RichardScottOZ avatar Nov 24 '24 01:11 RichardScottOZ

If I wanted to put additional logging on the workers - would I have to do that, then make a new master so it would have an updated worker.py?

RichardScottOZ avatar Nov 24 '24 01:11 RichardScottOZ