metaflow
Batch jobs fail to start with a bash error "unary operator expected"
Over the years, we have received a number of sporadic reports of @batch jobs failing with nothing shown on the Metaflow console, while CloudWatch contains messages like:
| Timestamp | Message |
| -- | -- |
| 2023-05-03T13:58:18.077-07:00 | Setting up task environment. |
| 2023-05-03T13:58:24.321-07:00 | bash: line 1: [: -le: unary operator expected |
| 2023-05-03T13:58:24.321-07:00 | bash: line 1: [: -gt: unary operator expected |
| 2023-05-03T13:58:24.322-07:00 | tar: job.tar: Cannot open: No such file or directory |
| 2023-05-03T13:58:24.323-07:00 | tar: Error is not recoverable: exiting now |
| 2023-05-03T13:58:24.336-07:00 | /usr/local/bin/python: Error while finding module specification for 'metaflow.mflog.save_logs' (ModuleNotFoundError |
This seems to happen when Metaflow fails to install its dependencies (awscli, pip, etc.) in the entrypoint, e.g. because upstream package repositories are unresponsive. The issue typically resolves itself after a while.
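For anyone wondering how a failed install turns into `unary operator expected`: bash prints that message when an unquoted variable inside `[ ... ]` expands to nothing. The sketch below reproduces the message in isolation. It assumes a bootstrap shaped like `setup && i=0; ...`, where the counter is only initialized if the setup step succeeds. That shape is a guess for illustration, not a copy of Metaflow's entrypoint, and `run_bash`, `broken` and `healthy` are made-up names.

```python
import os
import shutil
import subprocess


def run_bash(script: str) -> str:
    """Run a snippet with bash and return whatever it wrote to stderr."""
    bash = shutil.which("bash")
    if bash is None:
        return ""  # no bash on this machine; nothing to demonstrate
    env = dict(os.environ, LC_ALL="C")  # keep bash's messages in English
    result = subprocess.run(
        [bash, "-c", script], capture_output=True, text=True, env=env
    )
    return result.stderr


# Hypothetical stand-in for a bootstrap: `false` plays the failed setup step,
# so the `i=0` chained after `&&` is skipped and `$i` expands to nothing.
broken = "false && i=0; if [ $i -le 5 ]; then echo retry; fi"

# Same snippet with the setup step succeeding, so `i` is set before the test.
healthy = "true && i=0; if [ $i -le 5 ]; then echo retry; fi"

print(run_bash(broken))
print(run_bash(healthy))
```

If this is what happens, the `-le` and `-gt` lines in the log are only a side effect, and the real failure is whatever went wrong just before them.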
At the very least, we could provide a better error message.
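As a rough idea of what a better message could look like, here is a minimal sketch that maps the log fragments reported in this thread to human-readable hints. Nothing here exists in Metaflow: `KNOWN_SYMPTOMS` and `diagnose` are hypothetical names, and the hints are my reading of the logs rather than confirmed root causes.

```python
from typing import List, Tuple

# Hypothetical table: log fragment seen in this thread -> hint for the user.
KNOWN_SYMPTOMS: List[Tuple[str, str]] = [
    ("python: command not found",
     "The image has no `python` executable on PATH."),
    ("unary operator expected",
     "An earlier bootstrap step probably failed, leaving a shell variable empty."),
    ("job.tar: Cannot open",
     "The code package was never downloaded."),
]


def diagnose(log_lines: List[str]) -> List[str]:
    """Return one hint per known symptom found in the log, in table order."""
    hints = []
    for fragment, hint in KNOWN_SYMPTOMS:
        if any(fragment in line for line in log_lines):
            hints.append(hint)
    return hints


sample_log = [
    "bash: line 1: [: -le: unary operator expected",
    "bash: line 1: [: -gt: unary operator expected",
    "tar: job.tar: Cannot open: No such file or directory",
]
for hint in diagnose(sample_log):
    print(hint)
```

A real fix would more likely fail fast inside the entrypoint, but even a post-hoc hint like this would have pointed at the right step.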
I see a similar issue when trying to use a custom image on a Google Kubernetes cluster. I use the following image:
@kubernetes(image="google/cloud-sdk:latest")
But the job fails with the following errors in the log file:
INFO 2023-09-25T15:08:04 Setting up task environment.
ERROR 2023-09-25T15:08:04 bash: line 1: python: command not found
ERROR 2023-09-25T15:08:04 bash: line 1: [: -le: unary operator expected
ERROR 2023-09-25T15:08:04 bash: line 1: [: -gt: unary operator expected
ERROR 2023-09-25T15:08:04 tar: job.tar: Cannot open: No such file or directory
ERROR 2023-09-25T15:08:04 tar: Error is not recoverable: exiting now
It looks to me as if Metaflow cannot handle custom images that don't have Python preinstalled. Is that correct?
UPDATE: It seems the image "google/cloud-sdk:latest" doesn't provide the command python, only the command python3 (see https://github.com/GoogleCloudPlatform/cloud-sdk-docker/issues/273). Unfortunately, Metaflow seems to assume the command is called python and fails otherwise; see
https://github.com/Netflix/metaflow/blob/2b1eab83f64f8d6034504f51ca9aa828400e0d05/metaflow/metaflow_environment.py#L201C13-L201C13
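
To illustrate the kind of fallback I have in mind, here is a minimal sketch that picks the first interpreter name found on PATH instead of assuming `python`. `find_python` is a hypothetical helper, not something Metaflow provides, and in practice the lookup would have to run inside the container rather than on the machine launching the flow.

```python
import shutil
from typing import Callable, Optional, Sequence


def find_python(
    candidates: Sequence[str] = ("python", "python3"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate name that resolves on PATH, or None."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


# Simulate an image like the one above, where only `python3` is available.
def fake_which(name: str) -> Optional[str]:
    return "/usr/bin/python3" if name == "python3" else None


print(find_python(which=fake_which))
```

Passing `which` as a parameter is only there so the lookup can be simulated; by default it uses `shutil.which` from the standard library.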