Logs not being written to remote storage (GCS)
Checks
- [X] I have checked for existing issues.
- [X] This report is about the User-Community Airflow Helm Chart.
Chart Version
8.6.1
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.2200", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-20T09:29:14Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1
Helm Version
version.BuildInfo{Version:"v3.7.0", GitCommit:"eeac83883cb4014fe60267ec6373570374ce770b", GitTreeState:"clean", GoVersion:"go1.16.8"}
Description
I'm using Remote Logging to store logs in a GCS bucket, but the logs don't seem to be uploaded. This is what I see inside tasks:
*** Unable to read remote log from gs://airflow-logs-store/staging/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/staging%2Fdag_id%3Ddo_labeled_pack_data_etl%2Frun_id%3Dmanual__2022-08-01T09%3A35%3A44%2B00%3A00%2Ftask_id%3Dlist_batch_packs%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/staging/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
*** Log file does not exist: /opt/airflow/logs/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log
*** Fetching from: http://airflow-worker-0.airflow-worker.airflow.svc.cluster.local:8793/log/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log
This seems to happen with all DAGs.
I already have a working connection to GCP, and the Service Account has the right permissions (in fact this DAG moves some data to a GCS bucket using a Hook with that connection, confirming that the connection works). The DAGs all end successfully: no crashes, the final status is either success or failed, and the local logs are available.
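(For anyone reproducing this: a rough sketch of inspecting the connection exactly as Airflow resolves it, from inside one of the pods. The pod name, namespace, and output flag are assumptions based on this deployment and the Airflow 2.x CLI.)

# Print the google_cloud_default connection as Airflow sees it
kubectl exec -n airflow airflow-worker-0 -- airflow connections get google_cloud_default -o json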
This is the current configuration for Logging:
$ airflow config list
[logging]
base_log_folder = /opt/airflow/logs
remote_logging = True
remote_log_conn_id = google_cloud_default
google_key_path =
remote_base_log_folder = gs://airflow-logs-store/staging
encrypt_s3_logs = False
logging_level = INFO
celery_logging_level =
fab_logging_level = WARNING
logging_config_class =
colored_console_log = True
colored_log_format = [%(blue)s%(asctime)s%(reset)s] {%(blue)s%(filename)s:%(reset)s%(lineno)d} %(log_color)s%(levelname)s%(reset)s - %(log_color)s%(message)s%(reset)s
colored_formatter_class = airflow.utils.log.colored_log.CustomTTYColoredFormatter
log_format = [%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s
simple_log_format = %(asctime)s %(levelname)s - %(message)s
task_log_prefix_template =
log_filename_template = dag_id={{ ti.dag_id }}/run_id={{ ti.run_id }}/task_id={{ ti.task_id }}/{% if ti.map_index >= 0 %}map_index={{ ti.map_index }}/{% endif %}attempt={{ try_number }}.log
log_processor_filename_template = {{ filename }}.log
dag_processor_manager_log_location = /opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
task_log_reader = task
extra_logger_names =
worker_log_server_port = 8793
I have also tried adding a custom google_key_path (the same one that I'm using for the default connection), but the logs still weren't uploaded. What can I do to troubleshoot this?
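One way to confirm whether anything is landing in the bucket at all (a sketch, assuming gsutil is available somewhere with credentials for the same Service Account):

# Missing per-task folders here would mean the upload never happened,
# not just that the webserver can't read it back
gsutil ls -r gs://airflow-logs-store/staging/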
Relevant Logs
No response
Custom Helm Values
airflow:
  executor: CeleryExecutor
  defaultSecurityContext: {}
  usersUpdate: false
  config:
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "60"
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING: "True"
    AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth, airflow.api.auth.backend.session"
    AIRFLOW__LOGGING__REMOTE_LOGGING: "true"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: ""  # This is set by helm --set
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "google_cloud_default"
    AIRFLOW__LOGGING__COLORED_CONSOLE_LOG: "true"
    GOOGLE_APPLICATION_CREDENTIALS: "xxx/airflow-sa.json"
    PYTHONPATH: "/opt/airflow/dags/repo"
  connectionsUpdate: true
  connections:
    - id: google_cloud_default
      type: google_cloud_platform
      description: Vector GCP
      ## see the official "google_cloud_platform" connection docs for valid extra configs
      extra: |-
        {"extra__google_cloud_platform__key_path": "xxx/airflow-sa.json", "extra__google_cloud_platform__key_secret_name": "", "extra__google_cloud_platform__keyfile_dict": "", "extra__google_cloud_platform__num_retries": 5, "extra__google_cloud_platform__project": "", "extra__google_cloud_platform__scope": ""}
  extraVolumeMounts:
    ## spec for VolumeMount: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#volumemount-v1-core
    - name: google-service-account
      mountPath: xxx
      readOnly: true
  extraVolumes:
    ## spec for Volume: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#volume-v1-core
    - name: google-service-account
      secret:
        secretName: gcloud-key-airflow

dags:
  gitSync:
    enabled: true
    repo: "xxx"
    branch: "master"
    revision: "HEAD"
    syncWait: 60
    sshSecret: "xxx"
    sshSecretKey: "xxx"
    repoSubPath: "dags"

ingress:
  enabled: false
  web:
    annotations:
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/rewrite-target: /$1
    path: ""
    host: ""
@nhabbash a few questions:
- What version of airflow are you running? (if it's 1.10 you will need to review the older docs)
- When you look into your GCS bucket, do you see any log files being created?
- Are you seeing logs in GCS for tasks that succeed? (airflow only uploads the logs at the end of a task, so if the pod crashes none will be uploaded)
- Are you 100% sure that Secret/gcloud-key-airflow contains a key called airflow-sa.json which is correctly formatted? (use kubectl get secret -o yaml to figure out; see the sketch after this list)
- Why are you setting extra__google_cloud_platform__key_secret_name and extra__google_cloud_platform__keyfile_dict as ""? I imagine that might overwrite extra__google_cloud_platform__key_path
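A minimal sketch of that secret check, assuming the release lives in the airflow namespace and the key name matches the mounted file:

# Confirm the secret has an airflow-sa.json key and that it holds valid JSON
kubectl get secret gcloud-key-airflow -n airflow \
  -o jsonpath='{.data.airflow-sa\.json}' | base64 -d | python -m json.tool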
- What version of airflow are you running? (if it's 1.10 you will need to review the older docs)
Airflow 2.3.3, mistakenly removed it from OP
- When you look into your GCS bucket, do you see any log files being created?

I've actually noticed that for some DAGs the logs are being written to GCS. I have this DAG here that reports the following:
*** Reading remote log from gs://airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-02T07:38:30.880016+00:00/task_id=create_org_config_file/attempt=1.log.
*** Previous log discarded: 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Dcommon_create_organization_config%2Frun_id%3Dmanual__2022-08-02T07%3A38%3A30.880016%2B00%3A00%2Ftask_id%3Dcreate_org_config_file%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-02T07:38:30.880016+00:00/task_id=create_org_config_file/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
On GCS, I can also see the files being created. I also reran some of the other DAGs, but I still get the following:
*** Unable to read remote log from gs://airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-02T09:40:13+02:00/task_id=list_batch_packs/attempt=2.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Ddo_labeled_pack_data_etl%2Frun_id%3Dmanual__2022-08-02T09%3A40%3A13%2B02%3A00%2Ftask_id%3Dlist_batch_packs%2Fattempt%3D2.log?alt=media: No such object: airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-02T09:40:13+02:00/task_id=list_batch_packs/attempt=2.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
If for some DAGs the logs are being uploaded, I believe the issue might not be the connection but something with the DAG definitions? I can share the code or definitions if that's the case.
- Are you seeing logs in GCS for tasks that succeed? (airflow only uploads the logs at the end of a task, so if the pod crashes none will be uploaded)
Tasks always succeed or fail; I have no crashes happening (in fact Airflow is always reading the local logs). In GCS I don't see logs for the DAGs that report "Unable to read remote log from ... 404 GET", as they aren't uploaded at all. But as I discovered, there are a couple of DAGs for which logs are uploaded to GCS, and I can see them both there and in Airflow, as they're read.
- Are you 100% sure that Secret/gcloud-key-airflow contains a key called airflow-sa.json which is correctly formatted? (use kubectl get secret -o yaml to figure out)

Yes, the key is there and it's mounted correctly by K8s in the pods.
- Why are you setting extra__google_cloud_platform__key_secret_name and extra__google_cloud_platform__keyfile_dict as ""? I imagine that might overwrite extra__google_cloud_platform__key_path

I tried setting just extra__google_cloud_platform__key_path, but then the extra JSON field wouldn't parse in Airflow (I would get an error when using the connection). Setting all of the fields solved it; I think you need to specify all the required fields. I found this solution either in some issue on the official Airflow repo or on StackOverflow, but I can't find it again. Inspecting the connection in the UI and through the CLI shows that the connection has the right key_path, so it's not overwritten. Anyway, the connection works when using it for Hooks (I can list/upload/create buckets and blobs on GCS), and it seems to work sometimes for uploading logs, but most of the time it doesn't.
@nhabbash if you are seeing logs in GCS for some DAGs but not others, you should compare what is happening between those dags.
I almost guarantee that something is crashing in the DAGs which are failing to upload, as this is incredibly common (because airflow only uploads logs to GCS after the task is finished).
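If it helps narrow things down, a rough sketch for comparing the two cases (pod name and namespace taken from the logs above, still assumptions): watch the worker output right after one of the "missing" tasks finishes and look for upload or permission errors from the remote log handler, and confirm that the local file it should be pushing actually exists:

# Errors from the remote log handler usually show up in the worker output
kubectl logs airflow-worker-0 -n airflow --since=30m | grep -iE "gcs|google|upload|error"

# The local file the handler is supposed to upload at task end
kubectl exec -n airflow airflow-worker-0 -- ls -l "/opt/airflow/logs/dag_id=do_labeled_pack_data_etl"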
@thesuperzapper the DAGs and their tasks are succeeding in both cases and not crashing - if they are crashing somewhere I'm not really seeing it. Here are some screenshots and logs (just the log loading and task succeeding message):
DAG w/ remote logging not working (first task example):
Screenshots
Log
*** Unable to read remote log from gs://airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-04T08:59:12+00:00/task_id=list_batch_packs/attempt=1.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Ddo_labeled_pack_data_etl%2Frun_id%3Dmanual__2022-08-04T08%3A59%3A12%2B00%3A00%2Ftask_id%3Dlist_batch_packs%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-04T08:59:12+00:00/task_id=list_batch_packs/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
*** Reading local file: /opt/airflow/logs/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-04T08:59:12+00:00/task_id=list_batch_packs/attempt=1.log
...
...
[2022-08-04, 08:59:26 UTC] {python.py:173} INFO - Done. Returned value was: <REDACTED (it's a dict)>
[2022-08-04, 08:59:26 UTC] {taskinstance.py:1415} INFO - Marking task as SUCCESS. dag_id=do_labeled_pack_data_etl, task_id=list_batch_packs, execution_date=20220804T085912, start_date=20220804T085919, end_date=20220804T085926
[2022-08-04, 08:59:26 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-08-04, 08:59:26 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
DAG w/ remote logging working:
Screenshots
Log
*** Reading remote log from gs://airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-04T08:57:54.498373+00:00/task_id=create_org_config_file/attempt=1.log.
*** Previous log discarded: 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Dcommon_create_organization_config%2Frun_id%3Dmanual__2022-08-04T08%3A57%3A54.498373%2B00%3A00%2Ftask_id%3Dcreate_org_config_file%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-04T08:57:54.498373+00:00/task_id=create_org_config_file/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
...
...
[2022-08-04, 08:57:56 UTC] {python.py:173} INFO - Done. Returned value was: None
[2022-08-04, 08:57:56 UTC] {taskinstance.py:1415} INFO - Marking task as SUCCESS. dag_id=common_create_organization_config, task_id=create_org_config_file, execution_date=20220804T085754, start_date=20220804T085755, end_date=20220804T085756
[2022-08-04, 08:57:56 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-08-04, 08:57:56 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
@nhabbash I think the docs forgot to say that you MUST set AIRFLOW__LOGGING__GOOGLE_KEY_PATH if not using "GKE Workload Identity".
Can you confirm that everything works when you set values like this:
airflow:
  config:
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "gs://<<MY-BUCKET-NAME>>/airflow/logs"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "my_gcp"

    ## this is the new part \/
    AIRFLOW__LOGGING__GOOGLE_KEY_PATH: "/opt/airflow/secrets/gcp-keyfile/keyfile.json"
    ## this is the new part /\

  connections:
    - id: my_gcp
      type: google_cloud_platform
      description: my GCP connection
      extra: |
        {
          "extra__google_cloud_platform__key_path": "/opt/airflow/secrets/gcp-keyfile/keyfile.json",
          "extra__google_cloud_platform__num_retries": "5"
        }

  extraVolumeMounts:
    - name: gcp-keyfile
      mountPath: /opt/airflow/secrets/gcp-keyfile
      readOnly: true

  extraVolumes:
    - name: gcp-keyfile
      secret:
        ## assumes that `Secret/my-gcp-keyfile` contains a key called `keyfile.json`
        secretName: my-gcp-keyfile
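For reference, applying values like these is just a normal upgrade of this chart (a sketch; the release name, namespace, repo alias, and values file name are assumptions):

# Re-deploy so the pods pick up the new config and volumes
helm upgrade airflow airflow-stable/airflow \
  --namespace airflow \
  --values ./custom-values.yaml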
Also, can you confirm 100% whether setting only extra__google_cloud_platform__key_path and extra__google_cloud_platform__num_retries works, or if you need to specify the other ones?
@thesuperzapper I did try setting AIRFLOW__LOGGING__GOOGLE_KEY_PATH previously, but it did not work. I tried again with these values:
airflow:
  config:
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: ""  # Set through helm --set
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "google_cloud_default"
    AIRFLOW__LOGGING__GOOGLE_KEY_PATH: "/opt/airflow/secrets/google-service-account/airflow-sa.json"
    GOOGLE_APPLICATION_CREDENTIALS: "/opt/airflow/secrets/google-service-account/airflow-sa.json"
  connectionsUpdate: true
  connections:
    - id: google_cloud_default
      type: google_cloud_platform
      description: GCP Connection
      extra: |-
        {"extra__google_cloud_platform__key_path": "/opt/airflow/secrets/google-service-account/airflow-sa.json", "extra__google_cloud_platform__num_retries": 5}
  extraVolumeMounts:
    - name: google-service-account
      mountPath: /opt/airflow/secrets/google-service-account
      readOnly: true
  extraVolumes:
    - name: google-service-account
      secret:
        secretName: gcloud-key-airflow
Unfortunately, for remote logging I'm still getting the exact same output as in my last post: the DAG that previously had remote logging still has it, the other one still does not, and the situation is identical (same logs and status). This is the config:
[logging]
base_log_folder = /opt/airflow/logs
remote_logging = True
remote_log_conn_id = google_cloud_default
google_key_path = /opt/airflow/secrets/google-service-account/airflow-sa.json
remote_base_log_folder = gs://airflow-logs-store/staging
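A quick sanity check (pod name and namespace are assumptions) that the key file is actually mounted at the path the logging config points at:

kubectl exec -n airflow airflow-worker-0 -- \
  ls -l /opt/airflow/secrets/google-service-account/airflow-sa.json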
One thing is good though: the connection is now set correctly even without defining all the extra fields (previously I was getting something similar to this, but maybe this chart does things differently), although I can't explain why I wasn't able to use the extra field with just key_path set before.
@naturalett can you try restarting all your worker pods (so they pick up new secret values)?
Also, where are you copying the [logging] configs you believe are in your cluster from?
@thesuperzapper this was a fresh branch deployment, so the pods were newly created and the secrets were mounted anew. The logging config I got by opening a shell in an Airflow worker pod and running airflow config list, just to check whether everything had been set correctly.
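A sketch of comparing the effective setting across all pods at once, in case one component is still running with stale config (a naive loop; pods without the Airflow CLI will simply error, hence the || true):

for pod in $(kubectl get pods -n airflow -o name); do
  echo "== $pod"
  kubectl exec -n airflow "$pod" -- airflow config get-value logging google_key_path || true
done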
This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.
Issues never become stale if any of the following is true:
- they are added to a Project
- they are added to a Milestone
- they have the lifecycle/frozen label
@nhabbash did you end up solving this logging issue?
I am not sure why the bot closed this, as there was clearly activity in the last 60 days.
Either way, do you have any updates @nhabbash?
Hey @thesuperzapper, sorry for the late reply. Unfortunately I haven't found a solution for this; I moved on to working on other stuff and haven't had the chance to revisit the issue. I tried checking again just now, and I'm still left with some DAGs that are able to write logs to GCS, and others that don't write them and hence lose their logs once the pods are restarted.
This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.
Issues never become stale if any of the following is true:
- they are added to a Project
- they are added to a Milestone
- they have the lifecycle/frozen label