
Logs not being written to remote storage (GCS)

Open nhabbash opened this issue 2 years ago • 8 comments

Checks

Chart Version

8.6.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.2200", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-20T09:29:14Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1

Helm Version

version.BuildInfo{Version:"v3.7.0", GitCommit:"eeac83883cb4014fe60267ec6373570374ce770b", GitTreeState:"clean", GoVersion:"go1.16.8"}

Description

I'm using Remote Logging to store logs in a GCS bucket, but the logs don't seem to be uploaded. This is what I see in the task logs:

*** Unable to read remote log from gs://airflow-logs-store/staging/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/staging%2Fdag_id%3Ddo_labeled_pack_data_etl%2Frun_id%3Dmanual__2022-08-01T09%3A35%3A44%2B00%3A00%2Ftask_id%3Dlist_batch_packs%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/staging/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

*** Log file does not exist: /opt/airflow/logs/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log
*** Fetching from: http://airflow-worker-0.airflow-worker.airflow.svc.cluster.local:8793/log/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-01T09:35:44+00:00/task_id=list_batch_packs/attempt=1.log

This seems to happen with all DAGs.

I already have a working connection to GCP, and the Service Account has the right permissions (in fact, this DAG moves some data to a GCS bucket using a Hook with that connection, confirming that the connection works). The DAGs all end cleanly: there are no crashes, the final status is either success or failure, and local logs are available.
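For context, here is a rough sketch of the kind of Hook usage that DAG relies on (the task, bucket, and object names are placeholders, not the actual DAG code):

from airflow.decorators import task
from airflow.providers.google.cloud.hooks.gcs import GCSHook

@task
def upload_sample():
    # Uses the same google_cloud_default connection that remote logging should use.
    hook = GCSHook(gcp_conn_id="google_cloud_default")
    # Placeholder bucket/object; the real DAG moves actual data.
    hook.upload(
        bucket_name="some-data-bucket",
        object_name="etl/sample.json",
        data='{"ok": true}',
    )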

This is the current configuration for Logging:

$ airflow config list
[logging]
base_log_folder = /opt/airflow/logs
remote_logging = True
remote_log_conn_id = google_cloud_default
google_key_path = 
remote_base_log_folder = gs://airflow-logs-store/staging
encrypt_s3_logs = False
logging_level = INFO
celery_logging_level = 
fab_logging_level = WARNING
logging_config_class = 
colored_console_log = True
colored_log_format = [%(blue)s%(asctime)s%(reset)s] {%(blue)s%(filename)s:%(reset)s%(lineno)d} %(log_color)s%(levelname)s%(reset)s - %(log_color)s%(message)s%(reset)s
colored_formatter_class = airflow.utils.log.colored_log.CustomTTYColoredFormatter
log_format = [%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s
simple_log_format = %(asctime)s %(levelname)s - %(message)s
task_log_prefix_template = 
log_filename_template = dag_id={{ ti.dag_id }}/run_id={{ ti.run_id }}/task_id={{ ti.task_id }}/{% if ti.map_index >= 0 %}map_index={{ ti.map_index }}/{% endif %}attempt={{ try_number }}.log
log_processor_filename_template = {{ filename }}.log
dag_processor_manager_log_location = /opt/airflow/logs/dag_processor_manager/dag_processor_manager.log
task_log_reader = task
extra_logger_names = 
worker_log_server_port = 8793

I have also tried adding a custom google_key_path (the same one that I'm using for the default connection), but the logs still weren't uploaded. What can I do to troubleshoot this?
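(For reference, one check that could be run from a shell inside a worker pod is whether the logging connection itself can write under the remote log prefix; a rough sketch, using the bucket and prefix from the config above:)

from airflow.providers.google.cloud.hooks.gcs import GCSHook

# Run inside a worker pod (e.g. via kubectl exec) to test writes to the
# same bucket/prefix that remote_base_log_folder points at.
hook = GCSHook(gcp_conn_id="google_cloud_default")
hook.upload(
    bucket_name="airflow-logs-store",
    object_name="staging/_connectivity_check.log",
    data="remote logging connectivity check",
)
print(hook.exists(bucket_name="airflow-logs-store", object_name="staging/_connectivity_check.log"))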

Relevant Logs

No response

Custom Helm Values

airflow:
  executor: CeleryExecutor
  defaultSecurityContext: {}
  usersUpdate: false
  config:
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "60"
    AIRFLOW__CORE__ENABLE_XCOM_PICKLING: "True"
    AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth, airflow.api.auth.backend.session"
    
    AIRFLOW__LOGGING__REMOTE_LOGGING: "true"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "" # This is set by helm --set
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "google_cloud_default"
    AIRFLOW__LOGGING__COLORED_CONSOLE_LOG: "true"

    GOOGLE_APPLICATION_CREDENTIALS: "xxx/airflow-sa.json"
    PYTHONPATH: "/opt/airflow/dags/repo"

  connectionsUpdate: true
  connections:
    - id: google_cloud_default
      type: google_cloud_platform
      description: Vector GCP
      
      ## see the official "google_cloud_platform" connection docs for valid extra configs
      extra: |-
        {"extra__google_cloud_platform__key_path": "xxx/airflow-sa.json", "extra__google_cloud_platform__key_secret_name": "", "extra__google_cloud_platform__keyfile_dict": "", "extra__google_cloud_platform__num_retries": 5, "extra__google_cloud_platform__project": "", "extra__google_cloud_platform__scope": ""}

  extraVolumeMounts:
    ## spec for VolumeMount: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#volumemount-v1-core
    - name: google-service-account
      mountPath: xxx
      readOnly: true

  extraVolumes:
    ## spec for Volume: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.20/#volume-v1-core
    - name: google-service-account
      secret:
        secretName: gcloud-key-airflow
dags:
  gitSync:
    enabled: true
    repo: "xxx"
    branch: "master"
    revision: "HEAD"
    syncWait: 60
    sshSecret: "xxx"
    sshSecretKey: "xxx"
    repoSubPath: "dags"

ingress:
  enabled: false
  web:
    annotations: 
      kubernetes.io/ingress.class: nginx
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/rewrite-target: /$1
    path: ""
    host: ""

nhabbash avatar Aug 01 '22 10:08 nhabbash

@nhabbash a few questions:

  1. What version of Airflow are you running? (If it's 1.10, you will need to review the older docs.)
  2. When you look into your GCS bucket, do you see any log files being created?
  3. Are you seeing logs in GCS for tasks that succeed? (Airflow only uploads the logs at the end of a task, so if the pod crashes none will be uploaded.)
  4. Are you 100% sure that Secret/gcloud-key-airflow contains a key called airflow-sa.json which is correctly formatted? (Use kubectl get secret -o yaml to check.)
  5. Why are you setting extra__google_cloud_platform__key_secret_name and extra__google_cloud_platform__keyfile_dict as ""? I imagine that might overwrite extra__google_cloud_platform__key_path.

thesuperzapper avatar Aug 02 '22 00:08 thesuperzapper

  1. What version of Airflow are you running? (If it's 1.10, you will need to review the older docs.)

Airflow 2.3.3; I mistakenly removed it from the OP.

  2. When you look into your GCS bucket, do you see any log files being created?

I've actually noticed that for some DAGs the logs are being written to GCS. For example, this DAG reports the following:
*** Reading remote log from gs://airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-02T07:38:30.880016+00:00/task_id=create_org_config_file/attempt=1.log.
*** Previous log discarded: 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Dcommon_create_organization_config%2Frun_id%3Dmanual__2022-08-02T07%3A38%3A30.880016%2B00%3A00%2Ftask_id%3Dcreate_org_config_file%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-02T07:38:30.880016+00:00/task_id=create_org_config_file/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

On GCS, I can also see the files being created. I also reran some of the other DAGs, but I still get the following:

*** Unable to read remote log from gs://airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-02T09:40:13+02:00/task_id=list_batch_packs/attempt=2.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Ddo_labeled_pack_data_etl%2Frun_id%3Dmanual__2022-08-02T09%3A40%3A13%2B02%3A00%2Ftask_id%3Dlist_batch_packs%2Fattempt%3D2.log?alt=media: No such object: airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-02T09:40:13+02:00/task_id=list_batch_packs/attempt=2.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

Since the logs are being uploaded for some DAGs, I believe the issue might not be the connection but something with the DAG definitions? I can share the code or definitions if that helps.

  3. Are you seeing logs in GCS for tasks that succeed? (Airflow only uploads the logs at the end of a task, so if the pod crashes none will be uploaded.)

Tasks always either succeed or fail; I have no crashes happening (in fact, Airflow is always reading the local logs). In GCS I don't see logs for the DAGs that report Unable to read remote log from ... 404 GET, as they aren't uploaded at all. But as I discovered, there are a couple of DAGs whose logs are uploaded to GCS, and I can see them both there and in Airflow, where they're read.

  4. Are you 100% sure that Secret/gcloud-key-airflow contains a key called airflow-sa.json which is correctly formatted? (Use kubectl get secret -o yaml to check.)

Yes, the key is there and it's mounted correctly by K8s in the PODs.

  5. Why are you setting extra__google_cloud_platform__key_secret_name and extra__google_cloud_platform__keyfile_dict as ""? I imagine that might overwrite extra__google_cloud_platform__key_path.

I tried setting just extra__google_cloud_platform__key_path, but the extra JSON field wouldn't parse in Airflow (I would get an error when using the connection). Setting all the fields solved it; I think you need to specify all the required fields. I found this workaround either in some issue in the official Airflow repo or on StackOverflow, but I can't find it again. Inspecting the connection in the UI and through the CLI shows that it has the right key_path, so it's not overwritten. Anyway, the connection works when used for Hooks (I can list/upload/create buckets and blobs on GCS), and it seems to work sometimes for uploading logs, but most of the time it doesn't.

nhabbash avatar Aug 02 '22 08:08 nhabbash

@nhabbash if you are seeing logs in GCS for some DAGs but not others, you should compare what is happening between those DAGs.

I almost guarantee that something is crashing in the DAGs that are failing to upload, as this is incredibly common (because Airflow only uploads logs to GCS after the task is finished).
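For example, here is a rough sketch of how you could list which dag_id prefixes actually have log objects in your bucket (the bucket name and prefix are taken from the config in your original post; adjust them to match your remote_base_log_folder):

from airflow.providers.google.cloud.hooks.gcs import GCSHook

hook = GCSHook(gcp_conn_id="google_cloud_default")
# Group log object names by their dag_id=... prefix so you can see which
# DAGs ever manage to upload anything.
objects = hook.list(bucket_name="airflow-logs-store", prefix="staging/")
dag_prefixes = {name.split("/")[1] for name in objects if name.count("/") >= 1}
print(sorted(dag_prefixes))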

thesuperzapper avatar Aug 04 '22 01:08 thesuperzapper

@thesuperzapper the DAGs and their tasks are succeeding in both cases and not crashing; if they are crashing somewhere, I'm not seeing it. Here are some screenshots and logs (just the log-loading and task-success messages):

DAG w/ remote logging not working (first task example):

Screenshots

Screenshot from 2022-08-04 10-59-51

Screenshot from 2022-08-04 11-00-01

Log
*** Unable to read remote log from gs://airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-04T08:59:12+00:00/task_id=list_batch_packs/attempt=1.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Ddo_labeled_pack_data_etl%2Frun_id%3Dmanual__2022-08-04T08%3A59%3A12%2B00%3A00%2Ftask_id%3Dlist_batch_packs%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/local/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-04T08:59:12+00:00/task_id=list_batch_packs/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
*** Reading local file: /opt/airflow/logs/dag_id=do_labeled_pack_data_etl/run_id=manual__2022-08-04T08:59:12+00:00/task_id=list_batch_packs/attempt=1.log
...
...
[2022-08-04, 08:59:26 UTC] {python.py:173} INFO - Done. Returned value was: <REDACTED (it's a dict)>
[2022-08-04, 08:59:26 UTC] {taskinstance.py:1415} INFO - Marking task as SUCCESS. dag_id=do_labeled_pack_data_etl, task_id=list_batch_packs, execution_date=20220804T085912, start_date=20220804T085919, end_date=20220804T085926
[2022-08-04, 08:59:26 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-08-04, 08:59:26 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check

DAG w/ remote logging working:

Screenshots

Screenshot from 2022-08-04 10-58-15

Screenshot from 2022-08-04 11-00-38

Log
*** Reading remote log from gs://airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-04T08:57:54.498373+00:00/task_id=create_org_config_file/attempt=1.log.
*** Previous log discarded: 404 GET https://storage.googleapis.com/download/storage/v1/b/airflow-logs-store/o/local%2Fdag_id%3Dcommon_create_organization_config%2Frun_id%3Dmanual__2022-08-04T08%3A57%3A54.498373%2B00%3A00%2Ftask_id%3Dcreate_org_config_file%2Fattempt%3D1.log?alt=media: No such object: airflow-logs-store/local/dag_id=common_create_organization_config/run_id=manual__2022-08-04T08:57:54.498373+00:00/task_id=create_org_config_file/attempt=1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
... 
...
[2022-08-04, 08:57:56 UTC] {python.py:173} INFO - Done. Returned value was: None
[2022-08-04, 08:57:56 UTC] {taskinstance.py:1415} INFO - Marking task as SUCCESS. dag_id=common_create_organization_config, task_id=create_org_config_file, execution_date=20220804T085754, start_date=20220804T085755, end_date=20220804T085756
[2022-08-04, 08:57:56 UTC] {local_task_job.py:156} INFO - Task exited with return code 0
[2022-08-04, 08:57:56 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check

nhabbash avatar Aug 04 '22 09:08 nhabbash

@nhabbash I think the docs forgot to say that you MUST set AIRFLOW__LOGGING__GOOGLE_KEY_PATH if not using "GKE Workload Identity".

Can you confirm that everything works when you set values like this:

airflow:
  config:
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "gs://<<MY-BUCKET-NAME>>/airflow/logs"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "my_gcp"

    ## this is the new part \/
    AIRFLOW__LOGGING__GOOGLE_KEY_PATH: "/opt/airflow/secrets/gcp-keyfile/keyfile.json"
    ## this is the new part /\

  connections:
    - id: my_gcp
      type: google_cloud_platform
      description: my GCP connection
      extra: |
        {
          "extra__google_cloud_platform__key_path": "/opt/airflow/secrets/gcp-keyfile/keyfile.json",
          "extra__google_cloud_platform__num_retries: "5"
        }

  extraVolumeMounts:
    - name: gcp-keyfile
      mountPath: /opt/airflow/secrets/gcp-keyfile
      readOnly: true

  extraVolumes:
    - name: gcp-keyfile
      secret:
        ## assumes that `Secret/my-gcp-keyfile` contains a key called `keyfile.json`
        secretName: my-gcp-keyfile

Also, can you confirm 100% whether setting only extra__google_cloud_platform__key_path and extra__google_cloud_platform__num_retries works, or whether you need to specify the other ones as well?
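(A quick way to see what Airflow actually parses from the extras, roughly, run from a shell in any Airflow pod:)

from airflow.hooks.base import BaseHook

# Prints the parsed extras for the connection, so you can confirm which
# extra__google_cloud_platform__* fields are actually set.
conn = BaseHook.get_connection("my_gcp")
print(conn.extra_dejson)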

thesuperzapper avatar Aug 05 '22 02:08 thesuperzapper

@thesuperzapper I did try setting AIRFLOW__LOGGING__GOOGLE_KEY_PATH previously, but it did not work. I tried again with these values:

airflow:
  config:
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "" # Set through helm --set
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "google_cloud_default"
    AIRFLOW__LOGGING__GOOGLE_KEY_PATH: "/opt/airflow/secrets/google-service-account/airflow-sa.json"

    GOOGLE_APPLICATION_CREDENTIALS: "/opt/airflow/secrets/google-service-account/airflow-sa.json"

  connectionsUpdate: true
  connections:
    - id: google_cloud_default
      type: google_cloud_platform
      description: GCP Connection
      
      extra: |-
        {"extra__google_cloud_platform__key_path": "/opt/airflow/secrets/google-service-account/airflow-sa.json", "extra__google_cloud_platform__num_retries": 5}

  extraVolumeMounts:
    - name: google-service-account
      mountPath: /opt/airflow/secrets/google-service-account
      readOnly: true

  extraVolumes:
    - name: google-service-account
      secret:
        secretName: gcloud-key-airflow

Unfortunately, for remote logging I'm still getting exactly the same output as in my last post: the DAG that previously had remote logging still has it, the other one does not, and the situation is identical (same logs and status). This is the config:

[logging]
base_log_folder = /opt/airflow/logs
remote_logging = True
remote_log_conn_id = google_cloud_default
google_key_path = /opt/airflow/secrets/google-service-account/airflow-sa.json
remote_base_log_folder = gs://airflow-logs-store/staging

One good thing, though: the connection is now set correctly even without defining all the extra fields. (Previously I was getting something similar to this, but maybe this chart does things differently, although I can't explain why I couldn't use the extra field with just key_path set before.)

nhabbash avatar Aug 05 '22 09:08 nhabbash

@nhabbash can you try restarting all your worker pods (so they pick up new secret values)?

Also, where are you copying the [logging] configs you believe are in your cluster from?

thesuperzapper avatar Aug 05 '22 10:08 thesuperzapper

@thesuperzapper this was a fresh branch deployment, so the pods were newly created and the secrets were mounted anew. I got the logging config by opening a shell in an Airflow worker pod and running airflow config list, just to check whether everything had been set correctly.

nhabbash avatar Aug 05 '22 14:08 nhabbash

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label

stale[bot] avatar Oct 08 '22 06:10 stale[bot]

@nhabbash did you end up solving this logging issue?

thesuperzapper avatar Oct 10 '22 03:10 thesuperzapper

I am not sure why the bot closed this, as there was clearly activity in the last 60 days.

Either way, do you have any updates @nhabbash?

thesuperzapper avatar Oct 20 '22 00:10 thesuperzapper

I am not sure why the bot closed this, as there was clearly activity in the last 60 days.

Either way, do you have any updates @nhabbash?

Hey @thesuperzapper, sorry for the late reply. Unfortunately I haven't found a solution for this; I moved on to other work and haven't had the chance to revisit the issue. I tried checking again just now, and I'm still left with some DAGs that are able to write logs to GCS, and others that don't write them and hence lose their logs once the pods are restarted.

nhabbash avatar Nov 07 '22 15:11 nhabbash

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label

stale[bot] avatar Jan 07 '23 10:01 stale[bot]