
Superset worker restarting and Celery ignoring config values - Helm / Kubernetes / SQS / S3

dangal95 opened this issue on Nov 25, 2022 · 0 comments

I am running Superset on Kubernetes (EKS v1.23, Helm chart v0.7.7, Superset Docker image tag "2-0"). I am using SQS as my Celery broker and S3 as my results backend and cache. The S3 caching and results backend work, but the SQS broker setup is not working as expected.

How to reproduce the bug

The worker runs this command as a liveness probe: celery -A superset.tasks.celery_app:app inspect ping -d celery@$HOSTNAME. However, it fails with the error Error: No nodes replied within time constraint. I'm not sure why this is happening; I've followed the documentation pages below and set everything up accordingly:

  • Celery Configuration - https://docs.celeryq.dev/en/stable/userguide/configuration.html
  • Celery with SQS - https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/sqs.html
  • Running Superset on Kubernetes - https://superset.apache.org/docs/installation/running-on-kubernetes
  • Async Queries using Celery - https://superset.apache.org/docs/installation/async-queries-celery
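
To rule out a simple timing problem, the same ping the probe runs can be executed by hand inside the worker pod with a longer reply timeout (the pod name is a placeholder; --timeout is the standard celery inspect reply-timeout option):

    kubectl exec -it <superset-worker-pod> -- sh -c \
      'celery -A superset.tasks.celery_app:app inspect ping -d celery@$HOSTNAME --timeout 10'

If this still returns "Error: No nodes replied within time constraint" even with a generous timeout, the worker is genuinely not answering remote-control (pidbox) messages rather than just answering slowly.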

This is my Celery configuration inside the values.yaml file:

configOverrides:    
  enable_s3_caching: |
    from s3cache.s3cache import S3Cache
    from datetime import timedelta
    from flask import Flask
    from flask_caching import Cache
    from superset.config import *

    SQLALCHEMY_DATABASE_URI = f"postgresql://{db_username}:{db_password}@{host_name}/superset"
    S3_CACHE_BUCKET = BUCKET_NAME
    SQL_LAB_S3_CACHE_KEY_PREFIX = 'sql-lab-result/'
    CHARTING_DATA_S3_CACHE_KEY_PREFIX = 'chart-query-results/'
    FILTER_STATE_S3_CACHE_KEY_PREFIX = 'filter-state-results/'
    EXPLORE_FORM_S3_CACHE_KEY_PREFIX = 'explore-form-results/'
    THUMBNAIL_S3_CACHE_KEY_PREFIX = 'thumbnails/'

    RESULTS_BACKEND = S3Cache(S3_CACHE_BUCKET, SQL_LAB_S3_CACHE_KEY_PREFIX)

    def init_data_cache(app: Flask, config, cache_args, cache_options) -> S3Cache:
      return S3Cache(S3_CACHE_BUCKET, CHARTING_DATA_S3_CACHE_KEY_PREFIX)

    def init_filter_state_cache(app: Flask, config, cache_args, cache_options) -> S3Cache:
      return S3Cache(S3_CACHE_BUCKET, FILTER_STATE_S3_CACHE_KEY_PREFIX)
    
    def init_explore_cache(app: Flask, config, cache_args, cache_options) -> S3Cache:
      return S3Cache(S3_CACHE_BUCKET, EXPLORE_FORM_S3_CACHE_KEY_PREFIX)

    def init_thumbnail_cache(app: Flask, config, cache_args, cache_options) -> S3Cache:
      return S3Cache(S3_CACHE_BUCKET, THUMBNAIL_S3_CACHE_KEY_PREFIX)

    THUMBNAIL_CACHE_CONFIG = {'CACHE_TYPE': 'superset_config.init_thumbnail_cache'}
    DATA_CACHE_CONFIG = {'CACHE_TYPE': 'superset_config.init_data_cache'}
    FILTER_STATE_CACHE_CONFIG = {'CACHE_TYPE': 'superset_config.init_filter_state_cache'}
    EXPLORE_FORM_DATA_CACHE_CONFIG = {'CACHE_TYPE': 'superset_config.init_explore_cache'}
    SECRET_KEY = f"SECRET_KEY "    
    
    ENABLE_PROXY_FIX = True

    THUMBNAIL_SELENIUM_USER = SELENIUM_USER
    WEBDRIVER_BASEURL = BASE_URL

    CELERY_ENABLE_REMOTE_CONTROL = False

    class CeleryConfig:
      
      task_queues = None
      broker_url = "sqs://"
      broker_transport_options = {
        'region': 'eu-central-1',
      }
      imports = ("superset.sql_lab", 'superset.tasks',)
      worker_log_level = "DEBUG"
      worker_prefetch_multiplier = 1
      worker_enable_remote_control = False
      task_default_queue = "celery"
      #task_acks_late = False
      task_annotations = {
          "sql_lab.get_sql_results": {"rate_limit": "100/s"},
      }

    CELERY_CONFIG = CeleryConfig
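
To check whether Celery is actually picking up these values at runtime, the effective configuration can be printed from inside the worker pod (a minimal sketch; it assumes the rendered superset_config.py is importable there, which is how the liveness probe already loads the app):

    kubectl exec -it <superset-worker-pod> -- python
    >>> from superset.tasks.celery_app import app
    >>> app.conf.broker_url                      # expect "sqs://"
    >>> app.conf.worker_enable_remote_control    # expect False if CeleryConfig is applied
    >>> app.conf.task_default_queue              # expect "celery"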

Expected results

What I expect is that the worker does not fail the liveness probe, especially since Superset is able to automatically create a queue named "celery" in SQS. Furthermore, I expect Celery not to create "pid" queues, because I set both worker_enable_remote_control = False and CELERY_ENABLE_REMOTE_CONTROL = False.
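
The queues that actually exist in SQS can be listed with the standard AWS CLI and compared against this expectation (region taken from broker_transport_options above):

    aws sqs list-queues --region eu-central-1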

Actual results

The worker is restarting because of the error Error: No nodes replied within time constraint, and worker_enable_remote_control = False is being completely ignored, because I can see multiple queues being created (see screenshot).
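
For reference, kombu's SQS transport also supports a predefined_queues transport option, which restricts the transport to an explicit set of queues instead of creating them on demand. A minimal sketch (the queue URL is a placeholder; credentials are omitted on the assumption that the pod's IAM role is used, and depending on the kombu version they may need to be passed explicitly):

    broker_transport_options = {
        'region': 'eu-central-1',
        'predefined_queues': {
            'celery': {
                'url': 'https://sqs.eu-central-1.amazonaws.com/<account-id>/celery',
            },
        },
    }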

The worker restarts multiple times (see screenshot), and every time it creates a lot of "pid" queues.

When the worker starts again, this is the output I get from Celery (see screenshot).
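
The restart loop itself can be confirmed from the Kubernetes side with standard kubectl commands (the label selector is an assumption about how the chart labels the worker pods):

    kubectl get pods -l app=superset-worker         # RESTARTS column keeps climbing
    kubectl describe pod <superset-worker-pod>      # Events show the failed liveness probe
    kubectl logs <superset-worker-pod> --previous   # worker output from before it was killed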

Environment

  • superset version: 2.0.1
  • python version: 3.8.12
  • EKS version: 1.23
  • Docker tag: 2-0
  • Helm chart version: 0.7.7
  • celery[sqs] (pip) version: 5.2.7
  • boto3 (pip) version: 1.26.2
  • s3werkzeugcache (pip) version: 0.2.1

I would really appreciate your help on this because I can't seem to find anything online about it.

dangal95 · Nov 25 '22, 10:11