
Maintenance jobs unable to compress all chunks

Open · mikberg opened this issue on Nov 4, 2022 · 7 comments

Describe the bug

My Promscale instance is almost constantly firing the PromscaleMaintenanceJobNotKeepingup alert, which appears to be because promscale_sql_database_chunks_metrics_uncompressed_count never drops below the alert threshold of 10. Instead, it varies between roughly 600 and 1100, depending on the maintenance job settings.

I have tried running call prom_api.execute_maintenance(); manually and repeatedly (in a loop), and I have also tried an aggressive schedule of four maintenance jobs running every 5 minutes. They still seem to hit a "floor" of around 600 uncompressed chunks.
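
For concreteness, such a manual run is just the documented maintenance procedure invoked from psql; the \watch interval below is only one illustrative way to repeat it, not the only option:

-- run one maintenance pass; repeat it (e.g. with psql's \watch) to emulate the loop
CALL prom_api.execute_maintenance();
\watch 300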

Unfortunately, I haven't been able to run the full debugging query from the runbook, as the database goes into recovery mode whenever I try.
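
A much lighter check than the full runbook query is simply counting the uncompressed chunks. This is a sketch only: it assumes the standard timescaledb_information.chunks view and that Promscale keeps metric hypertables in the prom_data schema.

-- total number of uncompressed metric chunks
SELECT count(*) AS uncompressed_metric_chunks
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data'
  AND NOT is_compressed;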

To Reproduce

Not sure.

Expected behavior

promscale_sql_database_chunks_metrics_uncompressed_count dropping below 10 once the maintenance jobs have finished.

Screenshots: Screenshot 2022-11-04 at 11 29 37 (attached image)

Configuration (as applicable)

  • Promscale Connector:
startup.dataset.config: |
  metrics:
    compress_data: true  # default
    default_retention_period: 90d  # default
    default_chunk_interval: 2h  # default is 8h; reduced in effort to mitigate PromscaleMaintenanceJobRunningTooLong
  traces:
    default_retention_period: 30d  # default
  • TimescaleDB:
shared_buffers: 1280MB
effective_cache_size: 3840MB
maintenance_work_mem: 640MB
work_mem: 8738kB
timescaledb.max_background_workers: 8
max_worker_processes: 13
max_parallel_workers_per_gather: 1
max_parallel_workers: 2
wal_buffers: 16MB
min_wal_size: 2GB
max_wal_size: 4GB
checkpoint_timeout: 900
bgwriter_delay: 10ms
bgwriter_lru_maxpages: 100000
default_statistics_target: 500
random_page_cost: 1.1
checkpoint_completion_target: 0.9
max_connections: 75
max_locks_per_transaction: 64
autovacuum_max_workers: 10
autovacuum_naptime: 10
effective_io_concurrency: 256
timescaledb.last_tuned: '2022-10-28T08:48:02Z'
timescaledb.last_tuned_version: '0.14.1'

Version

  • Distribution/OS:
  • Promscale: 0.16.0, 0.7.0 (extension)
  • TimescaleDB: 2.8.1

Additional context

  • PostgreSQL runs via the Crunchy postgres-operator; the database is allocated 8 GB of memory and uses about 5-6 GB on average.
  • Average ingest is around 2000 samples/sec, according to the Grafana dashboard.

mikberg avatar Nov 04 '22 11:11 mikberg

The number of uncompressed chunks depends on the number of unique metric names. Each metric name gets its own hypertable, and at any point in time there shouldn't be more than 2 uncompressed chunks per hypertable: the current chunk, which incoming data is written to, and the previous chunk, which is kept uncompressed for one hour after the current chunk is created so that late-arriving data can still land in it.
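
To see which hypertables are holding more than the expected two uncompressed chunks, a query along these lines should work (a sketch; it assumes the standard timescaledb_information.chunks view and that the metric hypertables live in the prom_data schema):

-- metric hypertables with more uncompressed chunks than expected
SELECT hypertable_name, count(*) AS uncompressed_chunks
FROM timescaledb_information.chunks
WHERE hypertable_schema = 'prom_data'
  AND NOT is_compressed
GROUP BY hypertable_name
HAVING count(*) > 2
ORDER BY uncompressed_chunks DESC;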

How many unique metric names do you have?

ramonguiu avatar Nov 05 '22 05:11 ramonguiu

The number of uncompressed chunks depends on the number of unique metric names. Each metric name gets its own hypertable, and at any point in time there shouldn't be more than 2 uncompressed chunks per hypertable: the current chunk, which incoming data is written to, and the previous chunk, which is kept uncompressed for one hour after the current chunk is created so that late-arriving data can still land in it.

How many unique metric names do you have?

tsdb=# select count(*) from information_schema.tables where table_schema='prom_metric';
 count
-------
  2617

(Prometheus reports 2015 label values for __name__, so some of those tables are probably leftovers.)

Thanks, this was very informative. Do I understand correctly that I shouldn't really expect the uncompressed chunk count to fall much below 2 * (number of unique metric names)? In that case, the default alert threshold of 10 sounds very low.

mikberg avatar Nov 07 '22 09:11 mikberg

Thanks, this was very informative. Do I understand correctly that I shouldn't really expect the uncompressed chunk count to fall much below 2 * (number of unique metric names)? In that case, the default alert threshold of 10 sounds very low.

Yes, that's correct. Let me check with the team why the alert is defined like that.

ramonguiu avatar Nov 13 '22 22:11 ramonguiu

I agree, this should be changed to

(
    min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h]) > 2 * promscale_sql_database_metric_count
)
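
In Prometheus rule-file form that would look roughly like the following (a sketch only; the group name, for duration and labels are illustrative, not the shipped rule):

groups:
  - name: promscale-maintenance   # illustrative group name
    rules:
      - alert: PromscaleMaintenanceJobNotKeepingup
        expr: |
          min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h])
            > 2 * promscale_sql_database_metric_count
        for: 30m                  # illustrative
        labels:
          severity: warning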

Also, pinging @sumerman in case he knows the reason behind > 10.

harkishen avatar Nov 14 '22 09:11 harkishen

I agree, this should be changed to

(
    min_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[1h]) > 2 * promscale_sql_database_metric_count
)

Also, I think we should change min_over_time to avg_over_time. Why? min_over_time seems too strict here: if at any given point in the last 1h the number of uncompressed chunks is higher than expected, it will alert. Averaging over 30m should be fine.
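
For concreteness, the relaxed variant would read something like this (a sketch of the suggestion, not a tested rule):

(
    avg_over_time(promscale_sql_database_chunks_metrics_uncompressed_count[30m]) > 2 * promscale_sql_database_metric_count
)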

Also, pinging @sumerman in case he knows the reason behind > 10.

Thank you. As I have answered elsewhere, my intention when defining this metric was for it to go down to 0; 10 was a safety margin.

sumerman avatar Nov 14 '22 09:11 sumerman

@sumerman did we fix this?

ramonguiu avatar Dec 13 '22 22:12 ramonguiu

I expect https://github.com/timescale/promscale/pull/1794 to fix this when it lands

sumerman avatar Dec 15 '22 15:12 sumerman