
[Enhancement] WAL Disk Usage Metrics

Open • CtrlAltDft opened this issue 1 year ago • 1 comment

Who is this for and what problem do they have today?

Target Audience:

  • Engineers who manage and maintain AutoMQ clusters, particularly those deployed on AWS infrastructure using EBS volumes for Write-Ahead Log (WAL) storage.

Problem Statement:

  • Lack of Visibility into WAL Disk Usage:
    • AutoMQ uses EBS volumes mounted as block devices for WAL storage to optimize performance.
    • Standard disk usage monitoring tools cannot track free disk space on block devices that are not mounted with a traditional file system.
    • Administrators currently have limited metrics available, restricted to IOPS and read/write throughput, which do not provide insights into actual disk space utilization.
  • Operational Challenges:
    • Without accurate metrics on WAL disk usage, there is a risk of unexpected disk space exhaustion, which can lead to system crashes or data loss.
    • Capacity Planning Difficulties: Inability to forecast when additional storage is needed hampers proactive resource management.
    • Alerting Limitations: Lack of thresholds and alerts for disk usage prevents timely intervention before critical issues arise.

Why is solving this problem impactful?

  • Ensures System Reliability and Stability:
    • Monitoring WAL disk usage helps prevent service interruptions caused by full disks.
    • Enables proactive maintenance, reducing the risk of data loss or corruption.
  • Improves Operational Efficiency:
    • Provides administrators with the necessary insights to make informed decisions about scaling storage resources.
    • Facilitates capacity planning, ensuring that resources are allocated efficiently and cost-effectively.
  • Enhances Monitoring and Alerting Capabilities:
    • Allows integration with existing monitoring tools to set up alerts and notifications when disk usage reaches critical levels.
    • Empowers teams to respond quickly to potential issues, minimizing downtime.
  • Aligns with Best Practices:
    • Adheres to industry standards for system monitoring and observability.
    • Helps maintain high availability and performance of AutoMQ clusters.
  • Supports Autoscaling Efforts:
    • Accurate metrics are essential for implementing event-driven autoscaling, ensuring that scaling actions are based on reliable data.
    • Enhances the effectiveness of auto-balancing mechanisms by providing comprehensive system insights.

Additional Notes

  • Possible Solutions:
    • Expose WAL Disk Usage Metrics:
      • AutoMQ could provide built-in metrics for WAL disk utilization accessible via standard monitoring interfaces (e.g., JMX, Prometheus exporters).
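As a rough illustration of what such metrics could look like, here is a minimal sketch that renders WAL usage as gauges in the Prometheus text exposition format. The metric names (`wal_capacity_bytes`, `wal_used_bytes`, `wal_used_ratio`) and the class `WalUsageExposition` are hypothetical and are not existing AutoMQ metrics or APIs; the sketch also assumes the broker already knows how many WAL bytes are in use, since the operating system cannot report that for a raw block device.

```java
// Hypothetical sketch: none of these names are part of AutoMQ.
final class WalUsageExposition {

    /** Fraction of the WAL device currently occupied, in the range [0, 1]. */
    public static double usedRatio(long usedBytes, long capacityBytes) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacityBytes must be positive");
        }
        if (usedBytes < 0 || usedBytes > capacityBytes) {
            throw new IllegalArgumentException("usedBytes must be within [0, capacityBytes]");
        }
        return (double) usedBytes / capacityBytes;
    }

    /** Renders three gauges in the Prometheus text exposition format. */
    public static String render(long usedBytes, long capacityBytes) {
        double ratio = usedRatio(usedBytes, capacityBytes);
        StringBuilder sb = new StringBuilder();
        appendGauge(sb, "wal_capacity_bytes",
                "Total size of the WAL block device.", Long.toString(capacityBytes));
        appendGauge(sb, "wal_used_bytes",
                "WAL bytes written but not yet reclaimed.", Long.toString(usedBytes));
        appendGauge(sb, "wal_used_ratio",
                "Used WAL bytes divided by capacity.", Double.toString(ratio));
        return sb.toString();
    }

    // Each metric gets a HELP line, a TYPE line, and one sample line.
    private static void appendGauge(StringBuilder sb, String name, String help, String value) {
        sb.append("# HELP ").append(name).append(' ').append(help).append('\n');
        sb.append("# TYPE ").append(name).append(" gauge\n");
        sb.append(name).append(' ').append(value).append('\n');
    }

    public static void main(String[] args) {
        // Example: 5 GiB in use on a 20 GiB WAL volume.
        System.out.print(render(5L << 30, 20L << 30));
    }
}
```

An alert rule could then compare `wal_used_ratio` against a threshold chosen by the operator, which is the kind of alerting described under Problem Statement.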

CtrlAltDft • Sep 27 '24 01:09

Hi @CtrlAltDft, S3Stream uses WAL storage as a ring buffer. In my opinion, we don't need to monitor WAL usage because, by design, it is meant to be used up quickly.

daniel-y • Feb 09 '25 03:02
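For readers unfamiliar with the ring-buffer behavior mentioned in the reply, the following is a simplified model of how occupancy could be tracked. It is an assumption-laden sketch, not S3Stream's actual implementation: the class `WalRingUsage`, the notion of a logical append offset and trim offset, and all method names are hypothetical. In this model, writes wrap around the device, so "bytes ever written" quickly exceeds capacity, while the untrimmed span between the two offsets is the quantity that can actually run out.

```java
// Hypothetical model of a WAL used as a ring buffer; not S3Stream code.
final class WalRingUsage {
    private final long capacityBytes;
    private long appendOffset; // logical offset of the next write, only grows
    private long trimOffset;   // logical offset below which space is reclaimable

    public WalRingUsage(long capacityBytes) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacityBytes must be positive");
        }
        this.capacityBytes = capacityBytes;
    }

    /** Records a write; fails if the untrimmed span would exceed capacity. */
    public void append(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must be non-negative");
        }
        if (usedBytes() + bytes > capacityBytes) {
            throw new IllegalStateException("WAL full: trim before appending");
        }
        appendOffset += bytes;
    }

    /** Marks everything below the given logical offset as reclaimable. */
    public void trimTo(long offset) {
        if (offset < trimOffset || offset > appendOffset) {
            throw new IllegalArgumentException("offset outside [trimOffset, appendOffset]");
        }
        trimOffset = offset;
    }

    /** Untrimmed bytes: the span that still occupies the device. */
    public long usedBytes() {
        return appendOffset - trimOffset;
    }

    public long appendOffset() {
        return appendOffset;
    }

    /** Maps a logical offset to its position on the device (wraps around). */
    public long physicalPosition(long logicalOffset) {
        return logicalOffset % capacityBytes;
    }

    public static void main(String[] args) {
        WalRingUsage wal = new WalRingUsage(100);
        wal.append(80);  // 80 bytes occupied
        wal.trimTo(50);  // first 50 bytes reclaimed
        wal.append(60);  // write wraps past the end of the device
        System.out.println("used=" + wal.usedBytes()
                + " position=" + wal.physicalPosition(wal.appendOffset()));
    }
}
```

Under this model, a usage metric would report `usedBytes()` rather than the device's raw write position, since the write position cycles through the whole device regardless of load.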