[ci] Upload disk and snapcraft diagnostics

Open Copilot opened this issue 1 month ago • 4 comments

Add disk usage diagnostics to capture disk space information before heavy build steps:

  • [x] Analyze the linux.yml workflow structure to identify where to insert diagnostics
  • [x] Add disk diagnostics step before the main Build step (line 132)
  • [x] Add disk diagnostics step before the "Build and verify the snap" step (line 308)
  • [x] Add single artifact upload step at the end to capture all diagnostics
  • [x] Verify the YAML syntax is correct
  • [x] Run security checks (CodeQL) - no issues found
  • [x] Address review feedback
  • [x] All tasks completed

Summary

Added disk usage diagnostics to .github/workflows/linux.yml to help diagnose the failing job https://github.com/canonical/multipass/actions/runs/19534052585/job/55935120817, which may be caused by disk space exhaustion.

Changes Made

Two diagnostic checkpoints added:

  1. Before Build step (line 132): Captures disk state before initial snapcraft build
  2. Before Snap Build step (line 308): Captures disk state before full snap creation

Each checkpoint captures:

  • df -h - disk usage in human-readable format
  • df -i - inode usage
  • du on top directories (/root, /home, /tmp, /var, /usr, runner temp/workspace)
  • Top 50 largest entries on / (single filesystem only)
  • All files >50M with their sizes (sorted numerically)
  • Snapcraft logs if available (full log from /home/runner/.local/state/snapcraft/log/)
  • GitHub Actions warning if free space < 2GB (first checkpoint only)
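
As a rough illustration of what one checkpoint collects, the list above could be sketched as a small Bash function. This is a minimal sketch, not the step from the PR: the function name `dump_disk_diagnostics`, its arguments, and the sample file are assumptions made for the example, and the demo scans a temporary directory rather than / so that it finishes quickly.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not from the PR): write one checkpoint to OUT.
# Usage: dump_disk_diagnostics OUT ROOT [DIR...]
dump_disk_diagnostics() {
  local out="$1" root="$2"
  shift 2
  {
    echo "==== df -h ===="
    df -h || true
    echo "==== df -i ===="
    df -i || true
    echo "==== du top dirs ===="
    if [ "$#" -gt 0 ]; then
      du -sh "$@" 2>/dev/null || true
    fi
    echo "==== top 50 largest entries on $root (single filesystem) ===="
    du -ahx "$root" 2>/dev/null | sort -rh | head -n 50 || true
    echo "==== files >50M ===="
    # ls -l prints sizes in bytes (column 5), so a numeric sort is enough
    find "$root" -xdev -type f -size +50M -exec ls -l {} + 2>/dev/null \
      | sort -k5,5n || true
  } > "$out"
}

# Demo on a throwaway directory; in CI the root would be / and the
# directories would be /root /home /tmp /var /usr and the runner temp dir.
workdir="$(mktemp -d)"
head -c 4096 /dev/zero > "$workdir/sample.bin"
dump_disk_diagnostics "$workdir/diag.txt" "$workdir" "$workdir"
cat "$workdir/diag.txt"
```

Every command that can fail is guarded with `|| true`, so a permission error during the scan does not abort the script under `set -e`.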

Implementation details:

  • Uses if: always() to ensure diagnostics run even on job failure
  • Uses continue-on-error: true to prevent diagnostic failures from failing the job
  • Single upload step at the end captures all diagnostic files
  • Artifact name: runner-disk-diagnostics-${{ matrix.build-type }}
  • Artifacts uploaded via actions/upload-artifact@v4 with wildcard pattern
  • Removed the -x flag (set -euxo pipefail became set -euo pipefail) to reduce log noise
  • Removed empty echo lines between sections
  • Changed ls -lh to ls -l and sort by size numerically for proper sorting
  • Removed snapcraft log capture from first checkpoint (logs don't exist yet)
  • Removed disk space warning from second checkpoint (not useful after build)
  • Changed snapcraft log from tail to full cat in second checkpoint
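
The switch from ls -lh to ls -l with a numeric sort can be illustrated with a small, self-contained sketch. The helper name `list_large_files` and the demo file names are made up for this example; the point is only that ls -l prints exact byte counts in column 5, which sort can order numerically.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: list regular files under DIR matching a find -size
# expression (e.g. +50M), smallest first, keyed on the byte count that
# ls -l prints in column 5.
list_large_files() {
  local dir="$1" min="$2"
  find "$dir" -xdev -type f -size "$min" -exec ls -l {} + 2>/dev/null \
    | sort -k5,5n || true
}

# Demo data: sizes chosen so that lexical and numeric order disagree
# ("1000000" sorts before "3000" as text, but after it as a number).
demo="$(mktemp -d)"
head -c 3000 /dev/zero > "$demo/small.bin"
head -c 20000 /dev/zero > "$demo/medium.bin"
head -c 1000000 /dev/zero > "$demo/large.bin"

listing="$(list_large_files "$demo" +1k)"
echo "$listing"
```

With the numeric key, the largest file ends up on the last line, so a trailing `tail -n 50` would keep the 50 biggest entries.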

Security

  • CodeQL analysis completed: No security issues found
  • All shell commands use proper error handling with || true to prevent failures
  • File operations safely handle missing files/directories
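
To make the `|| true` pattern concrete, here is a minimal sketch of how it interacts with `set -euo pipefail`. The path /nonexistent-dir-for-demo is a stand-in invented for the example, not a path from the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

out="$(mktemp)"

# du exits non-zero for a missing directory; "|| true" keeps set -e
# from aborting the script.
du -sh /nonexistent-dir-for-demo 2>/dev/null >> "$out" || true

# With pipefail, a failure anywhere in a pipeline fails the whole pipeline,
# so the guard goes at the end of the pipeline.
find /nonexistent-dir-for-demo -type f 2>/dev/null | sort | head -n 5 >> "$out" || true

# Optional inputs are checked explicitly before they are read.
log_dir="/nonexistent-dir-for-demo/log"   # stand-in path for this demo
if [ -d "$log_dir" ]; then
  cat "$log_dir"/*.log >> "$out" || true
fi

echo "diagnostics finished" >> "$out"
cat "$out"
```

The script reaches its last line even though every diagnostic command failed, which is the behavior the bullets above describe.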
Original prompt

Problem: The failing CI run (job 55935120817) may be caused by the runner running out of disk space while building snapcraft/Flutter artifacts. GitHub-hosted runners do not publish ephemeral-disk metrics per-run, so we need to record disk usage at runtime to confirm or rule out space exhaustion.

Change requested: Add diagnostic steps to .github/workflows/dynamic-ci.yml that capture disk usage and large files right before the heavy build steps (snapcraft / flutter build) and upload them as an artifact so they can be inspected after the run.

Target file: .github/workflows/dynamic-ci.yml (use ref 9c630ed129a024d6d97ebf1f50d9162c9053e8a5 to reference current workflow) Link: https://github.com/canonical/multipass/blob/9c630ed129a024d6d97ebf1f50d9162c9053e8a5/.github/workflows/dynamic-ci.yml

What to add: Insert the following two steps immediately before the steps that run snapcraft / the heavy build (or at minimum, before the failing build step). Use if: always() so the diagnostics are recorded on both success and failure; use continue-on-error inside the step to avoid failing the job because of diagnostics.

YAML snippet to add:

    - name: Dump runner disk diagnostics
      if: always()
      run: |
        set -euxo pipefail
        OUT=runner-disk-diagnostics.txt
        echo "==== df -h ====" > "$OUT"
        df -h >> "$OUT" || true
        echo "" >> "$OUT"
        echo "==== df -i ====" >> "$OUT"
        df -i >> "$OUT" || true
        echo "" >> "$OUT"
        echo "==== du top dirs ====" >> "$OUT"
        du -sh /root /home /tmp /var /usr "${RUNNER_TEMP:-/tmp}" "${RUNNER_WORKSPACE:-/github/workspace}" 2>/dev/null >> "$OUT" || true
        echo "" >> "$OUT"
        echo "==== top 50 largest entries on / (no other FS) ====" >> "$OUT"
        du -ahx / 2>/dev/null | sort -rh | head -n 50 >> "$OUT" || true
        echo "" >> "$OUT"
        echo "==== find files >50M ====" >> "$OUT"
        find / -xdev -type f -size +50M -exec ls -lh {} \; 2>/dev/null | sort -k5 -h | tail -n 50 >> "$OUT" || true
        echo "" >> "$OUT"
        # warn if available space < 2GB
        FREE_KB=$(df --output=avail -k / | tail -n1 | tr -d ' ')
        if [ -n "$FREE_KB" ] && [ "$FREE_KB" -lt $((2*1024*1024)) ]; then
          echo "##[warning] Less than 2GB available on / ($(($FREE_KB/1024)) MB)" >> "$OUT"
        fi
        # capture snapcraft log if present (log path seen in failing job)
        if ls /home/runner/.local/state/snapcraft/log/snapcraft-*.log 1> /dev/null 2>&1; then
          echo "" >> "$OUT"
          echo "==== snapcraft log tail ====" >> "$OUT"
          tail -n 400 /home/runner/.local/state/snapcraft/log/snapcraft-*.log >> "$OUT" || true
        fi
      continue-on-error: true

    - name: Upload runner disk diagnostics
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: runner-disk-diagnostics
        path: runner-disk-diagnostics.txt

Notes / rationale:

  • Running these diagnostics will let you confirm whether the runner ran out of disk/inodes before or during the build that failed copying libflutter_linux_gtk.so.
  • Place the steps before the snapcraft build or heavy Flutter build step, and keep if: always() on them so that the upload still runs when the build fails early.
  • The script captures df, inode usage, top directories, largest files, and any snapcraft logs referenced in the failing job logs.
  • The artifact will be retained with the run and can be downloaded for inspection.

Deliverable: Create a branch, add the snippet to .github/workflows/dynamic-ci.yml, and open a pull request titled: "ci: add disk usage diagnostics to dynamic-ci.yml". The PR should include the exact YAML insertion and a short description linking to the failing run: https://github.com/canonical/multipass/actions/runs/19534052585/job/55935120817

If you want I can also:

  • Add an automatic warning/early-fail when free space is below a configurable threshold (example uses 2GB), or
  • Limit ccache size (ccache -M 1G) and report ccache size concurrently, or
  • Place diagnostics both before and after particular steps to see growth during the job.
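
The first option, a configurable free-space threshold, might look roughly like the sketch below. It is only an illustration: the helper `below_threshold` and the variable `MIN_FREE_GB` are names assumed for this example, and the message is printed with the `::warning::` workflow command so that GitHub Actions would show it as an annotation when the script runs inside a step.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed when FREE_KB is below MIN_GB gigabytes.
# df reports 1 KiB blocks with -k, so 1 GB is 1024 * 1024 of them.
below_threshold() {
  local free_kb="$1" min_gb="$2"
  [ "$free_kb" -lt $(( min_gb * 1024 * 1024 )) ]
}

# Assumed variable name; defaults to the 2GB used in the example above.
MIN_FREE_GB="${MIN_FREE_GB:-2}"

# --output=avail is a GNU df option; fall back to an empty value elsewhere.
free_kb="$(df --output=avail -k / 2>/dev/null | tail -n 1 | tr -d ' ' || true)"

if [ -n "$free_kb" ] && below_threshold "$free_kb" "$MIN_FREE_GB"; then
  echo "::warning::Less than ${MIN_FREE_GB}GB available on / ($(( free_kb / 1024 )) MB)"
  # For an early fail instead of a warning, exit 1 here.
fi
```

Keeping the comparison in its own function makes the threshold arithmetic easy to check without touching the real disk.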

Please confirm you want me to create the PR with this change in canonical/multipass.

This pull request was created as a result of the prompt above, from Copilot chat.

Copilot, Nov 20 '25 16:11

@copilot please address the review above.

ricab, Nov 20 '25 16:11

@copilot please address the review above.

All review comments addressed in commit 1fc922a.

Copilot, Nov 20 '25 16:11

Codecov Report

All modified and coverable lines are covered by tests. Project coverage is 89.47%. Comparing base (43d1350) to head (1fc922a). Warning: the report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4520   +/-   ##
=======================================
  Coverage   89.47%   89.47%           
=======================================
  Files         243      243           
  Lines       13877    13877           
=======================================
  Hits        12416    12416           
  Misses       1461     1461           

codecov[bot], Nov 20 '25 20:11

We don't seem to be nearing space limits today.

Keeping this around for later.

ricab, Nov 21 '25 11:11