
Build: Free disk space before running action in Spark CI

Open · manuzhang opened this issue 1 year ago

I've seen Spark CI failures caused by running out of disk space:

org.apache.iceberg.spark.extensions.TestCopyOnWriteMerge > testMergeWithConcurrentTableRefresh[catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}, format = parquet, vectorized = true, distributionMode = none, branch = test] FAILED
    java.lang.AssertionError: 
    Expecting actual throwable to be an instance of:
      java.lang.IllegalStateException
    but was:
      org.apache.spark.SparkException: Writing job aborted
    	at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobAbortedError(QueryExecutionErrors.scala:767)
    	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:409)
    	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:353)
    	...(41 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
        at org.apache.iceberg.spark.extensions.TestCopyOnWriteMerge.testMergeWithConcurrentTableRefresh(TestCopyOnWriteMerge.java:148)

org.apache.iceberg.spark.extensions.TestCopyOnWriteMerge > testMergeWithMultipleUpdatesForTargetRowSmallTargetLargeSource[catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}, format = parquet, vectorized = true, distributionMode = none, branch = test] FAILED
Error: a.lang.AssertionError: [Should 2024-02-18T05:06:00.9975674Z ##[error]No space left on device : '/home/runner/runners/2.313.0/_diag/pages/943a8a72-7ff9-49d1-b4cb-09d7db8a44a1_80d440e4-f54b-5560-0192-53fee83660bc_1.log'

This PR attempts to free unneeded disk space before the tests run by using the free-disk-space action.

With this action enabled, a single Spark CI build freed 27 GiB:

Run jlumbroso/free-disk-space@…
Run # ======
================================================================================
BEFORE CLEAN-UP:

$ dh -h /

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   56G   17G  77% /
...
================================================================================
AFTER CLEAN-UP:

$ dh -h /

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        73G   32G   41G  45% /
...
overall:

********************************************************************************
=> Saved 27GiB
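
For reference, wiring the cleanup in looks roughly like the sketch below. This is a minimal illustration, not the exact diff in this PR: the workflow file, job name, and step placement are assumptions, and the input names/values follow the action's README as I understand it, so double-check them there.

    # Hypothetical excerpt from a Spark CI workflow (file and job name are illustrative only)
    jobs:
      spark-tests:
        runs-on: ubuntu-22.04
        steps:
          # Free runner disk space before checking out and running the Spark tests
          - uses: jlumbroso/free-disk-space@main   # pin to a released tag in real use
            with:
              tool-cache: false     # keep the hosted tool cache
              android: true         # drop pre-installed Android SDKs
              dotnet: true          # drop pre-installed .NET SDKs
              haskell: true         # drop GHC/Stack
              large-packages: true  # purge large apt packages
              docker-images: true   # prune pre-pulled Docker images
              swap-storage: true    # remove the swap file
          - uses: actions/checkout@v4
          # ... existing Gradle build and Spark test steps follow

If the clean-up overhead matters, the individual inputs can be toggled to trade freed space against added CI time.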

manuzhang · Feb 23 '24 15:02

@Fokko and @singhpk234 please take a look at your convenience.

manuzhang · Feb 26 '24 02:02

@manuzhang: Very nice to see this addition. Have we benchmarked how long the clean-up takes and the overall increase in CI time?

ajantha-bhat · Feb 26 '24 12:02

@ajantha-bhat It took around two minutes per action run. I suppose the jobs run in parallel, so that should also be the overall increase in CI time? [screenshot: CleanShot 2024-02-26 at 21 40 35@2x]

manuzhang · Feb 26 '24 13:02

This still seems to be an issue in the latest runs: https://github.com/apache/iceberg/actions/runs/8200693411/job/22427953001

nastra · Mar 08 '24 10:03

Is it due to running out of disk space? The log is no longer available.

manuzhang · Mar 08 '24 15:03

Yep, it was due to disk space. So maybe something in Iceberg Spark 3.3 has a memory leak and that's how it surfaces.

nastra · Mar 08 '24 15:03