
noisy --fail-fast logs

Open taylorterwin opened this issue 1 year ago • 0 comments

A user has reported that using the --fail-fast flag in dbt Cloud scheduled job runs produces extremely noisy logging, which makes it difficult to surface the actual error and underlying issue.

  • Thread concurrency is set to 23
  • Multiple models are running at the same time
  • Fail-fast means the run is terminated as soon as a single error is hit

The logging is interesting: we can see the databricks adapter going through and cancelling the connections, while queries that have already started are still talking to the server even though their connection has been cancelled. When that happens, this error occurs:
: Error during request to server: RESOURCE_DOES_NOT_EXIST: Command 01ef6e95-db69-140e-a8f1-d4436107428d does not exist.
Error properties: attempt=1/30, bounded-retry-delay=None, elapsed-seconds=0.21970534324645996/900.0, error-message=RESOURCE_DOES_NOT_EXIST: Command 01ef6e95-db69-140e-a8f1-d4436107428d does not exist., http-code=404, method=GetOperationStatus, no-retry-reason=non-retryable error, original-exception=RESOURCE_DOES_NOT_EXIST: Command 01ef6e95-db69-140e-a8f1-d4436107428d does not exist., query-id=b'\x01\xefn\x95\xdbi\x14\x0e\xa8\xf1\xd4Ca\x07B\x8d', session-id=None
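To make the pattern concrete, here is a minimal, self-contained Python sketch (not the adapter's actual code; the logger name, model names, and timings are made up) of how one real failure plus fail-fast cancellation fans out into one near-identical message per in-flight thread:

```python
# Illustrative sketch only: one "model" fails, fail-fast cancels the rest of
# the run, and every worker whose server-side command was cancelled logs the
# same RESOURCE_DOES_NOT_EXIST-style error on its next status poll.
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO, format="%(threadName)s %(message)s")
log = logging.getLogger("sketch")

cancelled = threading.Event()  # stands in for the adapter cancelling connections


def run_model(i: int) -> None:
    if i == 0:
        log.error("Database Error in model model_0 (the failure we actually care about)")
        cancelled.set()  # --fail-fast: stop the whole run
        return
    while not cancelled.is_set():  # the other 22 models are still mid-flight
        time.sleep(0.01)
    # Their server-side command is gone, so the next status poll 404s:
    log.error("Error during request to server: RESOURCE_DOES_NOT_EXIST: "
              "Command for model_%d does not exist.", i)


with ThreadPoolExecutor(max_workers=23) as pool:
    for i in range(23):
        pool.submit(run_model, i)
# Output: 1 useful error buried under ~22 near-identical cancellation errors.
```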

In addition, Apache Spark-specific logging appears:

$anonfun$analyzeQuery$1(SparkExecuteStatementOperation.scala:541)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getOrCreateDF(SparkExecuteStatementOperation.scala:527)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.analyzeQuery(SparkExecuteStatementOperation.scala:541)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$execute$5(SparkExecuteStatementOperation.scala:633)
	at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:532)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$execute$1(SparkExecuteStatementOperation.scala:633)
	... 43 more
, operation-id=01ef6e95-cea5-18b1-8077-63b37a785969

dbt-databricks version: 1.8.5post2+6b29d329ae8a3ce6bc066d032ec3db590160046c
dbt version: versionless - 2024.9.239
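For reference, a local reproduction (outside dbt Cloud) would roughly use the same flags via dbt's programmatic entry point. The sketch below assumes the working directory is a dbt project whose profile already targets a Databricks warehouse; only the thread count and --fail-fast are taken from the report.

```python
# Rough local-reproduction sketch using dbt's programmatic runner (dbt-core >= 1.5).
# Everything other than the flag values is a placeholder assumption.
from dbt.cli.main import dbtRunner

res = dbtRunner().invoke(["run", "--threads", "23", "--fail-fast"])
print("run succeeded:", res.success)
```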

Expected behavior

From the user: "I had assumed that was because we were using multiple threads, but I would expect it to fail nicely and gracefully rather than produce a log consisting of 500 identical messages, sometimes without even surfacing the original cause of the first model failure."
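For what it's worth, on a local run (this would not help with dbt Cloud's hosted logs) the repeated messages could in principle be collapsed with a stdlib logging filter. The sketch below is a workaround idea only, not a fix for the adapter; it keys on the error text rather than any connector internals.

```python
# Sketch: collapse the repeated cancellation errors at the handler level.
import logging


class CollapseCancelledCommands(logging.Filter):
    """Pass the first RESOURCE_DOES_NOT_EXIST record, suppress later ones."""

    def __init__(self) -> None:
        super().__init__()
        self._seen = False

    def filter(self, record: logging.LogRecord) -> bool:
        if "RESOURCE_DOES_NOT_EXIST" in record.getMessage():
            if self._seen:
                return False  # drop the duplicated cancellation noise
            self._seen = True
        return True  # everything else (including the real failure) passes


logging.basicConfig(level=logging.INFO)
# Handler-level filters see records from every logger that propagates to root,
# so this collapses the repeats no matter which connector logger emits them.
for handler in logging.getLogger().handlers:
    handler.addFilter(CollapseCancelledCommands())
```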

taylorterwin, Sep 23 '24 13:09