SQLMesh hangs during plan execution when a composite primary key column is NULL
First of all thank you for releasing sqlglot and sqlmesh as open source projects! Big fan of both I am using latest version of SQLMesh, python 3.11, database is Postgres
I have noticed when I issue the sqlmesh plan command and do apply and backfill, the plan hangs indefinitely I noticed this is because one of the columns in the composite PK is null, when I use the insert query from backfill directly on Postgres then I get the exception. I was expecting the same exception to be intercepted by SQLMesh and fail fast
Please let me know if this is a mistake on my end or have others noticed this issue as well. Also what would be the workaround or fix for this?
Hey! What exception are you getting? Can you share the logs from the execution (./logs folder)?
@harnitbakshi do you happen to use batch_size on that model and run the batches concurrently?
@izeigerman I am using batch_size=1 and batch_concurrency=1. I have tried bigger batch sizes but the behavior (hang) is the same. Could it be that a thread gets the exception but fails to report to main thread and exits ?
@harnitbakshi by the way, it doesn't look like you actually attached the logs
I am adding a public repo and will reproduce it there and will share logs if I can reproduce
@izeigerman I have added a repo where we can reproduce the hang of sqlmesh: https://github.com/harnitbakshi/sqlmesh-hang, logs are under the logs folder in the repo
I have added a readme there on how to run the project
When a null value in one of the columns in the composite key is encountered(docker will show postgres with error:
But sqlmesh does not fail fast and hangs on this insert:
Actually what I noticed is this hang comes from the concurrency.py class, as raise_on_error is false, not sure why we dont fail fast here? But the _skip_next_nodes seems to be recursing from the start of the backfill which is 2025-03-03 in my case to 2025-05-13. Which is a lot of intervals and the recursion explodes
def _process_node(self, node: H, executor: Executor) -> None:
try:
self.fn(node)
with self._unprocessed_nodes_lock:
self._unprocessed_nodes_num -= 1
self._submit_next_nodes(executor, node)
except Exception as ex:
error = NodeExecutionFailedError(node)
error.__cause__ = ex
if self.raise_on_error:
self._finished_future.set_exception(error)
return
with self._unprocessed_nodes_lock:
self._unprocessed_nodes_num -= 1
self._node_errors.append(error)
self._skip_next_nodes(node)
Any advice on above would be appreciated!
Hey @harnitbakshi , this reproducible example is amazing. Also, great job identifying the root cause of the issue.
I've refactored the implementation to get rid of recursive calls in that section. It still takes a bit of time to calculate skipped nodes due to the number of batches (~20s), but it's much faster still and it doesn't hang any more.