sqlmesh icon indicating copy to clipboard operation
sqlmesh copied to clipboard

SQLMesh hangs during plan execution when a composite primary key column is NULL

Open harnitbakshi opened this issue 8 months ago • 3 comments

First of all thank you for releasing sqlglot and sqlmesh as open source projects! Big fan of both I am using latest version of SQLMesh, python 3.11, database is Postgres

I have noticed when I issue the sqlmesh plan command and do apply and backfill, the plan hangs indefinitely I noticed this is because one of the columns in the composite PK is null, when I use the insert query from backfill directly on Postgres then I get the exception. I was expecting the same exception to be intercepted by SQLMesh and fail fast

Please let me know if this is a mistake on my end or have others noticed this issue as well. Also what would be the workaround or fix for this?

harnitbakshi avatar Apr 29 '25 11:04 harnitbakshi

Hey! What exception are you getting? Can you share the logs from the execution (./logs folder)?

izeigerman avatar Apr 30 '25 20:04 izeigerman

@harnitbakshi do you happen to use batch_size on that model and run the batches concurrently?

izeigerman avatar May 02 '25 15:05 izeigerman

@izeigerman I am using batch_size=1 and batch_concurrency=1. I have tried bigger batch sizes but the behavior (hang) is the same. Could it be that a thread gets the exception but fails to report to main thread and exits ?

harnitbakshi avatar May 03 '25 22:05 harnitbakshi

@harnitbakshi by the way, it doesn't look like you actually attached the logs

izeigerman avatar May 05 '25 20:05 izeigerman

I am adding a public repo and will reproduce it there and will share logs if I can reproduce

harnitbakshi avatar May 07 '25 23:05 harnitbakshi

@izeigerman I have added a repo where we can reproduce the hang of sqlmesh: https://github.com/harnitbakshi/sqlmesh-hang, logs are under the logs folder in the repo

I have added a readme there on how to run the project

When a null value in one of the columns in the composite key is encountered(docker will show postgres with error:

Image

But sqlmesh does not fail fast and hangs on this insert:

Image

Actually what I noticed is this hang comes from the concurrency.py class, as raise_on_error is false, not sure why we dont fail fast here? But the _skip_next_nodes seems to be recursing from the start of the backfill which is 2025-03-03 in my case to 2025-05-13. Which is a lot of intervals and the recursion explodes

def _process_node(self, node: H, executor: Executor) -> None:
        try:
            self.fn(node)

            with self._unprocessed_nodes_lock:
                self._unprocessed_nodes_num -= 1
                self._submit_next_nodes(executor, node)
        except Exception as ex:
            error = NodeExecutionFailedError(node)
            error.__cause__ = ex

            if self.raise_on_error:
                self._finished_future.set_exception(error)
                return

            with self._unprocessed_nodes_lock:
                self._unprocessed_nodes_num -= 1
                self._node_errors.append(error)
                self._skip_next_nodes(node)

harnitbakshi avatar May 12 '25 21:05 harnitbakshi

Any advice on above would be appreciated!

harnitbakshi avatar May 15 '25 04:05 harnitbakshi

Hey @harnitbakshi , this reproducible example is amazing. Also, great job identifying the root cause of the issue.

I've refactored the implementation to get rid of recursive calls in that section. It still takes a bit of time to calculate skipped nodes due to the number of batches (~20s), but it's much faster still and it doesn't hang any more.

Image

izeigerman avatar May 26 '25 17:05 izeigerman