Inform the user about the explosion in the number of data intervals
I am currently struggling to understand why my project with 80 models takes a long time (>10 minutes) and consumes a lot of resources (resource monitor lists RAM at ~14 GB idle, which rockets up to my 32 GB limit) in the BackfillStage of planning. I use an OK-spec 2023 Windows laptop for this.
This example is from running sqlmesh plan after sqlmesh destroy - so it's a worst case. However, evaluating sqlmesh plan dev after making a certain number of changes can also cause the planning stage to take a very long time.
Since the planning stage emits no logs, it's hard to understand what happens and why. For example, as of now I can't really say which models are taking a long time to process in the planning stage, which keeps me from knowing where to start evaluating my models.
I suspect this is due to model lineage being complicated, models being used in several downstream models, and so on - but with the current logging during plan evaluation, I don't really know.
Hm, logging is definitely expected there. At the very least it absolutely should log the SQL that was executed. You can also try debug mode by running sqlmesh --debug plan ... or by setting the SQLMESH_DEBUG=1 environment variable. This should produce a lot more logs.
Hi @izeigerman ! I believe I have also tried with the debug argument in the CLI, but the planning stage itself didn't emit any further logs during the plan evaluation (which in the image takes roughly 12 minutes) - perhaps I am misremembering?
I will try after kids are in bed.. 😜
So the debug logs (now on my M1 Mac, which is way faster than my work laptop..) don't yield more information about what is happening within the BackfillStage plan evaluation. There is still no logging during a 40-second interval here (although obviously 40 seconds instead of 12 minutes is way more palatable... 🤣):
Activity Monitor (which is severely restricted on my work laptop...) reveals the RAM usage to be quite hefty, and it peaked beyond this too. During the actual backfilling process the job sits at ~862 MB, so RAM usage really rockets during this evaluation step. Logging would be helpful to understand why! ⭐️
Note on RAM usage: this project runs against an Azure SQL database where, apart from a few small seed tables, everything is purely queries against the database itself. As far as I understand, no actual data is pulled down to sqlmesh (apart from state).
@MarcusRisanger do you happen to have models in there with a start date far into the past and the cron cadence something like "run every 5 minutes"? I think something like this could cause an explosion of data intervals, leading to significant slowdowns and high memory usage.
After some investigation it turns out you are right on the money @izeigerman 🚀 I reworked my partitioning macros, pushed the start date from 1970 to 2025, halved my (aggressive) cron settings, and now it flies. I had always assumed that the intervals would be calculated based on start/end and the cron setting (and that the cron setting was just a way to check whether enough time had elapsed since the last run, i.e. whether to evaluate the model or not), but the slowdowns make sense if a snapshot interval object is generated for every valid interval between start and end.
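To put rough numbers on it (a back-of-envelope sketch, assuming one interval per cron tick and a hypothetical 5-minute cron - my actual old cadence was similar but not exactly this):

```python
from datetime import datetime, timedelta

def approx_interval_count(start: datetime, end: datetime, cadence: timedelta) -> int:
    # One data interval per cron tick between the model's start date and "now".
    return int((end - start) / cadence)

now = datetime(2025, 1, 1)
cadence = timedelta(minutes=5)  # hypothetical aggressive cron

old = approx_interval_count(datetime(1970, 1, 1), now, cadence)
new = approx_interval_count(datetime(2025, 1, 1), now, cadence)

print(f"start 1970: ~{old:,} intervals per model")  # roughly 5.8 million
print(f"start 2025: ~{new:,} intervals per model")  # essentially zero so far
```

With ~80 models, the old setup would imply something on the order of a few hundred million interval objects across the project, which would explain the RAM behavior during planning.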
Perhaps the debug logs should print how many data intervals were generated per model, or something similar? That could be helpful in identifying this behavior!
@MarcusRisanger I agree with you. Actually, I think it would be even better if we could detect a possible explosion of intervals in advance and warn the user directly in the console.
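Something along these lines (purely illustrative, not actual SQLMesh code - the helper name and threshold are made up):

```python
from datetime import datetime, timedelta
import warnings

# Hypothetical threshold; the real cutoff would need to be picked empirically.
INTERVAL_WARN_THRESHOLD = 100_000

def warn_on_interval_explosion(
    model_name: str, start: datetime, end: datetime, cadence: timedelta
) -> None:
    # Estimate how many data intervals the model's start/cron settings imply
    # and warn in the console before planning if it looks unreasonably large.
    estimated = int((end - start) / cadence)
    if estimated > INTERVAL_WARN_THRESHOLD:
        warnings.warn(
            f"Model '{model_name}' would generate ~{estimated:,} data intervals "
            f"(start={start:%Y-%m-%d}, cadence={cadence}). Consider a later start "
            "date or a coarser cron to keep planning fast."
        )
```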