[BUG] Failure to communicate with tenant in West US
SynapseML version
SynapseML 1.0.8-spark3.5, as bundled with Fabric Runtime 1.3 (the failing class is com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader)
System information
- Language version: Python 3.11, Scala 2.12
- Spark version: 3.5
- Spark platform: Fabric Runtime 1.3
Describe the problem
This library (SynapseML) is causing problems inside of Fabric: it appears to be invoked while executing Spark SQL statements against a semantic model (com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader).
We have already turned off the automatic ML logging for experiments and models. (That had caused problems for us in the past; hopefully turning it off is not itself a problem.)
The errors in my Spark job are meaningless and seem to be unrelated to the actual work I'm doing. They appear to be related to some perfunctory interaction with our Fabric tenant hosted in West US.
Here are the details:
```
Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 8) (vm-a9200333 executor 1): java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net/v1.0/myOrg/internalMetrics/query)
    at com.microsoft.azure.synapse.ml.powerbi.PBISchemas.post(PBISchemas.scala:100)
    at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.$anonfun$executeQuery$1(PBIMeasurePartitionReader.scala:107)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:163)
    at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:160)
    at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.logVerb(PBIMeasurePartitionReader.scala:17)
    at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.executeQuery(PBIMeasurePartitionReader.scala:105)
    at com.microsoft.azure.synapse.ml.powerbi.measure.PBIMeasurePartitionReader.init(PBIMeasurePartitionReader.scala:142)
    at com.microsoft.azure.synapse.ml.powerbi.measure.PBIReaderFactory.createReader(PBIMeasureScan.scala:26)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:84)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
```
Here is a screenshot of the query and the error:
Notice that I'm simply using "semantic-link" to run a query against a PBI dataset. I'm guessing that ~95% of the work is performed on the driver.
I'm hoping to get some support here; the error seems more related to this project than to Fabric itself. Otherwise I will wait a couple of weeks for Mindtree (pro support) to respond, and in the end they would probably need help from this community to understand the behavior of SynapseML in Fabric.
Any tips would be very much appreciated.
Code to reproduce issue
```sql
%%sql
SELECT
  `Fiscal Week[Fiscal Week]`,
  `Random[Code]`,
  -- SUM(PriceMbfUsd)
  SUM(`USD Price MBF`),
  SUM(`USD Price MSF`)
FROM
  pbi.RandomLengthModel._Metrics
WHERE
  `Fiscal Week[Fiscal Year Number]` = 2025
  AND `Fiscal Week[Fiscal Week Number]` = 2
GROUP BY
  `Fiscal Week[Fiscal Week]`,
  `Random[Code]`
```
Other info / logs
None
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [ ] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [ ] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [ ] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [x] integrations/synapse: Azure Synapse integrations
- [x] integrations/azureml: Azure ML integrations
- [ ] integrations/databricks: Databricks integrations
Hi @eisber
Sorry to bother you here, but Semantic Link introduces an odd dependency on SynapseML, and I was wondering whether you have any knowledge of this dependency stack.
I'm guessing that the problem I'm facing is a bug in Semantic Link for Spark (SparkLink). The relevant code is not available on the internet. ... If my problem were in SynapseML itself, then I'm guessing I would be able to find the call stacks here in this community.
The problem may be related to the fact that our tenant lives in West US and the capacity lives in North Central. We are getting some basic timeout/connectivity issues. This is not an intermittent issue. I have a ticket open, but I'm worried that the Mindtree folks will take weeks to set up a repro and contact you.
I don't think the scenario involves components that are still in "preview".
can you find a RAID (root activity id) in the logs? how long does the same query take if you use sempy evaluate_measure?
Hi @eisber I can ask the engineer for his RAID. I was able to send a repro over to the Mindtree side of things.
The full case is reported with the following title and Mindtree case number.
Spark job failing when using semantic link - TrackingID#2502110040012091
I don't think we have created an ICM with Microsoft yet.
Here is an example of a spark native connector query that errors (seemingly because of synapse.ml.powerbi):
The main problems appear when introducing "WHERE" clauses. That part of the query appears to be parsed for syntax, but it doesn't seem to have any impact on the SQL Profiler queries seen in the PBI dataset. Moreover, in some cases I can omit the "WHERE" clause as a way to avoid the error messages.
FYI, here is a comparable DAX query that works great when it is crafted by hand.
Notice it should take ~5 ms and return under 200 rows.
I'm having a hard time understanding the behavior of this "spark native connector", and I can't distinguish the functionality that works reliably from the functionality that is broken. My biggest concern is that "WHERE" clauses seem to be ignored. The secondary concern is that a restrictive "TOPN()" is applied to the DAX query; that restriction rarely retrieves enough data from the dataset model, especially when the WHERE clause is omitted:
It sounds like you are encouraging us to use the sempy ("evaluate_whatever") methods on the driver as a fallback whenever the spark native connector is misbehaving. Is that so? Are both of them supported as GA features of Fabric?
the spark native connector path you're using maps to this function in sempy: https://learn.microsoft.com/en-us/python/api/semantic-link-sempy/sempy.fabric?view=semantic-link-python#sempy-fabric-evaluate-measure
if you don't want to join queries over the semantic model and spark, I'd strongly recommend using the python API on the driver node.
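For reference, a rough sketch of what that driver-side path might look like (untested here; the dataset, measure, and column names are simply carried over from the Spark SQL cell above and are assumed to exist in the model):

```python
# Hypothetical driver-side equivalent of the failing Spark SQL query, using
# sempy.fabric.evaluate_measure as recommended above. Names are assumptions
# carried over from the repro, not verified against the actual model.
import sempy.fabric as fabric

result = fabric.evaluate_measure(
    dataset="RandomLengthModel",                       # semantic model name (assumed)
    measure=["USD Price MBF", "USD Price MSF"],        # assumed to be model measures
    groupby_columns=["Fiscal Week[Fiscal Week]", "Random[Code]"],
    filters={
        "Fiscal Week[Fiscal Year Number]": [2025],     # same predicates as the WHERE clause
        "Fiscal Week[Fiscal Week Number]": [2],
    },
)
print(result.head())
```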
BTW, should we move this discussion to a different github? Seems like this is only loosely related to the open source synapse.ml.
I think I understand that (A) the spark native connector is using (B) evaluate_measure ... but perhaps it is doing so on a remote executor rather than on the driver. So I think you are saying that a problem in one of these (A or B) will always affect the other, and that I can simplify the repro by swapping one for the other?
I have a tendency to gravitate towards spreading requests out to the executors, given my past experiences with Apache Spark. If a Spark developer does everything on the driver, people will tell us we are doing it wrong (i.e., why use a cluster at all?). The hope is that some day there will be optimizations or query hints that allow the work to be distributed across executors, and thereby improve the overall execution time.
Of course the bottleneck will ultimately just move. Slow operations on the Spark cluster will eventually be made faster, but the bottleneck will end up at the PBI dataset model... so it is really doubtful that it makes any difference whether queries are submitted from executors or the driver.
At the end, the only real benefit I expected to get out of the spark native connector is to avoid as much DAX as possible. ;) I love MDX and SQL but have some love and some hate for DAX.
Is the spark native connector at least supported?
... I have started getting doubts about that, given the obscure error: java.net.SocketTimeoutException: PowerBI service comm failed (https://WABI-WEST-US-C-PRIMARY-redirect.analysis.windows.net)
... resulting from a slight change in query syntax.
As per your strong recommendation, I have no doubt that I would be able to get the python API working on the driver, by hook or by crook. My only question at this point is whether to avoid the spark native connector for future workloads in pyspark.
I don't know your dataset size, but from past experiments, for most standard semantic model sizes you won't see any improvement by using spark or by trying to optimize by moving compute to executors. In general the recommendation to perform computation on the executors for spark jobs is reasonable, but that's for datasets of multiple GB/TB.
If your dataset fits into memory on the driver node, you're probably even faster as you don't have any distributed system overhead.
Right. Most of my PBI datasets are small. At my company I'm guessing that 99% of our PBI datasets are under 5 GB, (and could fit easily in duckdb or sqlite).
Still when running a solution on a Spark cluster, people expect to follow Spark patterns. I'm assuming that is why Microsoft created the spark native connector in the first place. Using SQL statements against PBI datasets is also appealing.
Based on your recommendation, I'll start using sempy on the driver ... in the pandas ecosystem ... and subsequently build the spark frame after the fact when I need one. Eg. via: https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html#DataFrame-Creation
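A minimal sketch of that pattern, assuming `result` is the pandas/Fabric DataFrame returned by an evaluate_measure call like the one sketched earlier (the view name here is made up):

```python
# Promote the driver-side pandas result to a Spark DataFrame only when a
# Spark frame is actually needed downstream; `spark` is the session that
# Fabric notebooks provide out of the box.
spark_df = spark.createDataFrame(result)
spark_df.createOrReplaceTempView("random_length_metrics")  # hypothetical view name
spark_df.show(5)
```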
In the future we may need to combine this data (PBI models) with some other pre-existing spark solution, or delta-table or whatever. Whenever that happens it feels a bit "dirty" if one piece of data forces the whole business to be collect()'ed up to the driver. To avoid that dirty operation, a spark developer would typically push even the small datasets down to the workers. (And the native spark connector would theoretically save us from writing that code ourselves.)
@eisber
There is an ICM now (finally). It generally takes two weeks for Mindtree to open one with Microsoft. The ICM is attached to the following SR: TrackingID#2502110040012091
The engineer has a full repro. First he built a repro using the customer profitability sample dataset, and that works fine. Then he switched to using my dataset and got a similar error in synapse.ml:
I think he should have attached the pbix and notebook to the ICM. Any help would be appreciated. The Microsoft side of these support tickets generally takes even longer than the Mindtree side. (The only exceptions to that are when we can motivate an FTE to find their own reasons to assist. )
My only goal for now is to understand the error message and possibly get past it. Any error message would be more meaningful than the SocketTimeoutException that we see above. Another potential goal would be to get something added to the "known issues" list for Fabric. I think that is supposed to happen for bugs that impact GA features.
... Just to re-iterate, we think the "WHERE" clause is causing the majority of the problem, although it's not clear why that would lead to a SocketTimeoutException. There was a "WHERE" clause in the customer profitability examples, and it works fine. It would be nice to figure out the pattern that explains when these clauses cause problems.
Hi @eisber I think there is finally an ICM on the PG side. It is frustrating that it always takes two or three weeks.
They are also telling me to use evaluate_dax/evaluate_measure on the driver.
I thought this "spark native connector" was GA, not preview. Does it not come with any support? Why won't they at least tell me the reason for the error message?
In my opinion there needs to be public-facing communication about what parts of sempy are expected to work well, what parts are not expected to work well, and what parts are unsupported altogether. I have spent an excessive amount of time on this and I'm starting to feel a bit like a guinea pig.
There seem to be multiple issues: (a) error propagation with the spark native connector for your scenario, and (b) something related to the WHERE clause.
I suggested evaluate_measure in the ICM to find the root cause of the WHERE-clause issue, as evaluate_measure uses the same backend as semantic link spark native.
Thanks @eisber
For the benefit of others who encounter the same error message, my SR case is TrackingID#2502110040012091. ... there is a related ICM, a full repro, and we made a tremendous amount of progress thanks to a very helpful FTE who was willing to assist on the PG side!
As I understand it, the problem is related to how the SQL dialect used in spark is optimized. Sometimes the predicates are propagated back to the semantic model and sometimes they are not. When predicates are NOT propagated back to the semantic model, there is a much higher risk of hitting a restrictive timeout.
The restrictive timeout is a client-side timeout that fires after ~10 seconds. It normally isn't observed for about six minutes, because of the several layers of retries going on (both in the connector and in spark's "maxattempts"). As I understand it, the 10 second timeout is implemented by disconnecting from the DAX query on the client side, in the spark native connector.
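Purely as an illustration of that arithmetic (the connector-side retry count below is an assumption; the four task attempts come from the "failed 4 times" message in the stack trace):

```python
# Back-of-the-envelope estimate of why a ~10 s client-side timeout only
# surfaces after roughly six minutes once the retry layers stack up.
client_timeout_s = 10       # observed client-side timeout per request
connector_retries = 9       # assumed retries inside the connector per task attempt
spark_task_attempts = 4     # "Task 0 in stage 5.0 failed 4 times" in the stack trace

total_s = client_timeout_s * connector_retries * spark_task_attempts
print(f"error surfaces after ~{total_s} seconds")   # 360 s, about six minutes
```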
Ideally the 10 second timeout would be configurable, but even if it is not, I think there are other opportunities here to (1) reduce the number of retries, so that a 10 second failure doesn't get repeated for the next six minutes, and (2) improve the error message so that users get something a little more actionable and can change course more quickly.

I am not opposed to workarounds, or even bugs. The thing that is most troubling is not having an error that can be easily googled, and thereby being blocked on a project for a period of time. NOTE: the Microsoft product teams (PG) have fewer upstream dependencies than a customer, and the ones they have are typically open source, so there is a tremendous amount of transparency. In contrast, a customer can be blocked by an opaque issue for weeks or months at a time, especially if the error/symptom can't be easily googled.

I think this github discussion will be extremely useful to other customers who encounter this issue, and they will certainly appreciate the guidance you gave about trying "evaluate_measure" as an alternative approach.
As a final note, I wanted to point out that the two self-service tools we used to investigate were: (1) the explain plan in spark, and (2) a SQL Profiler trace showing the DAX activity in the remote semantic model. Between the two of these, it is a little easier to figure out what the spark native connector is doing. (Although the ten second retries are still a little hard to accept.)
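For anyone else debugging this, a sketch of the first of those checks in PySpark, using the query from the repro above; the idea is to look at whether the WHERE predicates show up in the data source scan node or only as a Spark-side Filter:

```python
# Run the repro query through spark.sql and inspect the plan to see how the
# predicates are handled by the spark native connector.
plan_df = spark.sql("""
    SELECT `Fiscal Week[Fiscal Week]`, `Random[Code]`,
           SUM(`USD Price MBF`), SUM(`USD Price MSF`)
    FROM pbi.RandomLengthModel._Metrics
    WHERE `Fiscal Week[Fiscal Year Number]` = 2025
      AND `Fiscal Week[Fiscal Week Number]` = 2
    GROUP BY `Fiscal Week[Fiscal Week]`, `Random[Code]`
""")
plan_df.explain(mode="extended")  # check whether the filters reach the scan node
```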