Purview-ADB-Lineage-Solution-Accelerator

Failures in OpenlineageIn Function

Open Kishor-Radhakrishnan opened this issue 1 year ago • 13 comments

We have implemented Purview in many Databricks workspaces, and some lineage is missing in the UI. While troubleshooting, we can see that the function app is failing repeatedly because of the Event Hub message size limit. We suspect this is causing the lineage gaps.

Result: Error in OpenLineageIn function: The message (id:9471c2ab-59dc-41e6-9b44-38705ac613b3, size:23347619 bytes) is larger than is currently allowed (1048576 bytes). (eventhubmaestroadbpct6)
Exception: Azure.Messaging.EventHubs.EventHubsException(MessageSizeExceeded): The message (id:9471c2ab-59dc-41e6-9b44-38705ac613b3, size:23347619 bytes) is larger than is currently allowed (1048576 bytes).

Is there any option to work around this Event Hub limitation so that lineage events are not lost?

Kishor-Radhakrishnan avatar Mar 22 '23 06:03 Kishor-Radhakrishnan

@Kishor-Radhakrishnan I apologize for the delay in getting back to you.

You can enable a configuration setting that will remove the spark-plan if it exceeds a certain size.

https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator/blob/release/2.3/docs/configuration.md#experimental-app-settings

Set maxQueryPlanSize to a value smaller than 1048576. The rest of the OpenLineage payload also counts toward the limit, so don't set it to exactly 1048576; something like 1000000 works if you want to maximize how often you receive the spark plan in the properties of the databricks_notebook_task. (The spark plan is just raw JSON from Spark showing the plan; no additional UI feature uses it and none is planned.)

If you don't care about seeing the spark plan text / JSON in your properties, you could set maxQueryPlanSize even lower to ensure lineage events always get through, even when you have a large number of inputs (which take up more bytes in the message going to Event Hub).
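To illustrate the idea behind the setting, here is a minimal Python sketch of a size-threshold check. It is not the accelerator's own code: the helper names are hypothetical, and the location of the plan under `run.facets["spark.logicalPlan"]` is an assumption about the event layout.

```python
import copy
import json

EVENT_HUB_LIMIT_BYTES = 1_048_576  # limit quoted in the error message above


def payload_size(event: dict) -> int:
    """Size in bytes of the event once serialized to JSON (UTF-8)."""
    return len(json.dumps(event).encode("utf-8"))


def strip_large_plan(event: dict, max_query_plan_size: int) -> dict:
    """Return a copy of the event, without the Spark plan if the plan is too big."""
    trimmed = copy.deepcopy(event)
    facets = trimmed.get("run", {}).get("facets", {})
    plan = facets.get("spark.logicalPlan")  # assumed location of the plan
    if plan is not None:
        plan_size = len(json.dumps(plan).encode("utf-8"))
        if plan_size > max_query_plan_size:
            del facets["spark.logicalPlan"]
    return trimmed


# Hypothetical event with a roughly 2 MB plan.
event = {
    "eventType": "COMPLETE",
    "run": {
        "runId": "example",
        "facets": {"spark.logicalPlan": {"plan": "x" * 2_000_000}},
    },
    "inputs": [],
    "outputs": [],
}
small = strip_large_plan(event, max_query_plan_size=1_000_000)
```

With a threshold of 1000000, roughly 48576 bytes (1048576 - 1000000) remain for everything else in the message, which is why a lower threshold leaves more room for inputs and outputs.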

wjohnson avatar Mar 27 '23 13:03 wjohnson

@wjohnson I tried a much lower value, but we are still seeing failures.

I set the value to 10000.

Kishor-Radhakrishnan avatar Mar 29 '23 08:03 Kishor-Radhakrishnan

Just for the sake of testing, can you make it a much smaller size @Kishor-Radhakrishnan? Try setting it to 50 and let us know the outcome.

hmoazam avatar Apr 08 '23 20:04 hmoazam

I tried that. We are still seeing many errors.

Kishor-Radhakrishnan avatar Apr 12 '23 09:04 Kishor-Radhakrishnan

> @wjohnson I tried a much lower value, but we are still seeing failures.
>
> I set the value to 10000.

Would you be able to share the latest logs after adding this setting? You should see something like this in the OpenLineageIn logs: "Query Plan size exceeded maximum. Removing query plan from OpenLineage Event"

wjohnson avatar Apr 17 '23 13:04 wjohnson

Yes, I am seeing that in the logs. But it still looks like many events are exceeding the Event Hub limit.

Kishor-Radhakrishnan avatar Apr 17 '23 13:04 Kishor-Radhakrishnan

Screenshot 2023-04-17 at 7 12 28 PM

Kishor-Radhakrishnan avatar Apr 17 '23 13:04 Kishor-Radhakrishnan

query_data.csv.zip

Latest exception logs

Kishor-Radhakrishnan avatar Apr 17 '23 13:04 Kishor-Radhakrishnan

@Kishor-Radhakrishnan thank you for your patience! These last logs helped us identify an error in the OpenLineageIn code that removed the spark plan in one variable but failed to remove the spark plan in another variable. That other variable was the one actually sending data to Event Hub!

I've put the changes in this branch: https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator/tree/hotfix/maxQueryPlanOLIn

Would you be able to build this branch, deploy it to your environment, and confirm that maxQueryPlanSize is being respected for OpenLineageIn and PurviewOut?

Thank you again for all of your patience.

wjohnson avatar Apr 18 '23 07:04 wjohnson

Screenshot 2023-04-18 at 4 06 32 PM

I have deployed the latest fix. Let's keep monitoring the failures. It looks like the plan is being omitted now; see the screenshot of the latest logs after the fix.

Kishor-Radhakrishnan avatar Apr 18 '23 10:04 Kishor-Radhakrishnan

Unfortunately, we still have many failures with the same issue, although the failure count appears to have gone down.

4/19/2023, 3:51:51.4891988 PM (Local time)

Result: Error in OpenLineageIn function: The message (id:232abc39-61f0-45c6-8644-f53d68c84ecd, size:36280210 bytes) is larger than is currently allowed (1048576 bytes). (eventhubmaestroadbpct6)
Exception: Azure.Messaging.EventHubs.EventHubsException(MessageSizeExceeded): The message (id:232abc39-61f0-45c6-8644-f53d68c84ecd, size:36280210 bytes) is larger than is currently allowed (1048576 bytes). (eventhubmaestroadbpct6)
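When a message is still far over the limit after the plan is removed, one way to narrow down the cause is to measure how many bytes each part of the event contributes. The following is a rough Python sketch, not part of the accelerator; the helper names and the sample event layout are assumptions for illustration.

```python
import json
from typing import List, Tuple


def section_sizes(section: dict) -> List[Tuple[str, int]]:
    """Serialized size in bytes of each key in a dict, largest first."""
    sizes = [
        (key, len(json.dumps(value).encode("utf-8")))
        for key, value in section.items()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def run_facet_sizes(event: dict) -> List[Tuple[str, int]]:
    """Same breakdown for the run facets (assumed to live under run.facets)."""
    return section_sizes(event.get("run", {}).get("facets", {}))


# Hypothetical event in which the inputs dominate the payload.
event = {
    "eventType": "COMPLETE",
    "run": {"runId": "example", "facets": {"someFacet": {"value": "y" * 100}}},
    "inputs": [{"namespace": "ns", "name": f"table_{i}"} for i in range(5000)],
    "outputs": [{"namespace": "ns", "name": "target"}],
}

for name, size in section_sizes(event):
    print(f"{name}: {size} bytes")
```

Running the same breakdown on a captured oversized event would show whether the bulk sits in a run facet, the inputs, or the outputs.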

Kishor-Radhakrishnan avatar Apr 20 '23 06:04 Kishor-Radhakrishnan

Hi @Kishor-Radhakrishnan, we are facing a similar issue with Spark jobs in my organization. Did you manage to make it work? If so, how? Thanks. Cc: @wjohnson

rabbyn avatar Oct 20 '23 21:10 rabbyn

This will be fixed in the next release, where we will remove the spark plan and then the column lineage information if the payload is still larger than the 1 MB limit. Reducing mount points will be considered in the future, as in #219.

It's still possible that there will be sections of the payload that result in too much information such as:

  1. Mount Points on the cluster
  2. Too many inputs
  3. Too many outputs

But only the mount points issue has been encountered so far. It still needs to be determined how to solve the mount point issue.
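The trimming order described above (spark plan first, then column lineage, each only if the payload is still too large) could be sketched in Python as follows. This is an illustration under assumptions, not the release's implementation: the function names are hypothetical, and the facet locations (`run.facets["spark.logicalPlan"]` and `outputs[].facets["columnLineage"]`) are assumed.

```python
import copy
import json
from typing import Tuple

MAX_PAYLOAD_BYTES = 1_048_576  # the 1 MB limit from the error messages


def size_of(event: dict) -> int:
    """Size in bytes of the event once serialized to JSON (UTF-8)."""
    return len(json.dumps(event).encode("utf-8"))


def trim_to_limit(event: dict, limit: int = MAX_PAYLOAD_BYTES) -> Tuple[dict, bool]:
    """Trim a copy of the event step by step; return it and whether it now fits."""
    trimmed = copy.deepcopy(event)
    if size_of(trimmed) <= limit:
        return trimmed, True

    # Step 1: drop the spark plan (assumed facet name).
    trimmed.get("run", {}).get("facets", {}).pop("spark.logicalPlan", None)
    if size_of(trimmed) <= limit:
        return trimmed, True

    # Step 2: drop column lineage from every output (assumed facet name).
    for output in trimmed.get("outputs", []):
        output.get("facets", {}).pop("columnLineage", None)

    # The event can still be too large, e.g. because of mount points
    # or a very large number of inputs or outputs.
    return trimmed, size_of(trimmed) <= limit


def make_event(plan_chars: int, lineage_chars: int) -> dict:
    """Build a hypothetical event with a plan and column lineage of given sizes."""
    return {
        "run": {"facets": {"spark.logicalPlan": {"plan": "p" * plan_chars}}},
        "inputs": [],
        "outputs": [
            {"name": "target", "facets": {"columnLineage": {"fields": "c" * lineage_chars}}}
        ],
    }


# Removing the plan alone is enough here, so column lineage is kept.
event_a, fits_a = trim_to_limit(make_event(900_000, 900_000))
# Here both the plan and the column lineage have to go.
event_b, fits_b = trim_to_limit(make_event(2_000_000, 2_000_000))
```

Returning a flag alongside the trimmed event lets the caller decide what to do with the remaining cases, which this sketch does not attempt to solve.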

wjohnson avatar Dec 30 '23 06:12 wjohnson