overwatch
Enable multiple workspaces per eventhub
Currently the technical architecture requires 1 EH namespace per region and 1 EH per Databricks workspace. We'd like to relax this requirement to 1 EH namespace and 1 EH per region.
For customers with large numbers of workspaces, this will simplify infrastructure management and lower costs.
Event Hubs is priced per namespace, not per event hub, so this change won't affect pricing. Per the pricing docs:
Throughput units apply to all event hubs in a namespace
also see the FAQ
Right, but I believe there is a limit on event hubs per namespace (or per subscription). After investigation, I've determined that this is possible, but it would require another customer-supplied mapping from the Azure workspace object path to the Databricks workspace id. I hate to add yet another configuration, but for now it will be necessary to enable this.
Below is the only key we get directly from EH, so we'd need to map that to a workspace id.
We can get the subscription ID from cluster tags, but I don't see the workspace name or resource group anywhere in those tags.
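For what it's worth, standard ARM resource IDs follow the documented shape /subscriptions/&lt;sub&gt;/resourceGroups/&lt;rg&gt;/providers/Microsoft.Databricks/workspaces/&lt;name&gt;, so both the resource group and workspace name could be recovered from the resourceId that arrives with each EH record. A hedged sketch (the object name and sample id below are mine, not Overwatch's):

```scala
// Sketch, not Overwatch code: recover (subscriptionId, resourceGroup,
// workspaceName) from an ARM resource ID of the standard shape.
object ResourceIdParser {
  // /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<name>
  private val DatabricksWorkspace =
    "(?i)/subscriptions/([^/]+)/resourceGroups/([^/]+)/providers/Microsoft\\.Databricks/workspaces/([^/]+)".r

  /** Returns Some((subscriptionId, resourceGroup, workspaceName)) when the id matches. */
  def parse(resourceId: String): Option[(String, String, String)] =
    DatabricksWorkspace
      .findFirstMatchIn(resourceId)
      .map(m => (m.group(1), m.group(2), m.group(3)))
}
```

With a hypothetical id like /subscriptions/1234-abcd/resourceGroups/my-rg/providers/Microsoft.Databricks/workspaces/aott-db, parse returns Some(("1234-abcd", "my-rg", "aott-db")). That would avoid a separate customer-maintained mapping for those two fields, though we'd still need to link workspace name to workspace id.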
Adding to 6.1.3 for feasibility review
@alexott -- I have created this filter string to enable this -- but I'm now concerned that the amount of data scanned and thrown away will increase runtimes and/or costs (EH egress and compute). It still seems like it may be best practice to have one EH per workspace to limit these costs. Thoughts?
import org.apache.spark.sql.functions.lower

// subscription ID comes from cluster tags; the workspace deployment name
// must still be supplied by the customer (hard-coded here for testing)
val subscriptionID = spark.conf.get("spark.databricks.clusterUsageTags.azureSubscriptionId").toLowerCase
val workspaceDeploymentName = "AOTT-DB".toLowerCase

// keep only the records whose resourceId belongs to this workspace
val filterString = lower('resourceId).like(s"/subscriptions/$subscriptionID/%/$workspaceDeploymentName")

display(
  parsedEHDF
    .filter(filterString)
)
Note that the following also exists, where aott-db is the workspace name. I'm not sure if its pattern is consistent, but it's something to look into as we look further into this ticket.
@Sriram-databricks -- let's review perf differences in this and see if it makes sense (P1) -- if we cannot get it into 0.6.1.2 that's ok
I don't have concerns about egress costs - Event Hubs is always billed in terms of throughput units. But we should be careful about scheduling jobs at different times.
Really, I think it makes sense to implement this feature when we add support for running Overwatch outside the monitored workspace. In that case we can have one job that lands all Event Hubs data for all workspaces into Delta, and then run the individual per-workspace processes against that Delta.
Will not use EH for this -- investigate whether a Kafka-enabled EH can improve this.
Kafka won't really help much. I think putting multiple workspaces into the same event hub should be tied to the case where one job handles multiple workspaces - in that case we can land the raw EH messages into a partitioned Delta table and then consume from that Delta.
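A rough sketch of that land-then-consume pattern, assuming the azure-eventhubs-spark connector; the paths, checkpoint location, connection config, and the assumption that the payload's resourceId ends in .../workspaces/&lt;name&gt; are all placeholders of mine, not Overwatch behavior:

```scala
import org.apache.spark.sql.functions._

// Sketch only: one job lands raw EH messages for every workspace into a
// Delta table partitioned by workspace, so each per-workspace pipeline
// reads just its own partition instead of scanning and discarding the rest.
val raw = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap) // ehConf: an EventHubsConf built per the azure-eventhubs-spark docs
  .load()

val landed = raw
  .withColumn("body", 'body.cast("string"))
  // assumption: the audit payload carries a resourceId ending in /workspaces/<name>
  .withColumn("workspaceName",
    lower(regexp_extract(get_json_object('body, "$.resourceId"), "/workspaces/([^/]+)$", 1)))

landed.writeStream
  .format("delta")
  .partitionBy("workspaceName")
  .option("checkpointLocation", "/mnt/overwatch/_checkpoints/eh_land") // placeholder path
  .start("/mnt/overwatch/bronze/eh_raw") // placeholder path

// Each workspace's bronze run then consumes only its own partition, e.g.:
//   spark.read.format("delta").load("/mnt/overwatch/bronze/eh_raw")
//     .where('workspaceName === "aott-db")
```

Because the filter is on the Delta partition column, the per-workspace reads would prune to a single partition rather than re-scanning (and re-egressing) every workspace's messages from EH.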
Not feasible as-is, since all data from the other EHs on the same EH namespace would be ingested and then filtered out, significantly increasing bronze runtimes and EH egress costs.
Re-opening this, for review. It's possible a single EH bronze land could be created for all workspaces in a multi-workspace deployment reducing the need for 10s - 100s of EHs for large customers with 10s-100s of workspaces.
Goal here is to review feasibility and prioritize.