Managed clusters: Support storing stateful service data on local VM temp disk + Ephemeral OS disks
Storing state reliably on the local virtual machines (with automatic replication) is a powerful and truly differentiating feature of Service Fabric. Combined with the ephemeral OS disks feature, Service Fabric clusters can function entirely without a dependency on external storage.
As the SF team well knows, this kind of system has multiple benefits, including:
- Simplicity of infrastructure and minimal dependencies, which often translates to fewer things that can go wrong and increased reliability
- Minimal read/write overhead and operation latency
- Cost benefits
- Data store scalability/performance depends on the allocated local VM resources, not an external store (which may provide varying levels of performance)
I find it difficult to understand why managed clusters have abandoned this core differentiating feature of Service Fabric. Yes, there are tradeoffs in each model, but the customer should have the choice. I came to Service Fabric because of the powerful stateful services story, which I feel is somewhat crippled in managed clusters.
Please consider enabling managed cluster scenario where all data (and OS) is kept local to the VM.
Please correct me if I have understood something wrong about managed clusters.
@juho-hanhimaki where have you read that stateful services are not supported on SF Managed Clusters?
@olitomlinson I think you misunderstood me.
I am well aware stateful services are supported on managed clusters. The problem is that managed clusters store data in Azure Storage instead of on the local VM disk. That can be suboptimal for some scenarios / users.
With normal SF clusters you don't need Azure Storage because the data is replicated and kept available within the cluster itself.
@juho-hanhimaki My apologies. I’ve not come across this limitation in the docs, can you point me to it? Many thanks!
I don't know if there's actual documentation about this. But the managed clusters announcement blog post talks about the fact that the storage is now based on managed disks instead of the temp disk.
https://techcommunity.microsoft.com/t5/azure-service-fabric/azure-service-fabric-managed-clusters-are-now-in-public-preview/ba-p/1721572
The topic has also come up during SF community Q&As.
Thanks @juho-hanhimaki
I interpreted the blog post as additional support for Managed Disks.
But, yes, additional clarity would be nice here
Thank you for the feedback @juho-hanhimaki @olitomlinson
In the preview that is currently available, the data disks for Stateful Services on a managed cluster only use managed disks. We are working to enable support which will allow you to select the specific managed disk SKU in the near future.
I have added this work item to the backlog, and will update it when we have more information to share on support for using the VM temp disk for Stateful Services.
Thanks @peterpogorski
A couple of further questions:
- Do you have data to share on the performance difference between VM temp disk vs Managed Storage?
As @juho-hanhimaki mentioned, the benefits of local data performance in the cluster are huge and are pretty much one of the biggest attractions/differentiators against other orchestrators.
Assuming there is a significant difference in latency here, and this is within tolerance of most customers, does this mean that read/write latency of state is no longer a killer feature for Service Fabric moving forwards?
- For stateful services, am I right in assuming that each node has its own Managed Disk? If that's the case, does that mean that E2E latency is now impacted two-fold (see the sketch after this list):
- Time to achieve quorum of writes (as per non-managed cluster) across the cluster
- Time for durable / 2-phase commit writes inside of the Managed Disk?
- Are you placing any future bets here on disaggregated architecture (offered as Managed Disks) providing a comparable level of performance to local disk? As per the Mark Russinovich demo at Ignite 2020?
If so, I could understand the move towards Managed Disks being able to satisfy the local performance requirements that we expect with the traditional Service Fabric temp disk model.
- More important than any of the other questions - from a customer persona perspective, where is the Service Fabric proposition heading? What are the key differentiators that will set it apart from, say, k8s + Dapr, over the next few years?
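To make the latency question above concrete, here is a rough, hypothetical model (my own sketch, not anything from the SF runtime) of how those two components might combine, assuming each replica persists a write to its own disk before acknowledging and the primary waits for a write quorum. The numbers are purely illustrative:

```python
# Rough mental model: each replica persists the write to its own disk (temp disk
# or managed disk) before acknowledging; the primary completes the write once a
# write quorum of the replica set has acknowledged. Illustrative numbers only.
import random

def replica_ack_ms(network_rtt_ms: float, disk_flush_ms: float) -> float:
    """Time for one secondary to receive, persist, and acknowledge a write."""
    return network_rtt_ms + disk_flush_ms

def e2e_write_ms(replica_set_size: int, network_rtt_ms: float, disk_flush_ms: float) -> float:
    """Primary's local flush plus the wait for a majority of acknowledgements."""
    quorum = replica_set_size // 2 + 1
    if quorum == 1:
        return disk_flush_ms
    # Jitter the secondaries slightly; the primary waits for the (quorum - 1)
    # fastest secondary acknowledgements.
    acks = sorted(
        replica_ack_ms(network_rtt_ms, disk_flush_ms * random.uniform(0.8, 1.2))
        for _ in range(replica_set_size - 1)
    )
    return max(disk_flush_ms, acks[quorum - 2])

# Local temp disk (sub-millisecond flush) vs. a remote managed disk (a few ms):
print(e2e_write_ms(3, network_rtt_ms=0.5, disk_flush_ms=0.2))
print(e2e_write_ms(3, network_rtt_ms=0.5, disk_flush_ms=3.0))
```

The point of the sketch: the disk flush is not a second serial phase on top of quorum replication; it happens on each replica before that replica acknowledges, so a slower disk raises the latency of every acknowledgement rather than adding a separate step.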
@craftyhouse what's the plan to support ephemeral disks in SFMC?
What will the migration scenario look like once temp disk support is available?
We are working to add support for temp disks on stateless node types. The migration will require a new node type and migrating the workload over.
We haven't seen any concerns since launch from customers around performance or latency of the managed disk, but would love to hear more if there has been any impact there. A lot of what makes SFMC work for stateful workloads relies on managed data disks and we will continue to expand those options too.
I'm actually more concerned about the additional cost of using managed disks.
How is SFMC different from plain old SF when it comes to the stateful workloads?
I should add: it was the disk operations cost from using Standard SSD disks that caught us by surprise. Looks like we should be able to get that cost under control by switching to Premium SSD, in which case we can probably live with running our stateful workloads on managed disks.
I suppose support for Ephemeral OS disks is still interesting though.
There is no difference in how the Service Fabric runtime handles stateful workloads whether deployed using classic or managed clusters. By using managed disks, SFMC is able to benefit customers by:
- mitigating the VM-down scenario, where we can fully replace the underlying VM and attach the existing data disk without data loss and without customer intervention
- safely speeding up reboot operations by eliminating the need to hydrate other nodes with stateful data for most scenarios
- safely supporting a lower number of nodes for primary and secondary node types, especially in the case of a zone-resilient cluster
- flexibility in disk sizing and performance separate from the VM SKU, which aligns with the disaggregated architecture pattern
Hope that helps
On the topic of encryption at rest
Will SFMC support for temp disks work with encrypted disks?
Encryption at host requires enabling an EncryptionAtHost feature flag in the subscription. Is there a good reason not to enable this flag in the subscription?
Having stateful services use temporary disks is still a top priority for us.
We can't afford to pay extra for managed disks (assuming poor IOPS/$).
As I understand it, managed disks also hurt overall reliability since they are LRS. If a single availability zone is impacted, data disks can be down even if the cluster VMs are up in the two other zones. Having VMs up with data disks down makes the whole cluster useless for any stateful workloads.
Managed clusters are not very interesting to us until temp disk for data and ephemeral OS disk are supported. We don't want to depend on any external disk service. VMs have all the resources locally and Service Fabric coordinates services/replication. Maximum performance, cost effectiveness and availability.
Backups and long term storage (historical IoT data) can be done to external ZRS storage. Service availability is not dependent on external storage.
Please direct questions to one of the forums documented here https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-support#post-a-question-to-microsoft-qa
In short, as long as the restrictions do not apply to your scenario, we recommend using host encryption as called out in the documentation. Restrictions: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-enable-host-based-encryption-portal#restrictions
SFMC docs: https://docs.microsoft.com/en-us/azure/service-fabric/how-to-managed-cluster-enable-disk-encryption?tabs=azure-powershell#enable-encryption-at-host-preview
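For reference, a minimal sketch of registering that subscription feature flag programmatically, assuming the azure-identity and azure-mgmt-resource Python packages (the portal or the az CLI achieve the same thing; the subscription ID below is a placeholder):

```python
# Minimal sketch: register the EncryptionAtHost subscription feature flag.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import FeatureClient

subscription_id = "<your-subscription-id>"  # placeholder
client = FeatureClient(DefaultAzureCredential(), subscription_id)

# Request registration on the Microsoft.Compute resource provider ...
client.features.register("Microsoft.Compute", "EncryptionAtHost")

# ... then check back until the state reports "Registered".
feature = client.features.get("Microsoft.Compute", "EncryptionAtHost")
print(feature.properties.state)
```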
> Having stateful services use temporary disks is still a top priority for us.
Classic is the currently supported path for this, but we hear you that you want to use managed clusters and will continue to think about ways to address this.
> We can't afford to pay extra for managed disks (assuming poor IOPS/$).
I did some rough math on VMSS SKUs with and without a temp disk per month using the calculator. I'd be glad to discuss more offline if it would be helpful. In summary, there are newer SKUs that are now GA that are cheaper than what was previously available, as they do not have a temp disk.
Example with temp disk: Dv2 series D2 v2, 2 vCPU, 7 GB RAM, 100 GB temp disk = ~$184/month
Example without temp disk: Dv5 series D2 v5, 2 vCPU, 8 GB RAM = ~$149/month, plus a 128 GiB Premium SSD managed disk = ~$18/month
~$167 total compared to ~$184
With Dv5 and a managed disk you get more RAM and storage, but you are correct that there is a lower ceiling for IOPS. We haven't heard feedback where this has come into play with real-world workloads yet, but if you have any data to share that would be helpful.
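Re-running that napkin math in a few lines (the dollar figures are the approximate ones quoted above, not current list prices):

```python
# Approximate monthly figures quoted in this thread, not current list prices.
d2_v2_with_temp_disk = 184      # ~$/month, Dv2 D2 v2 with the 100 GB temp disk
d2_v5_compute = 149             # ~$/month, Dv5 D2 v5, no temp disk
premium_ssd_128gib = 18         # ~$/month, managed data disk

d2_v5_total = d2_v5_compute + premium_ssd_128gib
print(d2_v5_total)                          # 167
print(d2_v2_with_temp_disk - d2_v5_total)   # 17 -> ~$17/month cheaper per VM
```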
> As I understand it, managed disks also hurt overall reliability since they are LRS. If a single availability zone is impacted, data disks can be down even if the cluster VMs are up in the two other zones. Having VMs up with data disks down makes the whole cluster useless for any stateful workloads.
Not sure I follow this concern about availability. Each VM has a managed disk and if you have a zone resilient cluster and experience a zone down (az01 goes down), az02/03 would still be up and fully operational given the VMs and disks are localized to each zone.
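A small sanity check of that reasoning, assuming a zone-resilient cluster with one member of a replica set in each of three zones and Service Fabric's usual majority write quorum (my illustration, not runtime code):

```python
# Zone-down scenario: one replica per zone, majority write quorum.
def write_quorum(replica_set_size: int) -> int:
    """Majority needed for a write to commit."""
    return replica_set_size // 2 + 1

replicas_per_zone = {"az01": 1, "az02": 1, "az03": 1}
replica_set_size = sum(replicas_per_zone.values())

down_zone = "az01"  # outage takes out that zone's VMs and their attached disks
reachable = sum(n for zone, n in replicas_per_zone.items() if zone != down_zone)

print(write_quorum(replica_set_size))               # 2
print(reachable >= write_quorum(replica_set_size))  # True: writes can still commit
```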
> Managed clusters are not very interesting to us until temp disk for data and ephemeral OS disk are supported. We don't want to depend on any external disk service. VMs have all the resources locally and Service Fabric coordinates services/replication. Maximum performance, cost effectiveness and availability.
> Backups and long term storage (historical IoT data) can be done to external ZRS storage. Service availability is not dependent on external storage.
@craftyhouse
I asked the question earlier in this thread but I don’t think it ever got an answer.
“For stateful services, am I right in assuming that each node has its own Managed Disk?”
And now you’ve just confirmed with
“Each VM has a managed disk”
I strongly suspect that there is confusion here from people with experience of non-managed clusters, who have not yet understood that you still use one Managed Disk per Node. A Managed Disk is not shared across all nodes.
Are writes still performed and committed using the same quorum semantics as in non-managed clusters?
It might be worth making that point very explicit in the public documentation of Managed Clusters, as it’s never been clear to me until now.
I also asked the following question, but never got an answer
“Are you placing any future bets here on disaggregated architecture (offered as Managed Disks) providing a comparable level of performance to local disk? As per the Mark Russinovich demo at Ignite 2020? If so, I could understand the move towards Managed Disks being able to satisfy the local performance requirements that we expect with the traditional Service Fabric temp disk model.”
I would still hypothesise that the disaggregated architecture innovation would raise the ceiling latency of what is possible with Managed Disks…? Any noise coming from Azure on disaggregated architecture being utilised yet?
> @craftyhouse
> I asked the question earlier in this thread but I don’t think it ever got an answer.
> “For stateful services, am I right in assuming that each node has its own Managed Disk?”
> And now you’ve just confirmed with
> “Each VM has a managed disk”
> I strongly suspect that there is confusion here from people with experience of non-managed clusters, who have not yet understood that you still use one Managed Disk per Node. A Managed Disk is not shared across all nodes.
Ah, I see. I agree it would be helpful to show how we wire it up to help clarify: a diagram depicting the disk > VM > node type > cluster relationship.
In text form, I have a managed cluster with two node types; NT2 here as an example:
Node - Managed disk
NT2_0 - NT2_NT2_1_OsDisk_1_170875e14848425c97cafa8ac9bacc94
NT2_1 - NT2_NT2_0_OsDisk_1_4a5e3308c7024ab9a96925d44150c835
As you can see, they are unique disks. We support creating and attaching many per VM (a...z basically) with the latest preview api.
> Are writes still performed and committed using the same quorum semantics as in non-managed clusters?
SFMC does not modify the way Service Fabric runtime behaves and leverages the exact same bits. The semantics that you are familiar with are still the same.
> It might be worth making that point very explicit in the public documentation of Managed Clusters, as it’s never been clear to me until now.
Ack :). Thank you
> I also asked the following question, but never got an answer
> “Are you placing any future bets here on disaggregated architecture (offered as Managed Disks) providing a comparable level of performance to local disk? As per the Mark Russinovich demo at Ignite 2020? If so, I could understand the move towards Managed Disks being able to satisfy the local performance requirements that we expect with the traditional Service Fabric temp disk model.”
> I would still hypothesise that the disaggregated architecture innovation would raise the ceiling latency of what is possible with Managed Disks…? Any noise coming from Azure on disaggregated architecture being utilised yet?
> I did some rough math on VMSS SKUs with and without a temp disk per month using the calculator. I'd be glad to discuss more offline if it would be helpful. In summary, there are newer SKUs that are now GA that are cheaper than what was previously available, as they do not have a temp disk.
> Example with temp disk: Dv2 series D2 v2, 2 vCPU, 7 GB RAM, 100 GB temp disk = ~$184/month
> Example without temp disk: Dv5 series D2 v5, 2 vCPU, 8 GB RAM = ~$149/month, plus a 128 GiB Premium SSD managed disk = ~$18/month
> ~$167 total compared to ~$184
> With Dv5 and a managed disk you get more RAM and storage, but you are correct that there is a lower ceiling for IOPS. We haven't heard feedback where this has come into play with real-world workloads yet, but if you have any data to share that would be helpful.
We haven't had the time to benchmark managed clusters yet, but the cost/perf disadvantage seems obvious.
SF (classic): Standard_D2ds_v5, 2 vCPU, 8 GB RAM, 75 GB temp disk (9000 IOPS, 125 MBps), ephemeral OS disk = ~$159
SF total per VM: ~$159
SFMC: Standard_D2s_v5, 2 vCPU, 8 GB RAM, no temp disk, no ephemeral OS disk = ~$145
OS disk: Standard SSD E4 = ~$2.50
Data disk: Premium SSD P40 (7500 IOPS, 250 MBps) = ~$259
SFMC total per VM: ~$406.50
As you can see, the SFMC setup costs over twice as much and still has fewer IOPS. Our workload requires a performant disk (a lot of IIoT devices constantly logging). SFMC makes no sense unless there is something obviously wrong with my napkin math.
The non-temp-disk variant of the v5 VM is 14 dollars cheaper. 14 dollars is not enough for performant premium storage, only a slow one.
Of course we could try and see what we get from the cheapish P10 (500 IOPS) disk, but my assumption would be that our ability to process data would be severely degraded compared to the v5 VM temp disk. And still we would pay a few extra dollars. Considering the P10 disk itself is replicated storage, I'd find it really strange if it were on par with simple physical local storage on the VM.
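Restating the comparison above with the IOPS-per-dollar gap made explicit (same approximate figures as in this comment):

```python
# Approximate figures quoted in this comment, not current list prices.
configs = [
    ("SF classic: D2ds_v5 + 75 GB temp disk", 159.0, 9000),
    ("SFMC: D2s_v5 + E4 OS disk + P40 data disk", 145.0 + 2.5 + 259.0, 7500),
]
for label, monthly_cost, iops in configs:
    print(f"{label}: ~${monthly_cost:.2f}/month, {iops / monthly_cost:.0f} IOPS per $")
# SF classic: D2ds_v5 + 75 GB temp disk: ~$159.00/month, 57 IOPS per $
# SFMC: D2s_v5 + E4 OS disk + P40 data disk: ~$406.50/month, 18 IOPS per $
```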
There is now a useEphemeralOSDisk property on SFMC: https://learn.microsoft.com/en-us/azure/service-fabric/how-to-managed-cluster-ephemeral-os-disks
The only problem is:
https://learn.microsoft.com/en-us/dotnet/api/azure.resourcemanager.servicefabricmanagedclusters.servicefabricmanagednodetypedata.datadiskletter?view=azure-dotnet
> Managed data disk letter. It can not use the reserved letter C or D and it can not change after created.
So we still need a managed disk for the stateful services :(
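To illustrate how those pieces fit together, here is a hypothetical node type fragment written as a plain Python dict rather than the actual ARM schema. The property names are the ones referenced in the links above (useEphemeralOSDisk, dataDiskLetter) plus an assumed data-disk-size property; check the linked docs for the exact names and casing before using them:

```python
# Hypothetical node type fragment, not the real ARM schema; names and casing
# may differ, see the SFMC docs linked above.
node_type = {
    "vmSize": "Standard_D2s_v5",
    "useEphemeralOSDisk": True,   # OS disk lives on local VM storage
    "dataDiskLetter": "S",        # must not be the reserved C or D, fixed after creation
    "dataDiskSizeGB": 128,        # stateful data still sits on a managed data disk
}
```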