Managed clusters: Support storing stateful service data on local VM temp disk + Ephemeral OS disks

Storing state reliably and locally on the virtual machines (with automatic replication) is a powerful and truly differentiating feature of Service Fabric. Combined with the ephemeral OS disks feature, Service Fabric clusters can function entirely without a dependency on external storage.

As the SF team well knows, this kind of system has multiple benefits, including:

  • Simplicity of infrastructure and minimal dependencies, which often translates to fewer things that can go wrong and increased reliability
  • Minimal read/write overhead and operation latency
  • Cost benefits
  • Data store scalability/performance depends on the allocated local VM resources, not on an external store (which may provide varying levels of performance)

I find it difficult to understand why managed clusters have abandoned this core differentiating feature of Service Fabric. Yes, there are tradeoffs in each model, but the customer should have the choice. I came to Service Fabric because of the powerful stateful services story, which I feel is somewhat crippled in managed clusters.

Please consider enabling managed cluster scenario where all data (and OS) is kept local to the VM.

Please correct me if I have understood something wrong about managed clusters.

juho-hanhimaki · Feb 19 '21

@juho-hanhimaki where have you read that stateful services are not supported on SF Managed Clusters?

olitomlinson · Feb 21 '21

@olitomlinson I think you misunderstood me.

I am well aware that stateful services are supported on managed clusters. The problem is that managed clusters store data in Azure Storage (managed disks) instead of on the local VM disk. That can be suboptimal for some scenarios/users.

With normal SF clusters you don't need Azure Storage because the data is replicated and kept available within the cluster itself.

juho-hanhimaki · Feb 21 '21

@juho-hanhimaki My apologies. I’ve not come across this limitation in the docs; can you point me to it? Many thanks!

olitomlinson · Feb 21 '21

I don't know if there's actual documentation about this, but the managed clusters announcement blog post mentions that the storage is now based on managed disks instead of the temp disk.

https://techcommunity.microsoft.com/t5/azure-service-fabric/azure-service-fabric-managed-clusters-are-now-in-public-preview/ba-p/1721572

The topic has also come up during SF community Q&As.

juho-hanhimaki · Feb 21 '21

Thanks @juho-hanhimaki

I interpreted the blog post as additional support for Managed Disks.

But, yes, additional clarity would be nice here

olitomlinson · Feb 22 '21

Thank you for the feedback @juho-hanhimaki @olitomlinson

In the currently available preview, the data disks for Stateful Services on a managed cluster use only managed disks. We are working to enable support that will allow you to select a specific managed disk SKU in the near future.

I have added this work item to the backlog, and will update it when we have more information to share on support for using the VM temp disk for Stateful Services.
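
For anyone wondering what that selection might look like, here is a minimal Bicep sketch of a node type with the data disk settings, based on my reading of the managed cluster node type schema. The cluster name, VM size, image values and API version are placeholders, not a confirmed design:

```bicep
// Existing managed cluster the node type belongs to (name is a placeholder).
resource sfmc 'Microsoft.ServiceFabric/managedclusters@2022-01-01' existing = {
  name: 'mysfcluster'
}

// Secondary node type whose stateful data lands on a per-node managed data disk.
// API version is illustrative; check the current schema before using.
resource nt2 'Microsoft.ServiceFabric/managedclusters/nodetypes@2022-01-01' = {
  parent: sfmc
  name: 'NT2'
  properties: {
    isPrimary: false
    vmSize: 'Standard_D4s_v5'
    vmInstanceCount: 5
    vmImagePublisher: 'MicrosoftWindowsServer'
    vmImageOffer: 'WindowsServer'
    vmImageSku: '2019-Datacenter'
    vmImageVersion: 'latest'
    dataDiskSizeGB: 128         // size of the per-node managed data disk
    dataDiskType: 'Premium_LRS' // disk SKU selection: Standard_LRS, StandardSSD_LRS or Premium_LRS
  }
}
```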

peterpogorski · Feb 22 '21

Thanks @peterpogorski

A couple of further questions:

  1. Do you have data to share on the performance difference between VM temp disk vs Managed Storage?

As @juho-hanhimaki mentioned, the benefits of local data performance in the cluster are huge and are pretty much one of the biggest attractions/differentiators against other orchestrators.

Assuming there is a significant difference in latency here, and that it is within the tolerance of most customers, does this mean that read/write latency of state is no longer a killer feature for Service Fabric going forward?

  2. For stateful services, am I right in assuming that each node has its own Managed Disk? If that's the case, does that mean that E2E latency is now impacted two-fold:
  • Time to achieve quorum of writes (as per non-managed cluster) across the cluster
  • Time for durable / 2-phase commit writes inside of the Managed Disk?
  3. Are you placing any future bets here on disaggregated architecture (offered as Managed Disks) providing a comparable level of performance to local disk? As per the Mark Russinovich demo at Ignite 2020?

If so, I could understand the move towards Managed Disks being able to satisfy the local performance requirements that we expect with the traditional Service Fabric temp disk model.

  4. More important than any of the other questions: from a customer persona perspective, where is the Service Fabric proposition heading? What are the key differentiators that will set it apart from, say, k8s + Dapr, over the next few years?

olitomlinson · Feb 22 '21

@craftyhouse what's the plan to support ephemeral disks in SFMC?

abatishchev · Oct 04 '21

What will the migration scenario look like once temp disk support is available?

JohnNilsson · Oct 22 '21

We are working to add support for temp disks on stateless node types. The migration will require adding a new node type and moving the workload over.

We haven't seen any concerns from customers since launch around performance or latency of the managed disk, but we would love to hear more if there has been any impact there. A lot of what makes SFMC work for stateful workloads relies on managed data disks, and we will continue to expand those options too.
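
To make the stateless node type idea concrete, a hedged Bicep sketch, assuming the isStateless and useTempDataDisk properties exposed by the newer preview API versions (names, sizes and API version here are illustrative and unverified):

```bicep
resource sfmc 'Microsoft.ServiceFabric/managedclusters@2022-08-01-preview' existing = {
  name: 'mysfcluster'
}

// Stateless-only node type that uses the local VM temp disk instead of an attached managed data disk.
resource ntStateless 'Microsoft.ServiceFabric/managedclusters/nodetypes@2022-08-01-preview' = {
  parent: sfmc
  name: 'NTstateless'
  properties: {
    isPrimary: false
    isStateless: true          // only stateless workloads can be placed on this node type
    useTempDataDisk: true      // keep the Service Fabric data root on the temp disk
    vmSize: 'Standard_D2ds_v5' // a SKU that actually has a temp disk
    vmInstanceCount: 5
    vmImagePublisher: 'MicrosoftWindowsServer'
    vmImageOffer: 'WindowsServer'
    vmImageSku: '2019-Datacenter'
    vmImageVersion: 'latest'
  }
}
```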

craftyhouse · Nov 09 '21

I'm actually more concerned about the additional cost of using managed disks.

How is SFMC different from plain old SF when it comes to the stateful workloads?

JohnNilsson · Nov 15 '21

I should add: it was the disk operations cost from using Standard SSD disks that caught us by surprise. It looks like we should be able to get that cost under control by switching to Premium SSD, in which case we can probably live with running our stateful workloads on managed disks.

I suppose support for Ephemeral OS disks is still interesting though.

JohnNilsson · Nov 16 '21

There is no difference in how the Service Fabric runtime handles stateful workloads whether they are deployed using classic or managed clusters. By using managed disks, SFMC is able to benefit customers by:

  • mitigating the VM-down scenario, where we can fully replace the underlying VM and attach the existing data disk without data loss and without customer intervention
  • safely speeding up reboot operations by eliminating the need to rehydrate other nodes with stateful data for most scenarios
  • safely supporting a lower number of nodes for primary and secondary node types, especially in the case of a zone-resilient cluster
  • flexibility in disk sizing and performance separate from the VM SKU, which aligns with the disaggregated architecture pattern

Hope that helps

craftyhouse · Nov 17 '21

On the topic of encryption at rest

Will SFMC support for temp disks work with encrypted disks?

Encryption at host requires enabling an EncryptionAtHost feature flag in the subscription. Is there a good reason not to enable this flag in the subscription?

JohnNilsson · Dec 08 '21

Having stateful services use temporary disks is still a top priority for us.

We can't afford to pay extra for managed disks (assuming poor IOPS/$).

As I understand it, managed disks also hurt overall reliability since they are LRS. If a single availability zone is impacted, data disks can be down even if the cluster VMs are up in the two other zones. Having VMs up with data disks down makes the whole cluster useless for any stateful workloads.

Managed clusters are not very interesting to us until temp disk for data and ephemeral OS disk are supported. We don't want to depend on any external disk service. VMs have all the resources locally and Service Fabric coordinates services/replication. Maximum performance, cost efficiency and availability.

Backups and long term storage (historical IoT data) can be done to external ZRS storage. Service availability is not dependent on external storage.

juho-hanhimaki · Dec 08 '21

Please direct questions to one of the forums documented here https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-support#post-a-question-to-microsoft-qa

In short, as long as the restrictions do not apply to your scenario, we recommend using host encryption as called out in the documentation. Restrictions: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-enable-host-based-encryption-portal#restrictions

SFMC docs: https://docs.microsoft.com/en-us/azure/service-fabric/how-to-managed-cluster-enable-disk-encryption?tabs=azure-powershell#enable-encryption-at-host-preview
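
Roughly, the flow in those docs is: register the EncryptionAtHost feature on the subscription (az feature register --namespace Microsoft.Compute --name EncryptionAtHost), then set the corresponding property on the node type. A hedged Bicep sketch, assuming the enableEncryptionAtHost property described in the linked SFMC doc (other values and the API version are placeholders):

```bicep
resource sfmc 'Microsoft.ServiceFabric/managedclusters@2022-01-01' existing = {
  name: 'mysfcluster'
}

// Node type with host-based encryption, which also covers temp disks and disk caches at rest.
resource nt1 'Microsoft.ServiceFabric/managedclusters/nodetypes@2022-01-01' = {
  parent: sfmc
  name: 'NT1'
  properties: {
    isPrimary: true
    enableEncryptionAtHost: true // requires the EncryptionAtHost feature registered on the subscription
    vmSize: 'Standard_D4s_v5'    // the VM size must support encryption at host
    vmInstanceCount: 5
    vmImagePublisher: 'MicrosoftWindowsServer'
    vmImageOffer: 'WindowsServer'
    vmImageSku: '2019-Datacenter'
    vmImageVersion: 'latest'
    dataDiskSizeGB: 128
    dataDiskType: 'StandardSSD_LRS'
  }
}
```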

craftyhouse · Dec 08 '21

> Having stateful services use temporary disks is still a top priority for us.

Classic is the current supported path for this, but we hear you that you want to use managed clusters, and we will continue to think about ways to address this.

> We can't afford to pay extra for managed disks (assuming poor IOPS/$).

I did some rough math of VMSS SKUs with and without temp disk per month using the pricing calculator. I'd be glad to discuss more offline if it would be helpful. In summary, there are newer SKUs, now GA, that are cheaper than what was previously available because they do not have a temp disk.

Example VM SKU with temp disk (100 GB): Dv2 series, D2 v2, 2 CPU, 7 GB of RAM = ~$184

Dv5 series, D2 v5, 2 CPU, 8 GB of RAM = ~$149, plus a 128 GiB Premium SSD managed disk = $18

~$167 total compared to $184

With Dv5 and a managed disk you get more RAM and storage, but you are correct that there is a lower ceiling for IOPS. We haven't heard feedback where this has come into play with real-world workloads yet, but if you have any data to share, that would be helpful.

> As I understand it, managed disks also hurt overall reliability since they are LRS. If a single availability zone is impacted, data disks can be down even if the cluster VMs are up in the two other zones. Having VMs up with data disks down makes the whole cluster useless for any stateful workloads.

Not sure I follow this concern about availability. Each VM has a managed disk, and if you have a zone-resilient cluster and a zone goes down (az01 goes down), az02/03 would still be up and fully operational, given that the VMs and disks are localized to each zone.

> Managed clusters are not very interesting to us until temp disk for data and ephemeral OS disk are supported. We don't want to depend on any external disk service. VMs have all the resources locally and Service Fabric coordinates services/replication. Maximum performance, cost efficiency and availability.

> Backups and long term storage (historical IoT data) can be done to external ZRS storage. Service availability is not dependent on external storage.

craftyhouse · Dec 08 '21

@craftyhouse

I asked the question earlier in this thread but I don’t think it ever got an answer.

“For stateful services, am I right in assuming that each node has its own Managed Disk?”

And now you’ve just confirmed with

“Each VM has a managed disk”

I strongly suspect that there is confusion here from people with experience of non-managed clusters, who have not yet understood that you still use one Managed Disk per Node. A Managed Disk is not shared across all nodes.

Are writes still performed and committed using the same quorum semantics as in non-managed clusters?

It might be worth making that point very explicit in the public documentation of Managed Clusters, as it’s never been clear to me until now.

I also asked the following question, but never got an answer

“Are you placing any future bets here on disaggregated architecture (offered as Managed Disks) providing a comparable level of performance to local disk? As per the Mark Russinovich demo at Ignite 2020? If so, I could understand the move towards Managed Disks being able to satisfy the local performance requirements that we expect with the traditional Service Fabric temp disk model.”

I would still hypothesise that the disaggregated architecture innovation would raise the performance ceiling of what is possible with Managed Disks…? Any noise coming from Azure on disaggregated architecture being utilised yet?

olitomlinson · Dec 09 '21

> @craftyhouse

> I asked the question earlier in this thread but I don’t think it ever got an answer.

> “For stateful services, am I right in assuming that each node has its own Managed Disk?”

> And now you’ve just confirmed with

> “Each VM has a managed disk”

> I strongly suspect that there is confusion here from people with experience of non-managed clusters, who have not yet understood that you still use one Managed Disk per Node. A Managed Disk is not shared across all nodes.

Ah, I see. I agree it would be helpful to show how we wire it up to help clarify, e.g. a diagram depicting the disk > VM > node type > cluster relationship.

In text form, I have a managed cluster with two node types, using NT2 as the example:

Node  - Managed disk
NT2_0 - NT2_NT2_1_OsDisk_1_170875e14848425c97cafa8ac9bacc94
NT2_1 - NT2_NT2_0_OsDisk_1_4a5e3308c7024ab9a96925d44150c835

As you can see, they are unique disks. We support creating and attaching many per VM (a...z basically) with the latest preview api.
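
In ARM/Bicep terms, as I understand the preview schema, those extra disks surface as an additionalDataDisks array on the node type. A hedged sketch (lun, letter, sizes and API version are illustrative and unverified):

```bicep
resource sfmc 'Microsoft.ServiceFabric/managedclusters@2022-08-01-preview' existing = {
  name: 'mysfcluster'
}

// Node type with the default data disk plus one extra managed disk attached per node.
resource nt2 'Microsoft.ServiceFabric/managedclusters/nodetypes@2022-08-01-preview' = {
  parent: sfmc
  name: 'NT2'
  properties: {
    isPrimary: false
    vmSize: 'Standard_D4s_v5'
    vmInstanceCount: 5
    vmImagePublisher: 'MicrosoftWindowsServer'
    vmImageOffer: 'WindowsServer'
    vmImageSku: '2019-Datacenter'
    vmImageVersion: 'latest'
    dataDiskSizeGB: 128
    dataDiskType: 'Premium_LRS'
    additionalDataDisks: [
      {
        lun: 2              // logical unit number on the VM; must not clash with the default data disk
        diskSizeGB: 256
        diskType: 'Premium_LRS'
        diskLetter: 'F'     // any letter except the reserved C and D
      }
    ]
  }
}
```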

> Are writes still performed and committed using the same quorum semantics as in non-managed clusters?

SFMC does not modify the way Service Fabric runtime behaves and leverages the exact same bits. The semantics that you are familiar with are still the same.

> It might be worth making that point very explicit in the public documentation of Managed Clusters, as it’s never been clear to me until now.

Ack :). Thank you

> I also asked the following question, but never got an answer

> “Are you placing any future bets here on disaggregated architecture (offered as Managed Disks) providing a comparable level of performance to local disk? As per the Mark Russinovich demo at Ignite 2020? If so, I could understand the move towards Managed Disks being able to satisfy the local performance requirements that we expect with the traditional Service Fabric temp disk model.”

> I would still hypothesise that the disaggregated architecture innovation would raise the performance ceiling of what is possible with Managed Disks…? Any noise coming from Azure on disaggregated architecture being utilised yet?

craftyhouse · Dec 09 '21

> I did some rough math of VMSS SKUs with and without temp disk per month using the pricing calculator. I'd be glad to discuss more offline if it would be helpful. In summary, there are newer SKUs, now GA, that are cheaper than what was previously available because they do not have a temp disk.

> Example VM SKU with temp disk (100 GB): Dv2 series, D2 v2, 2 CPU, 7 GB of RAM = ~$184

> Dv5 series, D2 v5, 2 CPU, 8 GB of RAM = ~$149, plus a 128 GiB Premium SSD managed disk = $18

> ~$167 total compared to $184

> With Dv5 and a managed disk you get more RAM and storage, but you are correct that there is a lower ceiling for IOPS. We haven't heard feedback where this has come into play with real-world workloads yet, but if you have any data to share, that would be helpful.

We haven't had the time to benchmark managed clusters yet, but the cost/perf disadvantage seems obvious.

SF: Standard_D2ds_v5, 2 CPU, 8 GB RAM, 75 GB temp disk (9000 IOPS, 125 MBps), ephemeral OS disk = ~$159

SF total per VM: $159

SFMC: Standard_D2s_v5, 2 CPU, 8 GB RAM, no temp disk, no ephemeral OS disk = ~$145

OS disk: Standard SSD E4 = ~$2.50

Data disk: Premium SSD P40 (7500 IOPS, 250 MBps) = $259

SFMC total per VM: $406.50

As you can see, SFMC costs over twice as much and still has fewer IOPS. Our workload requires a performant disk (a lot of IIoT devices constantly logging). SFMC makes no sense unless there is something obviously wrong with my napkin math.

The non-temp-disk variant of the v5 VM is 14 dollars cheaper, and 14 dollars is not enough for performant premium storage, only a low-end one.

Of course we could try and see what we get from the cheapish P10 (500 IOPS) disk, but my assumption would be that our ability to process data would be severely degraded compared to the v5 VM temp disk, and we would still pay a few extra dollars. Considering that the P10 disk itself is replicated storage, I'd find it really strange if it were on par with simple physical local storage on the VM.

juho-hanhimaki · Dec 09 '21

There is now a useEphemeralOSDisk property on SFMC: https://learn.microsoft.com/en-us/azure/service-fabric/how-to-managed-cluster-ephemeral-os-disks

The only problem is

https://learn.microsoft.com/en-us/dotnet/api/azure.resourcemanager.servicefabricmanagedclusters.servicefabricmanagednodetypedata.datadiskletter?view=azure-dotnet

Managed data disk letter. It can not use the reserved letter C or D and it can not change after created.

So we still need a managed disk for the stateful services :(
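
For reference, a hedged Bicep sketch of that combination as I read the linked docs: ephemeral OS disk enabled, while stateful data still sits on the per-node managed data disk whose letter is fixed at creation (values and API version are placeholders):

```bicep
resource sfmc 'Microsoft.ServiceFabric/managedclusters@2022-08-01-preview' existing = {
  name: 'mysfcluster'
}

// Ephemeral OS disk removes the remote-storage dependency for the OS,
// but stateful data still goes to the per-node managed data disk.
resource nt1 'Microsoft.ServiceFabric/managedclusters/nodetypes@2022-08-01-preview' = {
  parent: sfmc
  name: 'NT1'
  properties: {
    isPrimary: true
    useEphemeralOSDisk: true   // requires a VM size whose cache/temp disk can hold the OS image
    dataDiskLetter: 'S'        // cannot be the reserved C or D, and cannot change after creation
    dataDiskSizeGB: 128
    dataDiskType: 'Premium_LRS'
    vmSize: 'Standard_D8ds_v5' // a SKU with a local/cache disk, as ephemeral OS disks require
    vmInstanceCount: 5
    vmImagePublisher: 'MicrosoftWindowsServer'
    vmImageOffer: 'WindowsServer'
    vmImageSku: '2019-Datacenter'
    vmImageVersion: 'latest'
  }
}
```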

rfcdejong · Oct 25 '22