
Windows Update causes BSOD on Failover with Azure Site Recovery

Are you using Azure Site Recovery (ASR) and Windows Server 2019 VMs? There is a big chance that you are unprotected right now and that a failover will end in a BSOD.

Over the last couple of months we have been involved in multiple projects to modernize customer datacenters with Azure Stack HCI, and on top of that to enable Azure features like Azure Backup and Azure Site Recovery (ASR). During failover tests we ran into a problem with Windows Server 2019 VMs.

Case 1: A regular ASR Test failover

At the beginning of February 2021 we completed a migration from a Hyper-V environment to a new Azure Stack HCI solution with Azure Site Recovery as the disaster recovery solution. All VMs (most of them running Windows Server 2019) were migrated to Azure Stack HCI, and the business-critical VMs were protected with ASR. During the test failover all VMs booted successfully in Azure and were then tested by the customer for availability and functionality.

The failover test was a success, but due to the migration of a core application, the customer scheduled an additional failover test to include the new application once that migration was finished. More on that later.

Case 2: A test failover ending in an error

This case was very similar to the first: a migration to an Azure Stack HCI environment with all business-critical VMs protected with ASR. Here, only one VM (a domain controller) was running Windows Server 2019. The other VMs were running a variety of Windows Server 2016, 2012 R2 and 2012.

Around the beginning of April 2021 we scheduled and executed a test failover to validate ASR. Almost all VMs booted successfully in Azure after the test failover was completed, except one: the Windows Server 2019 VM kept booting into the Windows Boot Manager with an error.

After some research and several reboots, the VM seemed to boot into Windows but was struck by a BSOD on WDF01000.sys. And we all know that's not good…

After the BSOD the VM booted into the Windows Boot Manager and got stuck there. In this case the customer offered to investigate the issue themselves, so we left it at that and waited to proceed with the failover test once their investigation was done.

Back to Case 1

In the meantime the application was migrated to new VMs and the customer requested an additional Azure Site Recovery failover test. This time things did not go that well: after the test failover, all Windows Server 2019 VMs booted into the Windows Boot Manager with the same error code as we had seen in our second case.

We noticed the same BSOD on WDF01000.sys. This time we could do the investigation ourselves and started downloading the VHD to retrieve the memory.dmp file. The memory dump pointed to vmstorfl.sys, the Virtual Storage Filter Driver used on Windows VMs…

This had worked three months earlier, at the beginning of February, so something had since broken it for almost all Windows Server 2019 VMs.

What has happened?

When looking at the environment we noticed that after the successful failover test the February cumulative update (KB4601345) had been installed. With the February update the OS version is set to 10.0.17763.1757. In that update the vmstorfl.sys file is not updated and stays at version 10.0.17763.771. In the May update vmstorfl.sys is updated to 10.0.17763.1911, so the file has changed. Unfortunately there is no documentation available and no mention of this issue in the February update. The same applies to the March and April updates, and even the May update that fixed the issue does not mention it.
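If you want to check where a guest stands, the quick sketch below reads the vmstorfl.sys file version and looks for the May cumulative update in the installed hotfix list. It is a minimal example, assuming it runs inside the Windows Server 2019 guest and that the pywin32 package is available; it is not part of any official tooling.

```python
# Minimal sketch: check the vmstorfl.sys version and whether KB5003171 is installed.
# Assumes it runs inside the Windows Server 2019 guest and that pywin32 is installed.
import subprocess

import win32api  # pywin32 -- an extra dependency, not part of the standard library

DRIVER = r"C:\Windows\System32\drivers\vmstorfl.sys"
FIXED_KB = "KB5003171"  # May 2021 cumulative update for Windows Server 2019


def file_version(path: str) -> str:
    """Return the file version of a PE file, e.g. '10.0.17763.1911'."""
    info = win32api.GetFileVersionInfo(path, "\\")
    ms, ls = info["FileVersionMS"], info["FileVersionLS"]
    return f"{ms >> 16}.{ms & 0xFFFF}.{ls >> 16}.{ls & 0xFFFF}"


def kb_installed(kb: str) -> bool:
    """Check the installed hotfix list (via 'wmic qfe') for a given KB number."""
    out = subprocess.run(["wmic", "qfe", "get", "HotFixID"],
                         capture_output=True, text=True, check=True).stdout
    return kb.upper() in out.upper()


if __name__ == "__main__":
    print(f"vmstorfl.sys version: {file_version(DRIVER)}")
    print(f"{FIXED_KB} installed:  {kb_installed(FIXED_KB)}")
```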

We did some testing of our own in a lab environment to rule out third-party antivirus, backup agents or any other software. We set up several VMs with a clean installation of Windows Server 2019 (1809). As soon as the February, March or April update was installed, the VMs experienced the BSOD.

Update as soon as possible and test your DR solution

Based on these two cases and our own lab testing, we found that something in the February update causes Windows Server 2019 VMs to bluescreen on failover to Azure.

With the May update (KB5003171) installed, the issue is fixed. So to all companies out there depending on Azure Site Recovery as their datacenter disaster recovery solution: if you are using Windows Server 2019 VMs, make sure you install the May update as soon as possible. Otherwise your disaster recovery solution does not work at all!

In addition, these kinds of issues prove once again that it is vital to test your DR solution from time to time. You probably didn't know that for the past three months your Azure Site Recovery DR solution was useless for Windows Server 2019 VMs with the February update applied. And it remains useless until the May update is installed and synced.

If you need any help or have questions, please get in touch!

Azure Stack HCI local vs stretched volume performance

Azure Stack HCI OS Stretched Clustering

One of the great new features in Azure Stack HCI OS is stretched clustering support for Storage Spaces Direct (S2D). With stretched clustering you can place S2D nodes in different sites for high availability and disaster recovery.

While Windows Server 2019 already gives you the ability to use stretched clustering, it was not possible with S2D-enabled hosts. With the arrival of Azure Stack HCI OS there is no holding back, and we can now benefit from stretched clustering for hyper-converged systems!

As you might have heard or read about here, Azure Stack HCI will move forward as a new operating system with integration with Azure. The new Azure Stack HCI OS brings lots of new features that we tried before and during public preview. It is important to understand that we did this testing with a preview version, released as Azure Stack HCI 20H1; performance on the GA version can be different.

Stretched clustering for HCI is a very welcome feature that a lot of customers have been requesting for a long time. But are there performance differences compared to single-site clusters? We were curious too and did some testing.

Stretched Volumes

Before we start testing, first a little bit of background info. When hyper-converged nodes are stretched across two sites, you have the ability to stretch the volumes across the sites. While it seems like there is only one volume, if you dive below the surface you will see multiple volumes. Only one volume, the primary volume, is accessible from both sites at the same time. The secondary volume in the other site is on standby and only receives changes from the primary volume. This is just like any other stretched or metro cluster solution. When disaster strikes, the primary volume goes offline and the replica is brought online in the other site. The VMs fail over and start, so the applications are accessible again.

When you create a stretched cluster and are ready to deploy volumes, you have two options: you can create either an asynchronous or a synchronous volume. More info on which option to choose follows in the next sections.

Asynchronous

With an asynchronous volume the system accepts a write on the primary volume and responds to the application with an acknowledgement as soon as it is written there. The system then tries to synchronize the change to the replica volume as fast as possible; the replication may finish milliseconds or seconds later. Depending on the amount of changes and the intervals of the system, we could lose a number of changes that have already been written to the primary volume but not yet to the replica volume in case of a failure of the primary site.

Synchronous

A volume that is set up as synchronous will respond to the application with an acknowledgement only after the write has been committed in both sites. The write is accepted by a node and copied to the other site; when both blocks have been written, the application receives an acknowledgement from the storage. When the primary site fails there is no data loss, since it is in sync with the secondary site.
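To make the difference concrete, here is a purely conceptual sketch of the two acknowledgement models. It is not the actual Storage Replica implementation; the classes and lists are illustrative stand-ins for the primary and secondary sites.

```python
# Conceptual sketch only: illustrates when the application gets its acknowledgement,
# not how Storage Replica actually works internally.
import queue
import threading


class SynchronousVolume:
    """The write is acknowledged only after both sites have the block (no data loss)."""

    def __init__(self, primary: list, secondary: list):
        self.primary, self.secondary = primary, secondary

    def write(self, block: bytes) -> str:
        self.primary.append(block)    # commit in site A
        self.secondary.append(block)  # replicate to site B before acknowledging
        return "ack"                  # the application waited for both sites


class AsynchronousVolume:
    """The write is acknowledged after the primary; replication catches up later."""

    def __init__(self, primary: list, secondary: list):
        self.primary, self.secondary = primary, secondary
        self._pending: queue.Queue = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, block: bytes) -> str:
        self.primary.append(block)    # commit in site A
        self._pending.put(block)      # replication happens in the background
        return "ack"                  # the application continues; site B may lag

    def _replicate(self) -> None:
        while True:                   # milliseconds or seconds behind the primary
            self.secondary.append(self._pending.get())


if __name__ == "__main__":
    site_a, site_b = [], []
    vol = SynchronousVolume(site_a, site_b)
    print(vol.write(b"block-1"), len(site_a), len(site_b))  # -> ack 1 1
```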

Topology

To give a better understanding of what our test setup looks like we provide some extra information.

In this case we have 4 servers that only contain flash drives. The servers are physically in the same rack but we simulated 2 sites based on 2 subnets. The primary site is called Amsterdam, the secondary site is called Utrecht.

In this setup the servers from both sites are in the same rack and the cable distance is only meters instead of several kilometers or miles, so there is no additional latency caused by the distance between the sites. That is important to keep in mind.

Both sites contain 2 servers and each server has:

– One volume that is not replicated to the other site but only between the nodes in the same site.
– One stretched synchronous volume
– One stretched asynchronous volume

Per server we have a total of 3 volumes and on each volume we deployed 10 VMs for testing.

Testing the setup

We use VMFleet and DiskSPD to test the performance of the volumes. With these tools we can quickly create a large number of VMs and disks to use for testing. Once the VMs are deployed, you can start the load tests on all the VMs simultaneously with a single command. During our tests we used the following test parameters (see the sketch after this list for how they translate into a DiskSPD run):

  • Outstanding IO: 16

  • Block size: 4k

  • Threads: 8

  • Write: 0% / 30% / 100%
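
As a rough illustration of how these parameters translate into an actual test run, the sketch below builds and launches a DiskSPD command line with the same block size, thread count, outstanding I/O and write percentages. VMFleet normally orchestrates this inside the fleet VMs for you; the target file path and the 60-second duration here are assumptions purely for illustration.

```python
# Hedged sketch: run DiskSPD with the parameters listed above.
# The target path and duration are illustrative assumptions; VMFleet normally
# drives DiskSPD inside the test VMs for you.
import subprocess


def run_diskspd(write_pct: int,
                target: str = r"C:\run\testfile.dat",
                duration_s: int = 60) -> str:
    cmd = [
        "diskspd.exe",
        "-b4K",              # block size: 4k
        "-t8",               # threads: 8
        "-o16",              # outstanding I/O per thread: 16
        f"-w{write_pct}",    # write percentage: 0, 30 or 100
        f"-d{duration_s}",   # test duration in seconds
        "-r",                # random I/O
        "-Sh",               # bypass software and hardware caching
        "-L",                # collect latency statistics
        target,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    for pct in (0, 30, 100):  # the three runs described above
        print(run_diskspd(pct))
```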

Local Volumes Tests

First we start testing with the local volumes and boot the 40 VMs (10 VMs per volume) that are deployed on them. Then we conduct the three tests with 0% writes, 30% writes and 100% writes. The results can be seen below.

Synchronous Stretched Volumes Test

Next, we tested the VMs that are deployed on the stretched volumes with synchronous replication. As before, we start only the 40 VMs deployed on these stretched volumes and run the same tests.

Asynchronous Stretched Volume Tests

For our last test we also use a stretched volume, but this time an asynchronous one. Again we use only the 40 VMs that are located on these volumes and run the same tests.

Conclusion

To wrap things up, we have put all our results from the tables above in a diagram. Now we can visualize the difference between the various types of volumes. As you can see in the diagram, there is almost no difference between the types of volumes when we only read data.

The difference starts to show when we begin writing data. The differences between the synchronous and asynchronous volumes and the local volumes are huge. Considering these systems are right next to each other, it will only get worse when there is, for example, 50 km of fiber between the sites.

Note: The tests above were conducted with a 4k block size, which is considered the most intensive size for the replication logs to keep up with. With an 8k or 16k block size, which are considered more regular workloads, there will be less difference between the local and replicated volumes.

Stretched clustering is a great way to improve availability for hyper-converged clusters, although the performance results on this preview build are not yet satisfying. It's good to test this in the early preview stages of Azure Stack HCI OS, so the product gets the improvements it needs before it reaches GA.

If you have any questions or want more information about Azure Stack HCI OS or stretched clustering, let us know! We are happy to assist!
