Are you using Azure Site Recovery (ASR) with Windows Server 2019 VMs? There is a good chance that you are unprotected right now and that a failover will end in a BSOD.
Over the last couple of months we were involved in multiple projects to modernize customer datacenters with Azure Stack HCI. On top of that, we enabled Azure features like Azure Backup and Azure Site Recovery (ASR). During failover tests we ran into a problem with Windows Server 2019 VMs.
Case 1: A regular ASR Test failover
At the beginning of February 2021 we completed a migration from a Hyper-V environment to a new Azure Stack HCI solution with Azure Site Recovery as the disaster recovery solution. All VMs (most of them running Windows Server 2019) were migrated to Azure Stack HCI. The business-critical VMs were protected with ASR. During the test failover all VMs booted successfully in Azure and were then tested by the customer for availability and functionality.
The failover test was a success, but because a core application was still being migrated, the customer scheduled an additional failover test to include the new application once that migration was finished. But more on that later.
Case 2: Test failover ends in an error
This case was very similar to the first one: a migration to an Azure Stack HCI environment with all business-critical VMs protected by ASR. Here only one VM (a domain controller) was running Windows Server 2019. The other VMs were running a mix of 2016, 2012 R2 and 2012.
Around the beginning of April 2021 we scheduled and executed a test failover to test ASR. After the test failover completed, all VMs booted successfully in Azure except one: the Windows Server 2019 VM kept booting into the Windows Boot Manager with an error.
After some research and several reboots, the VM seemed to boot into Windows but was then struck by a BSOD on WDF01000.sys. And we all know, that’s not good…
After the BSOD the VM booted into the Windows Boot Manager and was stuck there. The customer proposed to investigate the issue himself, so we left it at that and waited to proceed with the failover test until the customer had finished the investigation.
Back to Case 1
In the meantime the application had been migrated to new VMs and the customer requested an additional Azure Site Recovery failover test. This time it did not go well. After the test failover, all Windows Server 2019 VMs booted into the Windows Boot Manager with the same error code we had seen in our second case.
We noticed the same BSOD on WDF01000.sys. This time we could do the investigation ourselves, and we started by downloading the VHD to retrieve the memory.dmp file. The memory dump pointed to vmstorfl.sys, the Virtual Storage Filter Driver used on Windows VMs…
This had worked three months earlier, at the beginning of February, so something had broken it for almost all Windows Server 2019 VMs.
What has happened?
When looking at the environment, we noticed that the February cumulative update (KB4601345) had been installed after the successful failover test. The February update sets the OS version to 10.0.17763.1757. In that update the vmstorfl.sys file is not updated and stays at version 10.0.17763.771. In the May update the vmstorfl.sys file is updated to version 10.0.17763.1911, so the file did change. Unfortunately, there is no documentation available and no mention of this issue in the February update. The same applies to the March and April updates, and even the May update that fixed the issue does not mention it.
We did some testing of our own in a lab environment to rule out third-party antivirus, backup agents or any other software. We set up several VMs with a clean installation of Windows Server 2019 (1809). As soon as the February, March or April update was installed, the VMs experienced the BSOD.
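The version numbers above can be turned into a quick triage check. The following is only a minimal sketch in Python, not an official detection method: the function names and the three labels are hypothetical, the thresholds are simply the versions we observed (OS build 10.0.17763.1757 for the February update, vmstorfl.sys 10.0.17763.1911 for the fixed driver), and it assumes that every build from February onwards without the newer driver is affected.

```python
from typing import Tuple

# Thresholds taken from the versions observed in this post (assumptions,
# not values published by Microsoft as an official detection rule).
FEBRUARY_OS_BUILD = "10.0.17763.1757"     # OS version after KB4601345
FIXED_DRIVER_VERSION = "10.0.17763.1911"  # vmstorfl.sys after the May update


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '10.0.17763.771' into (10, 0, 17763, 771) for numeric comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def failover_risk(os_build: str, driver_version: str) -> str:
    """Classify a Windows Server 2019 VM (hypothetical helper).

    Returns 'fixed' when vmstorfl.sys is at or above the May version,
    'at risk' when the OS is at or above the February build but still
    carries the older driver, and 'not affected' otherwise.
    """
    if parse_version(driver_version) >= parse_version(FIXED_DRIVER_VERSION):
        return "fixed"
    if parse_version(os_build) >= parse_version(FEBRUARY_OS_BUILD):
        return "at risk"
    return "not affected"


if __name__ == "__main__":
    # A VM with the February update and the unchanged driver.
    print(failover_risk("10.0.17763.1757", "10.0.17763.771"))
```

The versions are compared as tuples of integers on purpose: a plain string comparison would rank "771" above "1911" and get the driver check wrong.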
Update as soon as possible and test your DR solution
Based on these two cases and our own lab testing, we found that something in the February update causes Windows Server 2019 VMs to bluescreen on failover to Azure.
With the May update (KB5003171) installed, this issue is fixed. So, to all companies out there depending on Azure Site Recovery as their datacenter disaster recovery solution: if you are using Windows Server 2019 VMs, make sure you install the May update as soon as possible. Otherwise your disaster recovery solution does not work at all!
In addition, issues like this prove once again that it is vital to test your DR solution from time to time. You probably didn’t know that for the past three months your DR solution with Azure Site Recovery was useless for your Windows Server 2019 VMs with the February update applied. And it stays useless until the May update is installed and synced.
If you need any help or have questions, please get in touch!