If you have worked with Azure in the past, you might have been aware that Azure didn’t have live migration for VMs hosted in Azure for a long time. This had an impact for customers in terms of VM up-time during host maintenance. You basically got emails, that the host your VMs were running is going into maintenance during a specific time, and you will have a possible outage. Microsoft Hyper-V, which is the Hypervisor in Azure, had Live Migration for a long time. Today, Microsoft revealed that they are using Live Migration in Azure since early 2018 to move virtual machines in cases of rack maintenance and software and BIOS updates, as well as hardware faults.
But Microsoft didn’t stop there, they made even better using Machine Learning. Predictive ML helps Microsoft to detect proactively failure and do failure predictions. And in case a hardware failure is predicted, Microsoft can move the virtual machines from that host without downtime, using live migration.
To further push the envelope on live migration, we knew we needed to look at the proactive use of these capabilities, based on good predictive signals. Using our deep fleet telemetry, we enabled machine learning (ML)-based failure predictions and tied them to automatic live migration for several hardware failure cases, including disk failures, IO latency, and CPU frequency anomalies.
We partnered with Microsoft Research (MSR) on building our ML models that predict failures with a high degree of accuracy before they occur. As a result, we’re able to live migrate workloads off “at-risk” machines before they ever show any signs of failing. This means VMs running on Azure can be more reliable than the underlying hardware.
Microsoft talks in a blog post more about Live Migration in Azure and goes more in details about the challenges and how live migration in Azure works. It is great to see Microsoft adding features to improve VM resiliency with features like live migration and machine learning technology.