Disaster Recovery and Business Continuity in the Microsoft Cloud

Written by aeadmin | Jul 12, 2016 4:31:56 PM

A disaster recovery and business continuity plan is critical to any organization’s IT operations. By designing and implementing a viable strategy, organizations can ensure systems remain available during planned or unplanned maintenance and can be recovered in the event of a system failure. Azure Site Recovery Services (ASR) allow organizations to extend their business continuity and disaster recovery strategies to the Microsoft Cloud.

This service is capable of orchestrating the replication and failover of virtual machines and physical servers. By leveraging the features of Azure Recovery Services, enterprises and small businesses alike can efficiently implement a resilient disaster recovery solution in the cloud.

The focus of this article is not to detail the step-by-step configuration of Azure Site Recovery Services, but rather to highlight various features and components that comprise the solution. The article will also reference some guidelines to consider when planning a deployment.

Site Recovery Scenarios

Azure Recovery Service is not limited to a single platform and can be implemented to protect mixed environments containing both physical and virtual resources. In a full-scale hybrid deployment, recovery services can span beyond the Microsoft Cloud and be configured to integrate with an organization’s secondary site or datacenter.

Below is a list of supported scenarios:

Protect VMware virtual machines: You can protect on-premises VMware virtual machines by replicating them to Azure or to a secondary datacenter.
Protect Hyper-V VMs: You can protect on-premises Hyper-V virtual machines in VMM clouds by replicating them to Azure or to a secondary datacenter. You can replicate Hyper-V VMs that aren't managed by VMM to Azure.
Protect physical servers to Azure: You can protect physical machines running Windows or Linux by replicating to Azure or to a secondary datacenter.
Migrate VMs: You can use Site Recovery to migrate Azure IaaS VMs between regions, or to migrate AWS Windows instances to Azure IaaS VMs.

How Azure Site Recovery Works

The Azure Site Recovery tools reside in Azure and remotely monitor servers in a datacenter on an ongoing basis. Agents are installed on-premises and a security key is applied to establish the connection between the two environments. ASR also features encryption capabilities to meet an organization’s security requirements.

Recovery plans, which reside in Azure, dictate how the recovery services respond in the event of an outage, such as which servers and services to restore first and how quickly. Recovery plans can be simple, with basic settings, or highly customized using PowerShell scripts. Unlike traditional disaster recovery environments, ASR allows recovery plans to be tested without disrupting the operational infrastructure. Testing is noninvasive and can be done without the cost, complexity, and downtime of a traditional DR test.

Although replicating to Azure requires the configuration of a Site Recovery vault, the failover task automatically provisions the VMs when initiated. Once a VM is replicated, the machine type and settings can be modified to the appropriate size(s) required for running the workload. This architecture provides a cost benefit, since the fees associated with the resources required to run an active VM are not incurred until failover. There are additional cost savings on licensing fees for Microsoft workloads through the DR benefits covered under Microsoft's Software Assurance (SA). For each licensed instance a customer runs, SA allows one instance of the software to run on a backup server for disaster recovery.

Additional Features and Benefits:

First 31 days are free, which allows time for testing or for migrating workloads to Azure
Pay as you go - No upfront hardware to purchase
Eliminates the need for a secondary datacenter
Scalable resources are readily available
Flexible, “single-click” recovery plans
Support for Hyper-V, VMware and physical servers
Near-synchronous replication with RPO as low as 30 seconds
App-consistent snapshots for single or N-tier applications
Integration with SQL Server AlwaysOn

Site Recovery can replicate most applications running on VMs and physical servers. A full summary of the supported apps can be found using the following link: What workloads can Azure Site Recovery protect?

SLA Summary

Microsoft guarantees 99.9% availability of the Site Recovery service for each protected instance configured for On-Premises-to-On-Premises failover. For each protected instance configured for On-Premises-to-Azure planned and unplanned failover, Microsoft guarantees a four-hour Recovery Time Objective for unencrypted protected instances and a six-hour Recovery Time Objective for encrypted protected instances, depending on the size of the protected instance.

The Azure Portal

Microsoft Azure is changing rapidly, and services are updated on a regular basis. As mentioned in a previous blog post, “Getting Started with Azure,” there are currently two separate web portals for managing Azure services: “Classic” and the new “Preview Portal”. Some features are still only available in the classic portal; however, they are gradually being incorporated into the new portal. Services and workloads deployed in the new preview portal are said to run in Azure Resource Manager (ARM) mode.

Initially, Azure Site Recovery was only supported as a service within the classic portal. Today, recovery services are fully supported in the new portal. Microsoft recommends any new deployments be configured in the Preview Portal / ARM mode. Note that the feature is named “Site Recovery Vaults” in the classic portal and “Recovery Services Vaults” in the ARM mode. (Fig 1)

Fig 1: Recovery Services Vaults (New Portal) & Site Recovery Vaults (Classic)

Planning a Deployment

Prior to deploying Azure Site Recovery services to protect a workload, it is important to understand the configuration of the source environment. Network and system workflow diagrams are essential to the planning process. Any system dependencies (internal or external) should be documented. It should also be determined whether the workload can temporarily operate in a degraded state without access to these dependencies.

As part of any disaster recovery plan, the Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) need to be established based on business requirements and compared to Microsoft’s SLA for recovery services. In some cases, utilizing high availability features at the application level (such as SQL Always On) may be the best approach to meet business requirements.
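
The comparison between a business RTO and the SLA can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and constant names are hypothetical and not part of any Azure SDK, and the four-hour and six-hour values are simply the On-Premises-to-Azure figures quoted in the SLA summary above.

```python
# Hypothetical helper, not an Azure API. The constants mirror the
# On-Premises-to-Azure RTO figures quoted in this article's SLA summary.
SLA_RTO_HOURS_UNENCRYPTED = 4
SLA_RTO_HOURS_ENCRYPTED = 6


def sla_covers_rto(required_rto_hours: float, encrypted: bool) -> bool:
    """Return True when the SLA's RTO is at least as strict as the business RTO."""
    sla_rto = SLA_RTO_HOURS_ENCRYPTED if encrypted else SLA_RTO_HOURS_UNENCRYPTED
    return sla_rto <= required_rto_hours


if __name__ == "__main__":
    # A workload that must be back within 2 hours is not covered by the SLA alone.
    print(sla_covers_rto(2, encrypted=False))
```

When the check returns False, the SLA alone does not satisfy the requirement, which is the situation where application-level high availability may be the better fit.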

Microsoft has various tools available to assist with assessing the source environment and planning the deployment. Links to the tools are found below:

Azure Recovery Capacity Planning Tool:
https://gallery.technet.microsoft.com/Azure-Recovery-Capacity-d01dc40e
Microsoft Assessment and Planning Toolkit:
https://www.microsoft.com/en-us/download/details.aspx?id=7826
Hyper-V Replica Capacity Planning Tool:
https://www.microsoft.com/en-us/download/details.aspx?id=39057

Read more about capacity planning for Site Recovery.

Resource Groups

A resource group is a container that holds related resources for an application such as Virtual Networks, VMs, and Storage Accounts. (Fig 2) The resource group could include all of the resources for an application, or only those resources that are logically grouped together. You can decide how you want to allocate resources to resource groups based on what makes the most sense for your organization.

Fig 2: Resource Group

During the initial setup of a Recovery Vault, the option to deploy the vault in a new or existing resource group is available. Although the resource group can be configured at this time, it is a recommended best practice to provision the resources prior to deploying the recovery vault. This ensures the resources are configured properly to support the service and allows for integration with the rest of the environment, both on-premises and in Azure.

It is important to note that the Recovery Services are location specific and can only interconnect with resources deployed in the same region.

Example: A recovery vault deployed in Central US region will not be able to connect to a network or storage resource that has been deployed in East US region.

Similarly, resources deployed in the classic portal cannot be interconnected with resources deployed in the new portal.

Example: Azure Recovery Vault created in the new portal cannot be configured to use a network provisioned in the classic portal even if it is deployed in the same region.
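
The two constraints above (same region, same deployment model) can be expressed as a simple pre-deployment check. The sketch below is illustrative only: the AzureResource class, its fields, and the "classic"/"arm" labels are hypothetical names chosen for this example, not objects from an Azure library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AzureResource:
    """Hypothetical description of a vault, network, or storage account."""
    name: str
    region: str            # e.g. "Central US"
    deployment_model: str  # "classic" or "arm" (labels assumed for this sketch)


def can_interconnect(vault: AzureResource, resource: AzureResource) -> bool:
    """Both rules from the article: same region AND same deployment model."""
    same_region = vault.region.strip().lower() == resource.region.strip().lower()
    same_model = vault.deployment_model == resource.deployment_model
    return same_region and same_model


if __name__ == "__main__":
    vault = AzureResource("dr-vault", "Central US", "arm")
    east_storage = AzureResource("drstorage", "East US", "arm")
    classic_vnet = AzureResource("dr-vnet", "Central US", "classic")
    print(can_interconnect(vault, east_storage))  # different region
    print(can_interconnect(vault, classic_vnet))  # different deployment model
```

Running a check like this against an inventory before creating the vault helps catch mismatched resources early.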

There are two primary resource types required for recovery services: a storage account and a virtual network.

Storage Account

A Geo Redundant Storage (GRS) account is a requirement for deploying Azure Recovery Services. GRS replicates data to a secondary region that is hundreds of miles away from the primary region, allowing the data to be durable even in the case of a complete regional outage or a disaster in which the primary region is not recoverable. For a storage account with GRS enabled, an update is first committed to the primary region, where it is replicated three times. Then the update is replicated to the secondary region, where it is also replicated three times, across separate fault domains and upgrade domains. Note that the secondary region is automatically determined based on the primary region.

Fig 3. GRS Storage Account

Virtual Network (VNet)

Networking resources allow connectivity to a workload once it has been failed over to Azure. On-premises systems can access the Azure VMs via a site-to-site VPN connection deployed in the Azure network. By implementing a separate, stand-alone network, external connections can be made for testing workload failover with no impact on production resources. Depending on the architecture, the deployment of a virtual network may require the creation of various objects. Fig 4 illustrates the network objects that were created to define the on-premises and Azure virtual networks.

Fig 4: Network objects in Resource Group

When configuring a VPN in Azure, it is important to note the differences between the two options available.

Static Routing VPN = Policy Based VPN
Dynamic Routing VPN = Route Based VPN
Static Routing VPNs require a Static VPN Gateway and do NOT support Multi-site VPN, VNet to VNet and Point-to-Site VPN.
Dynamic Routing VPNs require a Dynamic VPN Gateway and fully support Multi-site VPN, VNet to VNet and Point-to-Site VPN options.
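
The gateway differences listed above can be captured in a small lookup table. This is a minimal sketch: the dictionary, feature labels, and function name are hypothetical, and the table encodes only the three features named in this article.

```python
# Feature labels are assumptions for this sketch; they cover only the three
# options named in the article.
FEATURES = ("multi-site", "vnet-to-vnet", "point-to-site")

# Static (policy-based) gateways support none of the three features;
# dynamic (route-based) gateways support all of them.
GATEWAY_SUPPORT = {
    "static": frozenset(),
    "dynamic": frozenset(FEATURES),
}


def gateways_supporting(required_features):
    """Return the gateway types that support every requested feature."""
    required = set(required_features)
    unknown = required - set(FEATURES)
    if unknown:
        raise ValueError(f"Unknown feature(s): {sorted(unknown)}")
    return sorted(
        gateway
        for gateway, supported in GATEWAY_SUPPORT.items()
        if required <= supported
    )


if __name__ == "__main__":
    print(gateways_supporting(["point-to-site"]))
```

If a design needs any of the three features, only the dynamic gateway type remains a candidate.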

Replication Policies and Recovery Plans

Replication policies need to be created and applied to replication jobs to specify copy frequency and recovery point retention. (Fig 5) The policies can be saved and reapplied to jobs with similar requirements. There are currently three options for the replication frequency interval: 30 seconds, 5 minutes, or 15 minutes.

Fig 5 – Replication Policy Settings
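
To make the policy settings concrete, the sketch below models a replication policy and rejects copy frequencies other than the three options listed above. The class and field names are hypothetical, and expressing retention in hours is an assumption made for this example; it is not the ASR API.

```python
from dataclasses import dataclass

# The three replication frequency options named in the article, in seconds.
ALLOWED_COPY_FREQUENCIES_SECONDS = (30, 300, 900)


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical model of a replication policy (not an Azure SDK type)."""
    name: str
    copy_frequency_seconds: int
    recovery_point_retention_hours: int  # unit assumed for illustration

    def __post_init__(self):
        if self.copy_frequency_seconds not in ALLOWED_COPY_FREQUENCIES_SECONDS:
            raise ValueError(
                f"Copy frequency must be one of {ALLOWED_COPY_FREQUENCIES_SECONDS} seconds"
            )
        if self.recovery_point_retention_hours < 0:
            raise ValueError("Retention cannot be negative")


if __name__ == "__main__":
    policy = ReplicationPolicy("tier-1", copy_frequency_seconds=30,
                               recovery_point_retention_hours=24)
    print(policy)
```

Because the object is immutable, the same policy can be safely reused across jobs with similar requirements.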

Recovery plans consist of one or more ordered groups that contain virtual machines or physical servers that need to fail over together. Recovery plans do the following:

Define groups of machines that fail over and then start up together.
Model dependencies between machines by grouping them together in a recovery plan group. For example, if you want to fail over and bring up a specific application you would group the virtual machines for that application in the same recovery plan group.
Automate and extend failover. You can run a test, planned, or unplanned failover on a recovery plan. You can customize recovery plans with scripts, Azure automation, and manual actions.
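
The idea of ordered groups can be sketched as a small data model. The class names, machine names, and group order below are hypothetical examples rather than ASR objects; the sketch only shows how grouping expresses which machines start together and in what sequence.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RecoveryGroup:
    """Machines in one group fail over and start up together."""
    name: str
    machines: List[str]


@dataclass
class RecoveryPlan:
    """Hypothetical model: groups are processed in list order."""
    name: str
    groups: List[RecoveryGroup] = field(default_factory=list)

    def failover_order(self) -> List[Tuple[int, str, List[str]]]:
        """Return (step, group name, machines) in the order groups start."""
        return [
            (step, group.name, list(group.machines))
            for step, group in enumerate(self.groups, start=1)
        ]


if __name__ == "__main__":
    plan = RecoveryPlan("three-tier-app", [
        RecoveryGroup("database", ["sql01"]),
        RecoveryGroup("application", ["app01", "app02"]),
        RecoveryGroup("web", ["web01"]),
    ])
    for step, group_name, machines in plan.failover_order():
        print(step, group_name, machines)
```

Placing the database tier first reflects a typical dependency order, where later tiers depend on earlier ones being available.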

Service Limits

Every Azure subscription comes with a set of default limits on cores, cloud services, etc. It is recommended to run a test failover to validate the availability of resources in a subscription. Limits can be modified via Azure support. The table below highlights standard limits with Azure Recovery Services.

LIMIT IDENTIFIER	DEFAULT LIMIT
Number of vaults per subscription	25
Number of servers per Azure vault	250
Number of protection groups per Azure vault	No limit
Number of recovery plans per Azure vault	No limit
Number of servers per protection group	No limit
Number of servers per recovery plan	50

Table 1: ASR Service Limits
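
As a rough planning aid, the limits in Table 1 can be turned into a quick sizing calculation. The function below is a sketch using only the default limits from the table; the function name and the returned keys are hypothetical, and actual limits can be raised via Azure support as noted above.

```python
import math

# Default limits taken from Table 1 of this article.
MAX_VAULTS_PER_SUBSCRIPTION = 25
MAX_SERVERS_PER_VAULT = 250
MAX_SERVERS_PER_RECOVERY_PLAN = 50


def minimum_layout(server_count: int) -> dict:
    """Estimate the fewest vaults and recovery plans needed under default limits."""
    if server_count < 0:
        raise ValueError("server_count cannot be negative")
    vaults = math.ceil(server_count / MAX_SERVERS_PER_VAULT)
    # 250 is an exact multiple of 50, so filling vaults completely does not
    # force any extra recovery plans beyond this total.
    plans = math.ceil(server_count / MAX_SERVERS_PER_RECOVERY_PLAN)
    return {
        "vaults": vaults,
        "recovery_plans": plans,
        "within_default_vault_limit": vaults <= MAX_VAULTS_PER_SUBSCRIPTION,
    }


if __name__ == "__main__":
    print(minimum_layout(600))
```

In practice, recovery plans are grouped by application rather than packed to the limit, so the real number of plans is usually higher than this minimum.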

Recovery Service Optimization

The following guidelines can help optimize and scale an ASR deployment:

Operating system volume size: The operating system disk of a VM replicating to Azure must be less than 1 TB. Additional volumes can be manually moved to a different disk prior to deployment.
Data disk size: VMs being replicated to Azure can have up to 32 data disks, each with a maximum size of 1 TB. Replication and failover tasks can be completed on a ~32 TB virtual machine.
Recovery plan limits: Site Recovery can scale to thousands of virtual machines. Recovery plans are designed as a model for applications that should fail over together and are limited to 50 servers per recovery plan.
Replication bandwidth: to address issues with replication bandwidth note that…
- ExpressRoute: Site Recovery works with Azure ExpressRoute and WAN optimizers such as Riverbed. Read more about ExpressRoute.
- Replication traffic: Site Recovery performs a smart initial replication using only data blocks and not the entire VHD. Only changes are replicated during ongoing replication.
- Network traffic: You can control network traffic used for replication by setting up Windows QoS with a policy based on the destination IP address and port. In addition, if you're replicating to Azure Site Recovery using the Azure Backup agent, you can configure throttling for that agent. Read more.
Recovery Time Objective (RTO): To measure the RTO you can expect with Site Recovery, it is recommended to conduct a series of test failover jobs and view the Site Recovery logs to analyze how much time it takes to complete the operations. For the best RTO when failing over into Azure, all manual actions should be automated by integrating with Azure automation and recovery plans.
Recovery Point Objective (RPO): Site Recovery supports a near-synchronous RPO when you replicate to Azure, assuming sufficient bandwidth between the datacenter and Azure.
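
The disk guidelines above lend themselves to a simple pre-flight check on each source machine. The sketch below is an illustration only: the function name is hypothetical, and treating 1 TB as 1024 GB is an assumption made for this example.

```python
# Assumption for this sketch: 1 TB is treated as 1024 GB.
ONE_TB_IN_GB = 1024
MAX_DATA_DISKS = 32


def replication_blockers(os_disk_gb, data_disks_gb):
    """Return a list of reasons a machine breaks the disk guidelines (empty if none)."""
    blockers = []
    # The OS disk must be strictly smaller than 1 TB.
    if os_disk_gb >= ONE_TB_IN_GB:
        blockers.append("OS disk must be smaller than 1 TB")
    # No more than 32 data disks per machine.
    if len(data_disks_gb) > MAX_DATA_DISKS:
        blockers.append(f"More than {MAX_DATA_DISKS} data disks")
    # Each data disk can be at most 1 TB.
    oversized = [i for i, size in enumerate(data_disks_gb) if size > ONE_TB_IN_GB]
    if oversized:
        blockers.append(f"{len(oversized)} data disk(s) larger than 1 TB")
    return blockers


if __name__ == "__main__":
    print(replication_blockers(127, [512, 1024]))
    print(replication_blockers(2048, [2048]))
```

An empty list means the machine passes these particular guidelines; it does not replace a test failover.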

Summary

Azure Site Recovery is a powerful solution offering a wide range of options to ensure an organization’s IT systems are highly available and protected in the event of a disaster. Understanding an organization’s business continuity requirements along with the various components and features of Azure Recovery Service will assist in planning a resilient DR plan.

Written and composed by our Microsoft Systems Engineer, Raul R. Perez II of AdaptivEdge

References
https://azure.microsoft.com/en-us/documentation/articles/site-recovery-best-practices/
https://azure.microsoft.com/en-us/support/legal/sla/site-recovery/v1_0/
https://azure.microsoft.com/en-us/documentation/articles/site-recovery-create-recovery-plans/
https://azure.microsoft.com/en-us/documentation/articles/site-recovery-workload/
https://blogs.msdn.microsoft.com/rslaten/2014/12/08/static-vs-dynamic-routing-gateways-in-azure/
