A disaster recovery and business continuity plan is critical to any organization’s IT operations. By designing and implementing a viable strategy, organizations can ensure systems remain available during planned or unplanned maintenance and can be recovered in the event of a system failure. Azure Site Recovery Services (ASR) allow organizations to extend their business continuity and disaster recovery strategies to the Microsoft Cloud.
This service is capable of orchestrating the replication and failover of virtual machines and physical servers. By leveraging the features of Azure Recovery Services, enterprises and small businesses alike can efficiently implement a resilient disaster recovery solution in the cloud.
The focus of this article is not to detail the step-by-step configuration of Azure Site Recovery Services, but rather to highlight various features and components that comprise the solution. The article will also reference some guidelines to consider when planning a deployment.
Azure Recovery Service is not limited to a single platform and can be implemented to protect mixed environments containing both physical and virtual resources. In a full-scale hybrid deployment, recovery services can span beyond the Microsoft Cloud and be configured to integrate with an organization’s secondary site or datacenter.
Below is a list of supported scenarios:
The Azure Site Recovery tools reside in Azure and remotely monitor servers in a datacenter on an ongoing basis. Agents are installed on-premises and a security key is applied to establish the connection between the two environments. ASR also features encryption capabilities to meet an organization’s security requirements.
Recovery plans, which reside in Azure, dictate how the recovery services respond in the event of an outage, such as which servers and services to restore first and how quickly. Recovery plans can be simple, with basic settings, or highly customized using PowerShell scripts. Unlike traditional disaster recovery environments, ASR allows recovery plans to be tested without disrupting the operational infrastructure. Testing is noninvasive and can be done without the cost, complexity, and downtime of a traditional DR test.
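One way to picture the ordering a recovery plan imposes is as a list of groups that are processed strictly in sequence. The sketch below is only an illustration of that idea; the `RecoveryGroup` class, the `failover_order` helper, and the machine names are hypothetical and are not part of any Azure SDK or of the ASR service itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryGroup:
    """Hypothetical model of one step in a plan: machines that fail over together."""
    name: str
    machines: List[str] = field(default_factory=list)


def failover_order(groups: List[RecoveryGroup]) -> List[str]:
    """Flatten a plan into the sequence in which machines would be brought up."""
    order: List[str] = []
    for group in groups:  # groups are processed strictly in list order
        order.extend(group.machines)
    return order


# Illustrative plan: restore the data tier first, then the application and web tiers.
plan = [
    RecoveryGroup("Group 1: data tier", ["sql-01"]),
    RecoveryGroup("Group 2: app tier", ["app-01", "app-02"]),
    RecoveryGroup("Group 3: web tier", ["web-01"]),
]
print(failover_order(plan))
```

Modeling the plan as data first makes it easier to review restore priorities with application owners before anything is configured in the portal.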
Although replicating to Azure requires the configuration of a Site Recovery vault, the failover task automatically provisions the VMs when initiated. Once a VM is replicated, the machine type and settings can be modified to the appropriate size(s) required for running the workload. This architecture provides a cost benefit because the fees associated with the resources required to run an active VM are not incurred until failover. There are additional cost savings on licensing fees for Microsoft workloads through the DR benefits covered under Microsoft's Software Assurance (SA). For each licensed instance a customer runs, SA allows one instance of the software to run on a backup server for disaster recovery.
Additional Features and Benefits:
SLA Summary
Microsoft guarantees 99.9% availability of the Site Recovery service for each protected instance configured for on-premises-to-on-premises failover. For each protected instance configured for on-premises-to-Azure planned and unplanned failover, Microsoft guarantees a four-hour Recovery Time Objective for unencrypted protected instances and a six-hour Recovery Time Objective for encrypted protected instances, depending on the size of the protected instance.
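To put the 99.9% figure in perspective, it can be converted into a downtime budget. The sketch below assumes a fixed 30-day month purely for illustration; the SLA document defines its own measurement window, so treat the result as an approximation. The function name is hypothetical.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted at the given availability level.

    Assumes a fixed-length month; the SLA's own measurement window may differ.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


# 99.9% is the availability figure quoted in the SLA summary.
print(f"{allowed_downtime_minutes(99.9):.1f} minutes per 30-day month")
```

Under that assumption, the budget works out to roughly 43 minutes of unavailability per month.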
Microsoft Azure is changing rapidly, and services are updated on a regular basis. As mentioned in a previous blog post, “Getting Started with Azure,” there are currently two separate web portals for managing Azure services: “Classic” and the new “Preview Portal”. Some features are still only available in the classic portal; however, they are gradually being incorporated into the new portal. Services and workloads deployed in the new preview portal are also referred to as Azure Resource Manager (ARM) mode.
Initially, Azure Site Recovery was only supported as a service within the classic portal. Today, recovery services are fully supported in the new portal. Microsoft recommends any new deployments be configured in the Preview Portal / ARM mode. Note that the feature is named “Site Recovery Vaults” in the classic portal and “Recovery Services Vaults” in the ARM mode. (Fig 1)
Fig 1: Recovery Services Vaults (New Portal) & Site Recovery Vaults (Classic)
Prior to deploying Azure Site Recovery services to protect a workload, it is important to understand the configuration of the source environment. Network and system workflow diagrams are essential to the planning process. Any system dependencies (internal or external) should be documented. It should also be determined whether the workload can temporarily operate in a degraded state without access to these dependencies.
As part of any disaster recovery plan, the Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) need to be established based on business requirements and compared to Microsoft’s SLA for recovery services. In some cases, utilizing high availability features at the application level (such as SQL Always On) may be the best approach to meet business requirements.
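The comparison between a business RTO and the SLA can be written down as a simple check. This is a minimal sketch that uses only the four-hour and six-hour figures quoted in the SLA summary; the `SLA_RTO_HOURS` table and `sla_meets_rto` helper are hypothetical names introduced for illustration.

```python
# RTO figures as quoted in the SLA summary (hours).
SLA_RTO_HOURS = {"unencrypted": 4, "encrypted": 6}


def sla_meets_rto(required_rto_hours: float, encrypted: bool) -> bool:
    """Return True when the SLA's guaranteed RTO is within the business RTO."""
    key = "encrypted" if encrypted else "unencrypted"
    return SLA_RTO_HOURS[key] <= required_rto_hours


# A workload that must be back within 4 hours:
print(sla_meets_rto(4, encrypted=False))  # SLA RTO of 4 h fits the requirement
print(sla_meets_rto(4, encrypted=True))   # SLA RTO of 6 h does not
```

When the check fails for a workload, that is a signal to consider application-level high availability instead of, or in addition to, site-level recovery.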
Microsoft has various tools available to assist with assessing the source environment and planning the deployment. Links to the tools are found below:
Read more about capacity planning for Site Recovery.
A resource group is a container that holds related resources for an application such as Virtual Networks, VMs, and Storage Accounts. (Fig 2) The resource group could include all of the resources for an application, or only those resources that are logically grouped together. You can decide how you want to allocate resources to resource groups based on what makes the most sense for your organization.
Fig 2: Resource Group
During the initial setup of a Recovery Vault, the option to deploy the vault in a new or existing resource group is available. Although the resource group can be configured at this time, it is a recommended best practice to provision the resources prior to deploying the recovery vault. This ensures the resources are configured properly to support the service and allows for integration with the rest of the environment, both on-premises and in Azure.
It is important to note that the Recovery Services are location specific and can only interconnect with resources deployed in the same region.
Example: A recovery vault deployed in the Central US region will not be able to connect to a network or storage resource that has been deployed in the East US region.
Similarly, resources deployed in the classic portal cannot be interconnected with resources deployed in the new portal.
Example: An Azure Recovery Vault created in the new portal cannot be configured to use a network provisioned in the classic portal, even if it is deployed in the same region.
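The two constraints above (same region, same deployment model) are easy to check against a resource inventory before creating a vault. The sketch below is a planning aid only; the `AzureResource` class and `can_interconnect` function are hypothetical and do not call any Azure API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AzureResource:
    """Hypothetical inventory record for a vault, network, or storage account."""
    name: str
    region: str            # e.g. "Central US"
    deployment_model: str  # "classic" or "arm"


def can_interconnect(vault: AzureResource, resource: AzureResource) -> bool:
    """A vault can use a resource only if region and deployment model both match."""
    return (
        vault.region == resource.region
        and vault.deployment_model == resource.deployment_model
    )


vault = AzureResource("dr-vault", "Central US", "arm")
east_storage = AzureResource("drstorage", "East US", "arm")
classic_network = AzureResource("dr-vnet-classic", "Central US", "classic")
arm_network = AzureResource("dr-vnet", "Central US", "arm")

for candidate in (east_storage, classic_network, arm_network):
    print(candidate.name, can_interconnect(vault, candidate))
```

Only `dr-vnet` passes, since it matches the vault on both region and deployment model.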
There are two primary resource types required for recovery services: Storage Accounts and Virtual Networks.
A Geo Redundant Storage (GRS) account is a requirement for deploying Azure Recovery Services. GRS replicates data to a secondary region that is hundreds of miles away from the primary region, allowing the data to remain durable even in the case of a complete regional outage or a disaster in which the primary region is not recoverable. For a storage account with GRS enabled, an update is first committed to the primary region, where it is replicated three times. Then the update is replicated to the secondary region, where it is also replicated three times, across separate fault domains and upgrade domains. Note that the secondary region is automatically determined based on the primary region.
Fig 3. GRS Storage Account
Networking resources allow connectivity to a workload once it has been failed over to Azure. On-premises systems can access the Azure VMs via a site-to-site VPN connection deployed in the Azure network. By implementing a separate, stand-alone network, external connections can be made for testing workload failover with no impact on production resources. Depending on the architecture, the deployment of a virtual network may require the creation of various objects. The figure below (Fig 4) illustrates network objects that were created to define on-premises and Azure virtual networks.
Fig 4: Network objects in Resource Group
When configuring a VPN in Azure, it is important to note the differences between the two gateway options available: static routing and dynamic routing.
Replication policies need to be created and applied to replication jobs to specify copy frequency and recovery point retention. (Fig 5) The policies can be saved and reapplied to jobs with similar requirements. There are currently three options available for the replication frequency interval: 30 seconds, 5 minutes, or 15 minutes.
Fig 5 – Replication Policy Settings
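When choosing among the three copy frequencies, one rough approach is to pick the least aggressive interval that still fits the target RPO. The sketch below assumes, as a simplification, that the worst-case data loss window is about one copy interval (it ignores transfer time and backlog). The constant and function names are hypothetical.

```python
# The three copy-frequency options listed for replication policies, in seconds.
ALLOWED_COPY_FREQUENCIES_SECONDS = (30, 300, 900)


def slowest_frequency_for_rpo(target_rpo_seconds: int) -> int:
    """Pick the longest allowed copy interval that does not exceed the target RPO.

    Simplifying assumption: worst-case data loss is roughly one copy interval.
    """
    candidates = [f for f in ALLOWED_COPY_FREQUENCIES_SECONDS if f <= target_rpo_seconds]
    if not candidates:
        raise ValueError(
            f"No available copy frequency satisfies an RPO of {target_rpo_seconds} seconds"
        )
    return max(candidates)


print(slowest_frequency_for_rpo(600))  # a 10-minute RPO fits the 5-minute interval
print(slowest_frequency_for_rpo(900))  # a 15-minute RPO fits the 15-minute interval
```

A longer interval generally means less replication traffic, so it is worth not choosing a more aggressive setting than the business requirement calls for.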
Recovery plans consist of one or more ordered groups that contain virtual machines or physical servers that need to fail over together. Recovery plans do the following:
Every Azure subscription comes with a set of default limits on cores, cloud services, etc. It is recommended to run a test failover to validate the availability of resources in a subscription. Limits can be modified via Azure support. The table below highlights standard limits with Azure Recovery Services.
| Limit identifier | Default limit |
| --- | --- |
| Number of vaults per subscription | 25 |
| Number of servers per Azure vault | 250 |
| Number of protection groups per Azure vault | No limit |
| Number of recovery plans per Azure vault | No limit |
| Number of servers per protection group | No limit |
| Number of servers per recovery plan | 50 |
Table 1: ASR Service Limits
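The numeric limits in Table 1 can be checked against a planned deployment before anything is provisioned. This is a minimal sketch: the dictionary keys and the `limit_violations` helper are hypothetical names, and the values are the defaults from the table, which may have been raised for a given subscription via Azure support.

```python
from typing import Dict, List

# Default limits copied from Table 1 (rows with "No limit" are omitted).
DEFAULT_LIMITS: Dict[str, int] = {
    "vaults_per_subscription": 25,
    "servers_per_vault": 250,
    "servers_per_recovery_plan": 50,
}


def limit_violations(planned: Dict[str, int]) -> List[str]:
    """Return a message for each planned value that exceeds its default limit."""
    return [
        f"{key}: planned {planned[key]} exceeds default limit {limit}"
        for key, limit in DEFAULT_LIMITS.items()
        if planned.get(key, 0) > limit
    ]


planned_deployment = {
    "vaults_per_subscription": 2,
    "servers_per_vault": 300,
    "servers_per_recovery_plan": 50,
}
for message in limit_violations(planned_deployment):
    print(message)
```

Here only the 300 servers planned for a single vault are flagged, which suggests either splitting the servers across vaults or requesting a higher limit.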
The following guidelines can help optimize and scale an ASR deployment:
Azure Site Recovery is a powerful solution offering a wide range of options to ensure an organization’s IT systems are highly available and protected in the event of a disaster. Understanding an organization’s business continuity requirements, along with the various components and features of Azure Recovery Service, will assist in developing a resilient DR plan.
Written and composed by our Microsoft Systems Engineer, Raul R. Perez II of AdaptivEdge
References
https://azure.microsoft.com/en-us/documentation/articles/site-recovery-best-practices/
https://azure.microsoft.com/en-us/support/legal/sla/site-recovery/v1_0/
https://azure.microsoft.com/en-us/documentation/articles/site-recovery-create-recovery-plans/
https://azure.microsoft.com/en-us/documentation/articles/site-recovery-workload/
https://blogs.msdn.microsoft.com/rslaten/2014/12/08/static-vs-dynamic-routing-gateways-in-azure/