Redundancy, Availability, Continuity, and Disaster Recovery
A key benefit of virtualising the infrastructure is the ability to build a platform that provides a reasonable, best-of-breed level of redundancy, availability, and access, along with continuity and recovery. Most on-premises systems provision these services poorly, and moving to a cloud-virtualisation platform offers the opportunity to create a highly redundant, fault-tolerant system.
Redundancy is achieved through a combination of hardware and/or software with the goal of ensuring continuous operation even after a failure. Should a primary component fail for any reason, the secondary systems are already online and take over seamlessly. Examples of redundancy are multiple power and cooling modules within a server, a RAID-enabled disk system, or a secondary network switch running in standby mode to take over if the primary network switch fails.
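The effect of redundancy on availability can be estimated with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that component failures are independent and that any single surviving copy can carry the load, which real hardware only approximates.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of redundant components.

    Assumption: failures are independent and one working copy is enough.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    # Example: a component that is available 99% of the time.
    for copies in (1, 2, 3):
        print(copies, f"{combined_availability(0.99, copies):.6f}")
```

Under these assumptions, pairing two components that are each 99% available gives roughly 99.99% for the pair, which is why a standby switch or a second power module has such a large effect.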
High availability (HA) is the concept of maximizing system uptime to get as close to 100% availability as possible. HA is usually measured as the time the system is online relative to unscheduled outages, expressed as a percentage of uptime over a period of time. Goals for cloud providers and customers consuming cloud services are often in the range of 99.99% uptime per year. The service-level agreement (SLA) determines what the cloud provider guarantees and which outages, such as routine maintenance, fall outside of the uptime calculation.
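A 99.99% target leaves roughly 52.6 minutes of downtime per year, or about 4.4 minutes per month. Many VMs, OSs, and applications take longer than a few minutes just to boot up, so a handful of restarts can exhaust that budget; HA configurations are necessary to achieve higher uptime requirements.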
To keep your systems at the 99.99% level or better, you must design your system with redundancy and HA in mind. If you are targeting a lesser SLA, disaster recovery or standby systems might be adequate.
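An uptime percentage is easier to reason about once it is converted into a downtime budget. The following is a minimal sketch of that arithmetic; the function names are hypothetical, a 365-day year is assumed, and removing SLA-excluded maintenance time from the measured period is one possible convention rather than a rule every SLA follows.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an uptime target over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (100.0 - uptime_percent) / 100.0


def measured_uptime_percent(unscheduled_outage_minutes: float,
                            excluded_minutes: float = 0.0,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Uptime achieved over a period.

    Assumption: time the SLA excludes (for example routine maintenance)
    is removed from the period before the percentage is computed.
    """
    counted = period_minutes - excluded_minutes
    if counted <= 0:
        raise ValueError("excluded time must be shorter than the period")
    return 100.0 * (counted - unscheduled_outage_minutes) / counted


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

With these definitions, a 99.99% yearly target corresponds to a budget of about 52.56 minutes, and a 99.9% target to about 525.6 minutes.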
You can achieve the highest possible availability through various networking, application, and redundant server techniques, such as the following:
The figure below is an example of an HA scenario. In this example, a VM has failed, and secondary VMs are running and ready to take over operations immediately. This configuration has two redundant servers: one in the same datacenter on a separate server blade, and another in the secondary datacenter. Failing over to a server within the same datacenter is ideal and the least likely to impact customers. The redundant servers in the secondary datacenter can take over primary operations should multiple servers or the entire primary datacenter experience an outage.
Figure: Example of HA: a failover
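The failover preference described in this scenario can be expressed as a small selection routine. This is only a sketch: the Server type, its fields, and the function name are hypothetical, and a real HA platform would also handle health checking, fencing, and traffic redirection.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Server:
    name: str
    datacenter: str
    healthy: bool


def choose_failover_target(failed: Server,
                           standbys: Sequence[Server]) -> Optional[Server]:
    """Pick the standby that should take over from a failed server."""
    healthy = [s for s in standbys if s.healthy]
    # Prefer a standby in the same datacenter: least impact on customers.
    for standby in healthy:
        if standby.datacenter == failed.datacenter:
            return standby
    # Otherwise fall back to any healthy standby in another datacenter.
    return healthy[0] if healthy else None


if __name__ == "__main__":
    failed = Server("vm-a", "primary", healthy=False)
    standbys = [
        Server("vm-c", "secondary", healthy=True),
        Server("vm-b", "primary", healthy=True),
    ]
    print(choose_failover_target(failed, standbys))
```

If the whole primary datacenter is down, every standby there is marked unhealthy and the routine falls through to the secondary datacenter, matching the second half of the scenario.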
Continuity of operations
Continuity of operations (CoO) is the concept of offering services even after a significant failure or disaster. Essentially, CoO is a series of failover techniques that keep the network, servers, storage, and applications running and available. In practice, CoO covers a broader scope: keeping your entire service online after a significant failure or disaster.
A continuity plan typically involves failing over to a secondary datacenter (or a secondary region, in the case of AWS) in the event that the primary datacenter becomes unavailable or is involved in a disaster. The network infrastructure, server farms, storage, and applications at the secondary datacenter are roughly the same as those in the primary and, most importantly, the data from the primary datacenter is continuously replicated to the secondary.
This combination of pre-staged infrastructure and synchronized data is what makes it possible to continue servicing users. The failover time in such a scenario is nevertheless sometimes measured in days, because much of disaster recovery (DR) must be initiated manually, although some applications can be back online in 4-6 hours.
Another part of a continuity plan deals with your staff and support personnel. If you must fail over to a secondary datacenter, how will you adequately manage your cloud environment at that time? Will your staff be able to work from home if the primary office or datacenter location is compromised? The logistics and business plans are a huge part of a complete continuity of operations plan; it isn’t only about the technology.
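The two preconditions named above, staged infrastructure and synchronized data, can be checked before a failover is attempted. The sketch below is an assumption-laden illustration: the SecondarySite type, its fields, and the lag threshold are all hypothetical, and how replication lag is actually measured depends on the storage or database product in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecondarySite:
    name: str
    infrastructure_staged: bool     # network, servers, storage, apps in place
    replication_lag_seconds: float  # how far the replica trails the primary


def ready_for_failover(site: SecondarySite, max_lag_seconds: float) -> bool:
    """True when the site is staged and its data is fresh enough.

    max_lag_seconds is a hypothetical tolerance chosen by the operator.
    """
    if max_lag_seconds < 0:
        raise ValueError("max_lag_seconds must not be negative")
    return (site.infrastructure_staged
            and site.replication_lag_seconds <= max_lag_seconds)


if __name__ == "__main__":
    site = SecondarySite("secondary-dc", True, replication_lag_seconds=12.0)
    print(ready_for_failover(site, max_lag_seconds=60.0))
```

A check like this only reports readiness; the decision to fail over, and the steps that follow, remain with the operators.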
A DR plan is similar to a CoO plan, but with one important addition: it also covers how to rebuild the datacenter, server farms, storage, network, or any other portion that was damaged by the disaster event. In the case of a total datacenter loss, the DR plan might also contain strategies to build a new datacenter in another cloud. An example would be to recreate your AWS architecture in Azure through a mix of manual and automated configuration.
The DR plan needs to document the steps to bring the secondary datacenter up to its “primary” counterpart’s standards in the event that there is no hope of returning operations to the primary. DR will and should remain a predominantly manual process, as discussed below.
The figure below represents a scenario in which the primary datacenter has failed, lost connectivity, or is otherwise entirely unavailable to handle production customers. Due to the replication of data (through the SAN in this example) and redundant servers, all applications and operations are now shifted to the secondary datacenter. A proper CoO plan not only allows for the failover from the primary to the secondary datacenter(s), but also documents a plan to switch all operations back to the first datacenter after the problem has been resolved.
Figure: Example of disaster recovery—a continuity scenario
DR is mostly a manual process. There are many reasons for this: