When architecting a cloud environment, “design for failure” is often seen as a best practice measure, ensuring a level of resiliency and availability that keeps the environment accessible even during cloud outages.
There are a number of common strategies that can be deployed, including the use of multiple availability zones, implementing premium storage tiers, and establishing an environment in a second (or even third) region for redundancy.
But ensuring resilience is often not the only priority for a cloud strategy. A level of cost control must also be applied. While these measures are effective at increasing the resilience of a Microsoft Azure environment, they all also increase its ultimate cost. As such, cost optimisation measures need to strike a balance between cost and availability.
Preparing for the worst
The design for failure methodology stems from the idea that resilience should be built with the worst-case scenario in mind. Avoiding disruptive downtime is important, but fears over accountability can also create an overly cautious resilience culture, which comes with a commercial impact. There is a subconscious desire to duplicate services, invest in significant buffer capacity and incorporate a range of “just in case” failover options.
But as the cloud environment grows and day-to-day administration continues, these decisions are rarely, if ever, revisited, allowing associated costs to increase as the environment sprawls. It’s this one-off, point-in-time decision-making, coupled with an over-engineered architecture, which tips the scales against effective cost optimisation.
Avoiding over-engineering
There are a handful of common patterns that indicate an over-engineered environment, where measures taken to improve Azure resilience have grown out of control. They often stem from good intentions, but without clear architectural guidance, they can spiral into unnecessary costs. Here are some common examples:
- Always-on capacity in multiple regions: Establishing a secondary region is a prudent strategy for high availability, but keeping both regions active duplicates costs. Often, the secondary region remains unused for prolonged periods, so those resources are paid for without delivering value.
- Zone redundancy by default: Some Microsoft Azure services, like Azure Storage and several PaaS offerings, provide zone redundancy, which replicates data across availability zones. This is beneficial, but not always necessary, especially for development and test environments where this level of redundancy isn’t business critical.
- Unnecessary premium SKUs: Premium tiers of Azure services come with better SLAs and extra features. But these bonuses aren’t needed for all (or even most) workloads. In some cases, the evolution of an Azure service makes a premium tier redundant. For example, now that Standard SSDs are reliable enough for many workloads, the Premium SSD tier is often no longer needed purely to ensure resilience.
- Overprovisioning for hypothetical traffic: Avoiding downtime from traffic spikes is a critical concern for an Azure resilience strategy, but the same peak-traffic assumptions are often applied to backup regions and failover paths as well. In practice, traffic tends to decline significantly while these are in use. Unless the business has a definite need for it, backup capacity doesn’t have to match production scale, which supports cost optimisation.
- Identical infrastructure across environments: While maintaining consistency across Development, UAT and Production environments is essential, they don’t all require the same architecture. This is especially true for more expensive capabilities like regional failover, which can quickly lead to excessive costs if duplicated across all environments.
None of these strategies is inherently wrong. There are circumstances where any or all of them may be the best option for the organisation. However, they should be regularly reviewed against the actual needs of users, customers, and the business at large to identify and rectify overinvestment.
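As a rough illustration of what such a regular review could look like, the sketch below checks a resource inventory against three of the patterns above. Everything in it is an assumption made for the example: the `Resource` fields, the sample inventory and the 5% utilisation threshold are hypothetical, and a real review would draw on exported Azure inventory and cost data rather than hand-written records.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical inventory record; field names are illustrative only."""
    name: str
    environment: str        # "prod", "uat" or "dev"
    sku_tier: str           # "standard" or "premium"
    zone_redundant: bool
    is_standby: bool        # True for secondary-region / failover resources
    avg_utilisation: float  # 0.0 to 1.0 over the review period


# Assumed threshold: a standby resource below 5% utilisation is worth a look.
IDLE_STANDBY_THRESHOLD = 0.05


def review_findings(resource: Resource) -> list[str]:
    """Return the over-engineering patterns this resource appears to match."""
    findings = []
    non_prod = resource.environment != "prod"
    if non_prod and resource.zone_redundant:
        findings.append("zone redundancy in a non-production environment")
    if non_prod and resource.sku_tier == "premium":
        findings.append("premium SKU in a non-production environment")
    if resource.is_standby and resource.avg_utilisation < IDLE_STANDBY_THRESHOLD:
        findings.append("always-on standby with near-zero utilisation")
    return findings


inventory = [
    Resource("orders-db-prod", "prod", "premium", True, False, 0.62),
    Resource("orders-db-dev", "dev", "premium", True, False, 0.08),
    Resource("orders-app-standby", "prod", "standard", False, True, 0.01),
]

for resource in inventory:
    for finding in review_findings(resource):
        print(f"{resource.name}: {finding}")
```

A finding here is a prompt for a conversation, not a verdict: a flagged resource may still be the right choice once business requirements are taken into account.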
Optimising Azure resiliency
Overarching resilience remains crucial, but this doesn’t have to come at the expense of true cost controls. Many low-impact services and non-critical applications, for example, don’t need the best possible uptime or premium resource tiers, as they can tolerate brief outages or minor performance degradation without significantly impacting the organisation. Truly designing for failure requires a strategic approach to cloud architecture that identifies what needs to be prioritised and pays attention to recovery.
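To make “brief outages” concrete, it can help to translate an availability target into the downtime it permits. The sketch below does that arithmetic for a 30-day month. The targets shown are illustrative figures, not the SLA of any particular Azure service.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime a given availability target permits in a 30-day month."""
    return MINUTES_PER_30_DAY_MONTH * (1 - availability_percent / 100)


# Illustrative targets only; check the actual SLA for each service in use.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per month")
```

Seen this way, the question for each workload becomes whether the business can genuinely not tolerate the larger downtime budget, or whether the tighter target is being bought out of habit.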
There are a few practical strategies that help to optimise costs while maintaining the resiliency of your Azure environment, helping to strike a balance:
- Utilise active-passive failover: Not every workload demands an active-active deployment across multiple regions. Active-passive models, which retain redundancy without paying to run both regions concurrently, can be a good fit for internal tools or low-traffic, customer-facing applications. Depending on recovery time objectives, it may be worth keeping the passive region warm, so it’s ready to jump into action whenever needed.
- Select lower tiers for standby solutions: Secondary regions don’t need to perfectly replicate your production environment. Using standard or lower-tier SKUs for backup paths reduces costs without compromising Azure resilience, especially if failover events are rare.
- Focus on rapid recovery: Availability is only one side of resilience. Quick recovery, enabled by automation and snapshots, is critical for an effective resilience strategy and provides an alternative to relying on infrastructure duplication.
- Align strategy with service criticality: While critical production systems may warrant premium uptime guarantees, the whole environment does not. Tailor your Azure resilience strategy to factor in the potential impact of failure, rather than applying a uniform standard across workloads.
- Closely monitor failover costs: Continuous monitoring of costs is essential for optimisation, especially where standby regions exist. Keep a close eye on the resources and services used in these environments and ensure they don’t inadvertently transition into production usage.
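The effect of aligning resilience with criticality can be sketched with some simple arithmetic. The example below is a minimal illustration under stated assumptions: the workload names, monthly costs, tier labels and standby cost ratios are all hypothetical placeholders, not Azure pricing. It compares applying active-active everywhere with choosing a failover model per criticality tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResilienceProfile:
    failover_model: str
    # Standby spend as a fraction of primary spend (assumed, not real pricing).
    standby_cost_ratio: float


# Hypothetical mapping from criticality tier to resilience profile.
PROFILES = {
    "critical": ResilienceProfile("active-active", 1.0),
    "important": ResilienceProfile("active-passive, warm standby", 0.4),
    "low": ResilienceProfile("restore from snapshots", 0.05),
}


def monthly_cost(primary_cost: float, criticality: str) -> float:
    """Primary cost plus the standby cost implied by the tier's profile."""
    profile = PROFILES[criticality]
    return primary_cost * (1 + profile.standby_cost_ratio)


# Hypothetical workloads: name -> (monthly primary cost, criticality tier).
workloads = {
    "payments-api": (10_000.0, "critical"),
    "intranet": (2_000.0, "important"),
    "report-archive": (1_000.0, "low"),
}

uniform = sum(monthly_cost(cost, "critical") for cost, _ in workloads.values())
tiered = sum(monthly_cost(cost, tier) for cost, tier in workloads.values())

print(f"Uniform active-active: {uniform:,.0f} per month")
print(f"Tiered by criticality: {tiered:,.0f} per month")
print(f"Difference: {uniform - tiered:,.0f} per month")
```

The ratios would need to come from your own billing data, but the structure of the calculation stays the same: the saving comes from the non-critical workloads, while the critical one keeps its full redundancy.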
Finding the balance
Cost optimisation and resilience can be competing priorities in Microsoft Azure, but by taking a careful, considered approach to cloud strategy, they can be reconciled. Consider the true business value of any given workload, and adjust its availability requirements accordingly, avoiding duplication, overprovisioning, and premium uptime guarantees when the potential risk is low.
The goal shouldn’t be to eliminate guardrails or undercut important redundancies, but to strike a more sustainable balance: a cost-conscious cloud architecture that delivers its full value, rather than becoming an unnecessary budget drain.
These measures aren’t the only way to optimise your Microsoft Azure environment. Advania Cloud Insights (ACI) can help identify redundancies and potential savings, as well as security and governance oversights that may be hampering overall resilience. It provides a critical level of visibility that guides the implementation of cost controls and unlocks the path to cost optimisation.
To find out more about the opportunities for optimisation available in Azure, read our previous blogs, which explore how to take advantage of a range of discounts, calculate the total cost of ownership of your environment, and automate FinOps processes to introduce new efficiencies.