Google deletes customer, Sogode publishes HOWTO
This week’s headlines have us talking about how to best prepare for inevitable cloud systems failures and, continuing on the topic, we have published a comprehensive guide on how to get started with Network as Code (or Infrastructure as Code, for that matter).
Google accidentally removes customer
From Australia comes the news that Google accidentally deleted UniSuper’s entire private cloud environment “due to an error in provisioning”. The incident will be analysed to death, but the more interesting question is: “How will they recover?”
This week we’ll look at how cloud environments can fail and what can be done in terms of recoverability.
How can clouds fail?
With increasing automation, similar incidents are likely to happen again. Backups of individual virtual machines or databases are no longer enough, because the complete infrastructure may be deleted.
The same could happen to the contents of your Microsoft Entra ID tenant, all of your Meraki networks or your entire CRM history.
Failure is not a black-and-white state of either working or not working.
Failure is any situation where the system does not perform as normal.
Failure includes general slowness, inability to log on from a particular continent, normal queries returning fewer results or unexpected data, etc.
Security Certification
A number of relevant certification standards exist, most notably ISO/IEC 27001 and ISO 22301.
The ISO/IEC 27001 standard provides companies of any size and from all sectors of activity with guidance for establishing, implementing, maintaining and continually improving an information security management system.
ISO 22301 is the international standard for Business Continuity Management Systems (BCMS). It provides a framework for organizations to plan, establish, implement, operate, monitor, review, maintain, and continually improve a documented management system to protect against, reduce the likelihood of, and ensure recovery from disruptive incidents.
How to prepare for SaaS failure
With SaaS failures, it’s important to remember that the service usually comes with an SLA, so a prolonged outage of the platform itself is relatively unlikely. Besides, you can’t really switch to a similar service temporarily whilst the SaaS platform is unavailable.
The more likely scenario is content no longer being reliable due to human error, a hack or internal sabotage. The list of possibilities is endless and history has shown that, against all odds, ‘The Unthinkable’ does happen.
This means that user profiles may have been tampered with, financial data may have been overwritten or settings may have been removed.
Preparation for SaaS failure must include periodic backups of the platform’s settings and content, paired with recovery exercises to confirm that the backups can actually be restored.
Do
- Take regular, automated backups
- Perform a test restore of backups at least once a year
- Store backups outside of the SaaS platform
- Protect backups from being overwritten or deleted
Don’t
- Assume SaaS is 100% safe and sound
- Plan on a second SaaS platform as a stand-in for the first
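The backup and recovery-exercise advice above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `export` dictionary stands in for whatever settings and content your SaaS platform lets you export, the function names `write_backup` and `restore_backup` are hypothetical, and `backup_dir` is assumed to point at storage outside the SaaS platform.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_backup(export: dict, backup_dir: Path) -> Path:
    """Write one timestamped, write-once snapshot plus its SHA-256 checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    snapshot = backup_dir / f"saas-export-{stamp}.json"
    payload = json.dumps(export, sort_keys=True).encode("utf-8")
    # Mode "xb" raises FileExistsError if the file already exists,
    # so an earlier snapshot is never silently overwritten.
    with open(snapshot, "xb") as fh:
        fh.write(payload)
    checksum_file = snapshot.with_name(snapshot.name + ".sha256")
    checksum_file.write_text(hashlib.sha256(payload).hexdigest())
    return snapshot


def restore_backup(snapshot: Path) -> dict:
    """Recovery exercise: verify the checksum, then load the snapshot."""
    payload = snapshot.read_bytes()
    checksum_file = snapshot.with_name(snapshot.name + ".sha256")
    expected = checksum_file.read_text().strip()
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {snapshot}")
    return json.loads(payload)
```

Running `restore_backup` on a schedule is the cheap part of a recovery exercise; it only proves the file is intact and readable. Whether the data can be loaded back into the platform still has to be tested separately.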
How to prepare for IaaS failure
Any IaaS resource must be deployed from code stored in a central repository, never from clicking around the GUI or web console.
Deployment from code gives you a repeatable way to redeploy the exact same infrastructure. The code repository must be kept away from the IaaS platform and be backed up like any other SaaS platform.
Use a platform-independent IaC tool so that it is not included in the blast radius of any fault in the IaaS platform. Most cloud platforms provide excellent IaC tools; however, these may get caught up in the same fault that affects the platform, which would prevent you from restoring the service.
Observability has to be built into the infrastructure so that we establish a baseline of normal behaviour and the ability to automatically detect any deviation from the norm.
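As a rough illustration of what "baseline" and "deviation" can mean in practice, the sketch below builds a baseline from historical samples of one metric (for example response time in milliseconds) and flags values that fall too far from it. The function names and the threshold of three standard deviations are assumptions made for this example; real monitoring tools use more robust methods.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarise normal behaviour as (mean, standard deviation).

    Needs at least two samples, otherwise statistics.stdev raises an error.
    """
    return mean(samples), stdev(samples)


def is_deviation(value: float, baseline: tuple[float, float],
                 threshold: float = 3.0) -> bool:
    """Return True if value lies more than `threshold` standard deviations
    away from the baseline mean."""
    centre, spread = baseline
    if spread == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return value != centre
    return abs(value - centre) / spread > threshold
```

With a baseline built from samples around 100 ms, a reading of 101 ms would pass unnoticed while a reading of 450 ms would be flagged.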
Lastly, any deployment must be subject to change control. This can be fully automated and instant by using preauthorised standard changes, so it does not slow down operations. Change control is important because it provides a timeline of events that can be played back to establish when a fault was introduced and to roll back to a state that was known to work well.
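To make the idea of a playable timeline concrete, here is a small sketch that picks the rollback target from a list of recorded changes. The `Change` record, its fields and the `last_known_good` helper are hypothetical names chosen for this example; a real change log would come from your change-control or version-control system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Change:
    applied_at: datetime  # when the change was deployed
    commit: str           # hypothetical identifier of the deployed revision
    healthy: bool         # whether post-deployment checks passed


def last_known_good(timeline: list[Change],
                    fault_seen_at: datetime) -> Optional[Change]:
    """Return the most recent healthy change applied at or before the
    moment the fault was observed, or None if there is no such change."""
    candidates = [
        c for c in timeline
        if c.healthy and c.applied_at <= fault_seen_at
    ]
    return max(candidates, key=lambda c: c.applied_at, default=None)
```

The returned `commit` is the revision you would redeploy from the code repository to get back to a working state.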
Do
- Use change control
- Create IaC modules and reuse them
- Include observability in deployments
- Ban any and all deployments from interactive consoles
Don’t
- Allow any changes to infrastructure from an interactive console
- Use a cloud-specific IaC environment
Sogode publishes howto
We have published a guide on how to get started with deploying Network as Code, using Terraform and GitHub.
The guide produces a fully working topology in AWS with multi-region connectivity and a template for ‘branches’ (VPCs) per application to securely separate resources.
More extensive guides will follow to advance on the topic and include operational practices.