Why it is important to write good code!

First there was Waterfall, then Agile… and then DevOps and now there is SRE (Site Reliability Engineering)…

But what is SRE? There is no official definition, but we use what Ben Treynor (VP Engineering @Google) said: “SRE is what happens when a S/W Engineer is tasked with what used to be called Operations”.

SRE was developed at Google, and for over a decade it has been their approach to what we refer to as Service Management. It is not academic; it came out of the Operations department itself. The fact that it is adopted at Google has of course increased its popularity.

SRE is not simply an approach. It is a whole effort that brings together development and operations, aiming to find the balance between releasing new features and making sure that they are reliable for the users (DEV vs OPS). SRE is a mindset that focuses on three aspects:

  1. Risk Assessment
  2. Standardisation
  3. Automation

It reduces ‘silos’: an SRE engineer is either a s/w developer with additional operations experience, or a sysadmin or IT Operations person who also has s/w development skills. People share ownership and have holistic knowledge of the pipeline, tests, deployment, load balancers, databases, infrastructure and configuration (to name a few…).

SRE is structured and requires procedures such as Change Management, Availability Management, Emergency Response, Capacity Management and measurements (e.g. SLA, SLI & SLO[*]).

Risk Assessment plays a very important role in decision making (e.g. product development will stop for x number of Sprints because the system has been down for y amount of time). With SRE, failure is planned for and accepted. An important concept in SRE is the so-called ‘error budget’, which depends on what the Risk Assessment reveals. The error budget is a control mechanism: it is simply your allowed downtime, defined and calculated through the various SLIs. Each time an SLI misses its target, you are consuming a portion of your allowed error budget.

But I don’t want to trouble you with calculations, especially now that we are in the middle of the summer.
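
For readers who do want to see the arithmetic behind an error budget, here is a minimal sketch in Python. It assumes a purely time-based availability SLO; the 99.9% target, the 30-day window and all function names are illustrative assumptions, not figures or tooling from any particular team.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available" (assumed value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    # Whatever the SLO does not promise is the budget you may "spend".
    return (1.0 - slo_target) * total_minutes


def remaining_budget_minutes(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after subtracting the downtime already observed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


def should_pause_feature_work(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> bool:
    """Hypothetical policy: pause new features once the budget is used up."""
    return remaining_budget_minutes(slo_target, downtime_minutes, window_days) <= 0


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Budget for 99.9% over 30 days: {budget:.1f} minutes")
    print("Pause after 50 minutes down?", should_pause_feature_work(0.999, 50))
```

With these assumed numbers the budget comes to roughly 43 minutes per 30 days, so 50 minutes of downtime would exhaust it. That is the kind of signal that can feed a decision to pause product development for a few Sprints.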

The whole idea of SRE, and its rule, is that s/w should solve the problems of infrastructure! Automate heavily, write good code, invest in and insist on quality, use modern application platforms such as Kubernetes, embrace microservices, and don’t be afraid to get your hands dirty!


Author: Rania Alexiou 


[*] SLA: Service-level Agreement, SLI: Service-level Indicator, SLO: Service-level Objective