SRE challenges and how to overcome them with insights from Ardent SRE Expert

3 March 2023 | Noor Khan

SRE challenges and how to overcome them

Site Reliability Engineering (SRE) is a growing discipline that ensures the uptime and availability of software by bridging the gap between development and operations teams. By leveraging software to effectively manage and monitor applications, Site Reliability Engineers enable software to scale easily without having to manually manage multiple systems.

SRE offers multiple benefits, such as improved collaboration between teams, increased efficiency through software and automation, a culture of continuous improvement and improved levels of software reliability and resilience. However, these benefits do not come without challenges. In this article, we look at some of the key SRE challenges and how to overcome them, with insights from Ardent’s experienced Site Reliability Engineer.

9 challenges of SRE and how to overcome them

Resistance to the SRE approach

SRE is a relatively new discipline, and it may be met with resistance by teams as it requires a change of mindset and approach. To overcome this, we recommend you pilot the SRE methodology and measure its success with key metrics. If goals and objectives are met, a further rollout would be the next step. Additionally, provide training so that teams grasp the SRE approach and are able to adopt and implement it effectively.

Choosing the right tools and technologies

Choosing the right tech stack can be challenging, especially if you do not have the expertise in-house. Start by identifying your key objectives and clearly defining your metrics of success; these can then inform your choice of tools and technologies. For example, at Ardent, one key metric we measure is the response time to incidents. A technology which enables us to swiftly report and communicate errors is PagerDuty.

“There is a wide variety of SRE technologies to choose from, including the likes of Prometheus for web monitoring and AWS CloudWatch and Data Log for data monitoring. We choose the right technology by establishing the client's budget and the technical requirements. For example, open source technologies are cost-effective as they are free to use.”

Shoaib Mulani, Site Reliability Engineer

Ensuring continuous reliability and uptime

The ultimate objective of SRE is to ensure the continuous reliability and uptime of software through the processes and tooling put in place. This can be challenging, especially when the software is updated regularly, whether with maintenance updates or feature updates. To overcome the challenge of ensuring 100 percent uptime, SRE teams must take a very structured and organised approach to error detection, communication and resolution.

“Automation is essential to every SRE team to ensure the reliability of infrastructure and applications. The two processes which should be automated are monitoring and reporting. For example, for a client project, we measure the server disk utilization to ensure there is no downtime. The monitoring and error reporting are automated to ensure that an alert pops up when the disk is fully utilised.”

Shoaib Mulani, Site Reliability Engineer
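
The disk utilisation check described in the quote above could be sketched roughly as follows. This is a minimal illustration using Python's standard library, not Ardent's actual tooling: the function names and the 80 and 95 percent thresholds are assumptions made for the example. Warning before the disk is completely full is also an assumption here, chosen so that engineers have time to react.

```python
import shutil


def disk_utilisation_percent(path="/"):
    """Return the percentage of space in use on the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def classify(percent, warn_at=80.0, critical_at=95.0):
    """Map a utilisation percentage to an alert level.

    The thresholds are hypothetical defaults; real values would be
    agreed per client and per system.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_utilisation_percent("/")
    level = classify(percent)
    # In practice this would be sent to an alerting tool rather than printed.
    print(f"disk utilisation {percent:.1f}% -> {level}")
```

A script like this would typically run on a schedule, with the "warning" and "critical" levels routed to whichever alerting tool the team has chosen.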

Additionally, if your software requires continuous ongoing operational monitoring and support, you may want to outsource the process as it can be a cost-effective solution.

Selecting the right metrics to measure

“A common challenge many SRE teams face is deciding which metrics they should be following. To overcome this challenge, at Ardent we measure traditional metrics such as CPU utilisation and disk utilisation, to name a few. However, for each client, we discuss their goals and objectives to identify and set the key metrics.”

Shoaib Mulani, Site Reliability Engineer

Managing incidents effectively

Managing incidents effectively will directly impact the reliability and uptime of software, both in the present and in the future. However, many businesses find that there are no effective, structured processes in place, which means there is a lack of learning from errors and mistakes and, consequently, those mistakes are repeated.

The following are the steps every SRE should take to overcome this challenge:

  • Establish set and structured procedures and policies in line with SLAs and ensure they are followed every time. You can do this through training and by building the procedures into the SRE team's workflow. They can cover everything from the relevant parties to contact to the steps to take when an incident is first detected.
  • Take an organised approach to recording incidents and maintain records as and when they happen.
  • Perform root cause analysis (RCA) to mitigate the risk of these errors occurring again.
  • Keep documentation and track everything, including the post-mortem reports of major incidents, to better prepare your organisation for any incidents in the future.
  • Implement an effective, clear communication model, as communication is crucial to effective SRE within your organisation. There should also be regular communication, whether daily, weekly or monthly, according to your business requirements.
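
To make the record-keeping and root cause steps above more concrete, here is a minimal sketch of a structured incident record. The `Incident` class, its fields and the two helper functions are hypothetical names invented for illustration; a real team would more likely keep these records in an incident management tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """One recorded incident; field names are illustrative assumptions."""
    title: str
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    root_cause: str  # filled in once root cause analysis is complete

    @property
    def time_to_acknowledge(self) -> timedelta:
        return self.acknowledged_at - self.detected_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at


def mean_time_to_acknowledge(incidents):
    """Average delay between detection and acknowledgement."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((i.time_to_acknowledge for i in incidents), timedelta())
    return total / len(incidents)


def recurring_root_causes(incidents, min_count=2):
    """Return root causes seen at least min_count times, with their counts."""
    counts = Counter(i.root_cause for i in incidents)
    return {cause: n for cause, n in counts.items() if n >= min_count}
```

Root causes that keep reappearing in `recurring_root_causes` are the ones where earlier fixes did not hold, which makes them natural candidates for a deeper post-mortem.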

“Keeping up and maintaining a run book is vital for every SRE team. Having set procedures in place to deal with incidents enables engineers to react quickly in order for a quick resolution.”

Shoaib Mulani, Site Reliability Engineer

Meeting service level expectations

To effectively meet service level expectations, the following should be clearly established and communicated to the entire SRE team and the key stakeholders:

  • SLA (Service Level Agreement) – The SLA will cover and detail how the service will be delivered, the communication channels and frequency, reporting type and frequency and more.
  • SLI (Service Level Indicators) – These are key metrics such as response time. For example, you may set a response time threshold of 30 minutes. If this is breached, then it becomes a problem.
  • SLO (Service Level Objectives) – These are the core objectives, for example, ensuring 95% uptime.
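
The SLI and SLO examples above translate into simple arithmetic. A 95% uptime objective over a 30-day window (720 hours) leaves 5% of that window, or 36 hours, as the maximum tolerated downtime. The sketch below shows that calculation along with the 30-minute response time check. The function names and the 30-day window are assumptions made for illustration only.

```python
from datetime import timedelta


def allowed_downtime(slo_percent, window):
    """Maximum downtime in the window that still meets the uptime SLO."""
    return window * (1 - slo_percent / 100)


def uptime_percent(downtime, window):
    """Measured uptime (an SLI) as a percentage of the window."""
    return (1 - downtime / window) * 100


def slo_met(downtime, window, slo_percent=95.0):
    """True if the measured downtime stays within the SLO's allowance."""
    return downtime <= allowed_downtime(slo_percent, window)


def response_time_breached(response_time, threshold=timedelta(minutes=30)):
    """True if an incident response took longer than the agreed threshold."""
    return response_time > threshold


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed reporting window
    print("allowed downtime:", allowed_downtime(95.0, window))
    print("SLO met after 40h down:", slo_met(timedelta(hours=40), window))
```

Expressing the objective as an allowance of downtime makes it easier to see, part-way through a reporting period, how much room is left before the SLO is at risk.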

Implementing automation

With automation at the heart of the SRE discipline, implementing automation across all processes is key to reducing toil, which uses up time that could otherwise be spent on high-value, mission-critical tasks. There are many automation tools on the market, including Terraform, Docker and Ansible.

Security challenges

“Security is a common challenge that SRE teams will face from time to time, therefore research and knowledge are key. To overcome common security challenges, ensure you are aware of the limitations of your tech stack when it comes to security. If these limitations present gaps within your solutions, they should be reported to the development team.”

Shoaib Mulani, Site Reliability Engineer

Finding and retaining valuable SREs

Hiring and retaining SREs remains a great challenge for many organisations, with DevOps.com reporting that demand for SRE-specific skills is high. If this is a challenge for your organisation, a more cost-effective solution can be to work with a third party and outsource your SRE processes. This reduces the time and resources required for finding, hiring and training SRE professionals.

SRE best practices as highlighted by Ardent’s Site Reliability Engineer Shoaib Mulani

  • Continuously engaging in and improving the whole lifecycle of services from inception and design, through deployment, operation and refinement.
  • Supporting services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Maintaining services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scaling systems sustainably through mechanisms like automation.
  • Evolving and optimising systems by actively pushing to create change that improves reliability and velocity.
  • Practicing sustainable incident response and blameless postmortems.

Key SRE benefits

  • Reduce software downtime
  • Bridge the gap between platform design, development and operations
  • Increase security and compliance
  • Mitigate the risk of human error with automation
  • Gain visibility into the health and performance of software and systems

SRE challenges and how to overcome them with Ardent

At Ardent, we have worked with many clients to provide ongoing operational monitoring and support of their systems, applications and data. Ardent's operational monitoring and support service incorporates the SRE discipline and offers invaluable benefits such as:

  • Continuous improvement and optimisation
  • Peace of mind with your systems, software and data being in expert hands
  • Swift error detection and resolution
  • A clear, defined structured approach
  • Around-the-clock monitoring and support

If you are facing the SRE challenges mentioned in this article and are exploring outsourcing SRE, then you have come to the right place. Get in touch to find out more and we can discuss our three-tier structure to find a solution that is unique to your challenges, needs and requirements.

Ardent Expert: Shoaib Mulani

Shoaib Mulani is a highly knowledgeable Site Reliability Engineer with significant experience in the field. He has worked on many SRE projects leveraging a wide variety of SRE tools and technologies to deliver excellence to our clients.

