
Salla
Senior Site Reliability Engineer (SRE)
- Permanent
- Mecca, Saudi Arabia
- Experience: 2 - 5 years
Job expiry date: 02/04/2026
Job overview
Date posted: 16/02/2026
Location: Mecca, Saudi Arabia
Salary: SAR 20,000 - 30,000 per month
Job description
Salla is seeking a Senior Site Reliability Engineer (SRE) for a hybrid role in Makkah (Mecca), Saudi Arabia. The SRE will lead reliability initiatives, improve platform performance, handle complex incidents, and guide engineering teams in building resilient, fault-tolerant systems. The role includes participation in on-call rotations to support production systems.
Incident work covers leading high-severity incident response, running post-incident reviews, troubleshooting complex issues across applications, infrastructure, and networks, and reducing mean time to recovery (MTTR) through better monitoring, alerting, and diagnostic tooling. Performance work covers identifying and resolving bottlenecks, conducting load testing, and planning capacity for high-traffic scenarios, alongside enhancing cloud-native infrastructure, deployment processes, automation, resilience, and recovery mechanisms.
The role also emphasizes observability: building and refining dashboards, metrics, logs, and traces, defining SLIs/SLOs, and improving visibility into system behavior. Additional duties include developing tooling to reduce operational toil, contributing to infrastructure-as-code, CI/CD pipelines, and GitOps workflows, collaborating closely with engineering teams to ensure services are production-ready, and mentoring engineers on reliability, debugging, and operational best practices.
The position requires expertise in Kubernetes and service mesh technologies, cloud platforms (AWS, GCP, or Azure), Linux, networking, distributed systems, load balancing, Terraform or equivalent IaC tools, observability platforms (Prometheus, Grafana, Loki, Mimir, Elastic, etc.), scripting and programming languages (Bash, Python, Go), CI/CD pipelines, and GitOps practices, together with strong debugging, incident response, and performance analysis skills. Familiarity with fault-tolerant design, disaster recovery (DR), high-availability (HA) patterns, SLIs, SLOs, and error budgets is an advantage.
Key responsibilities
- Lead high-severity incident response and drive post-incident reviews to improve operational reliability
- Troubleshoot complex issues across applications, infrastructure, and network systems ensuring timely resolution
- Improve MTTR by implementing enhanced monitoring, alerts, and diagnostic tooling across production systems
- Participate in the on-call rotation to provide continuous support for production systems
- Identify and resolve performance bottlenecks and scaling challenges through load testing and capacity planning
- Enhance cloud-native infrastructure, deployment processes, automation, and operational resilience
- Build and refine observability platforms including dashboards, metrics, logs, and traces, and define SLIs/SLOs to improve system visibility
- Develop tooling and automation to reduce operational toil and increase platform reliability
- Contribute to infrastructure-as-code (Terraform), CI/CD pipelines, and GitOps workflows to streamline deployments
- Collaborate with engineering teams to ensure services are robust, production-ready, and aligned with reliability standards
- Mentor engineers on reliability, debugging, incident response, and operational best practices
Experience & skills
- Strong experience with Kubernetes and service mesh technologies for managing cloud-native applications
- Hands-on expertise in cloud platforms such as AWS, GCP, or Azure
- Deep understanding of Linux systems, networking, distributed systems, and load balancing
- Proficiency in Infrastructure-as-Code tools such as Terraform
- Experience with observability platforms including Prometheus, Grafana, Loki, Mimir, Elastic, or equivalents
- Proficiency in scripting and programming languages such as Bash, Python, or Go
- Experience with CI/CD pipelines and GitOps practices for deployment automation
- Demonstrated capability in debugging, incident response, and performance analysis for high-traffic systems
- Familiarity with fault-tolerant design, disaster recovery, high-availability patterns, SLIs/SLOs, and error budgets is a plus
- Ability to work collaboratively in hybrid team environments and participate effectively in on-call rotations