Salla

Senior Site Reliability Engineer (SRE)

Permanent
Mecca, Saudi Arabia
Experience 2 - 5 yrs

Job overview

Date posted
16/02/2026
Location
Mecca, Saudi Arabia
Salary
SAR 20,000 - 30,000 per month

Job description

Salla is seeking a Senior Site Reliability Engineer (SRE) for a hybrid role in Makkah, Saudi Arabia. The role is responsible for leading reliability initiatives, improving platform performance, handling complex incidents, and guiding engineering teams in building resilient, fault-tolerant systems. It includes participation in on-call rotations to support production systems.

The SRE will lead high-severity incident response, perform post-incident reviews, troubleshoot complex issues across applications, infrastructure, and networks, and improve mean time to recovery (MTTR) through enhanced monitoring, alerts, and diagnostic tooling. Responsibilities also include identifying and resolving performance bottlenecks, conducting load testing, planning capacity for high-traffic scenarios, and enhancing cloud-native infrastructure, deployment processes, automation, resilience, and recovery mechanisms.

The role emphasizes observability: building and refining dashboards, metrics, logs, and traces, defining SLIs/SLOs, and improving visibility into system behavior. Additional duties involve developing tooling to reduce operational toil, contributing to infrastructure-as-code, CI/CD pipelines, and GitOps workflows, collaborating closely with engineering teams to ensure services are production-ready, and mentoring engineers on reliability, debugging, and operational best practices.

The position requires expertise in Kubernetes, service mesh technologies, cloud platforms (AWS, GCP, or Azure), Linux, networking, distributed systems, load balancing, Terraform or equivalent IaC tools, observability platforms (Prometheus, Grafana, Loki, Mimir, Elastic, etc.), scripting and programming languages (Bash, Python, Go), CI/CD pipelines, and GitOps practices, along with strong debugging, incident response, and performance analysis skills. Familiarity with fault-tolerant design, disaster recovery (DR), high-availability (HA) patterns, SLOs, SLIs, and error budgets is advantageous.

Required skills

Site Reliability Engineering (SRE)

incident management

high-severity incident response

post-incident reviews

troubleshooting applications

troubleshooting infrastructure

network troubleshooting

MTTR improvement

monitoring and alerting

diagnostic tooling

load testing

capacity planning

cloud-native infrastructure

automation

resilience and fault-tolerance

recovery mechanisms

observability

dashboards and metrics

logs and traces

Key responsibilities

Lead high-severity incident response and drive post-incident reviews to improve operational reliability
Troubleshoot complex issues across applications, infrastructure, and network systems, ensuring timely resolution
Improve MTTR by implementing enhanced monitoring, alerts, and diagnostic tooling across production systems
Participate in the on-call rotation to provide continuous support for production systems
Identify and resolve performance bottlenecks and scaling challenges through load testing and capacity planning
Enhance cloud-native infrastructure, deployment processes, automation, and operational resilience
Build and refine observability platforms including dashboards, metrics, logs, and traces, and define SLIs/SLOs to improve system visibility
Develop tooling and automation to reduce operational toil and increase platform reliability
Contribute to infrastructure-as-code (Terraform), CI/CD pipelines, and GitOps workflows to streamline deployments
Collaborate with engineering teams to ensure services are robust, production-ready, and aligned with reliability standards
Mentor engineers on reliability, debugging, incident response, and operational best practices

Experience & skills

Strong experience with Kubernetes and service mesh technologies for managing cloud-native applications
Hands-on expertise in cloud platforms such as AWS, GCP, or Azure
Deep understanding of Linux systems, networking, distributed systems, and load balancing
Proficiency in Infrastructure-as-Code tools such as Terraform
Experience with observability platforms including Prometheus, Grafana, Loki, Mimir, Elastic, or equivalents
Proficiency in scripting and programming languages such as Bash, Python, or Go
Experience with CI/CD pipelines and GitOps practices for deployment automation
Demonstrated capability in debugging, incident response, and performance analysis for high-traffic systems
Familiarity with fault-tolerant design, disaster recovery, high-availability patterns, SLIs/SLOs, and error budgets is a plus
Ability to work collaboratively in hybrid team environments and participate effectively in on-call rotations
