Salla

Senior Site Reliability Engineer – Data & ML Ops

Permanent
Medina, Saudi Arabia
Experience 5 - 10 yrs
Urgent

Job overview

Date posted
28/03/2026
Location
Medina, Saudi Arabia
Salary
SAR 20,000 - 30,000 per month
Compensation
Comprehensive package
Experience
5 - 10 yrs
Seniority
Senior & Lead
Qualification
Bachelor's degree
Expiration date
12/05/2026

Job description

We are seeking a Senior Site Reliability Engineer (SRE) with strong Data & ML Ops expertise to join our hybrid team in Medina, Saudi Arabia. The ideal candidate will own the reliability, scalability, and performance of a rapidly growing platform infrastructure, working across customer-facing applications, internal platforms, and data pipelines. The role emphasizes automation, cloud infrastructure management, monitoring, and proactive incident prevention to keep production systems highly available, secure, and cost-efficient.

Required skills

Site Reliability Engineering (SRE)

Kubernetes (EKS/AKS/GKE)

Cloud Platforms (AWS, Azure, GCP)

Infrastructure as Code (Terraform, Pulumi, CloudFormation)

CI/CD and GitOps (ArgoCD, Flux, GitHub Actions)

Database Administration (SQL and NoSQL: MySQL/Aurora, PostgreSQL, MongoDB, Redis)

Storage Management (EBS, EFS, NFS, MinIO/S3-compatible)

Observability & Monitoring (Prometheus, Grafana, Loki, VictoriaMetrics, ELK/OpenSearch)

Networking & Service Mesh (Istio, Linkerd, ingress/egress control)

Automation & Scripting (Python, Bash, Go)

Disaster Recovery & Backup Strategies

Performance & Cost Optimization

Incident Management & Postmortem Analysis

Security & Compliance (IAM, RBAC, secret management, network ACLs/firewalls)

Data Platform Integration (Airflow, Debezium, ClickHouse)

Key responsibilities

Design, deploy, monitor, and maintain production workloads across multi-cluster Kubernetes environments (EKS/AKS/GKE).
Build self-healing, auto-scaling systems that minimize manual intervention and ensure uptime for mission-critical services.
Design and operate reliable database and storage platforms (SQL, NoSQL, and object storage) within Kubernetes environments.
Implement backup, disaster recovery, replication, and failover strategies to meet RPO/RTO objectives.
Troubleshoot and recover Kubernetes Persistent Volumes (StorageClasses, CSI drivers, PVC issues) during incidents.
Optimize storage performance and cost using multi-tier strategies, hot/cold data separation, and lifecycle policies.
Secure and scale object storage platforms (MinIO/S3-compatible) for high-throughput data pipelines.
Manage block and shared storage (EBS/io2/gp3, EFS, NFS) for performance, resiliency, and cost balance.
Champion GitOps and CI/CD best practices using ArgoCD, Flux, and GitHub Actions; automate infrastructure provisioning and upgrades with Terraform, Helm, and Kubernetes Operators.
Lead monitoring and alerting stack operations (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch) and participate in incident response, root cause analysis, and postmortems.
Implement security best practices including IAM policies, RBAC, secret management, network ACLs/firewalls, and secure image supply chain enforcement.
Collaborate with application, platform, and data teams to optimize performance, cost, and operational efficiency of cloud and on-premises infrastructure.
Introduce cost visibility dashboards and continuous improvement initiatives to ensure efficient resource usage.

Experience & skills

8+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
Deep Kubernetes expertise with multi-cluster management, Helm chart development, and advanced networking.
Proficiency with GitOps workflows using ArgoCD or Flux.
Hands-on experience with cloud infrastructure (AWS preferred, Azure/GCP acceptable) and Infrastructure-as-Code tools (Terraform, Pulumi, CloudFormation).
Strong knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis).
Advanced scripting and automation skills in Python, Bash, or Go.
Experience with observability and monitoring tools (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics).
Experience with CI/CD pipelines, progressive delivery strategies, and production incident management.
Familiarity with streaming/messaging platforms (Kafka, RabbitMQ, or similar) is a plus.
Strong problem-solving skills, teamwork, and effective communication across engineering, DevOps, security, and product teams.
