
Salla
Senior Site Reliability Engineer – Data & ML Ops
- Permanent
- Medina, Saudi Arabia
- Experience 5 - 10 yrs
- Urgent
Job expiry date: 12/05/2026
Job overview
Date posted: 28/03/2026
Location: Medina, Saudi Arabia
Salary: SAR 20,000 - 30,000 per month
Compensation: Comprehensive package
Experience: 5 - 10 yrs
Seniority: Senior & Lead
Qualification: Bachelor's degree
Expiration date: 12/05/2026
Job description
We are seeking a Senior Site Reliability Engineer (SRE) with strong Data & ML Ops expertise to join our hybrid team in Medina, Saudi Arabia. The ideal candidate will own the reliability, scalability, and performance of a rapidly growing platform infrastructure, working across customer-facing applications, internal platforms, and data pipelines. The role emphasizes automation, cloud infrastructure management, monitoring, and proactive incident prevention to keep production systems highly available, secure, and cost-efficient.
Key responsibilities
- Design, deploy, monitor, and maintain production workloads across multi-cluster Kubernetes environments (EKS/AKS/GKE).
- Build self-healing, auto-scaling systems that minimize manual intervention and ensure uptime for mission-critical services.
- Design and operate reliable database and storage platforms (SQL, NoSQL, and object storage) within Kubernetes environments.
- Implement backup, disaster recovery, replication, and failover strategies to meet RPO/RTO objectives.
- Troubleshoot and recover Kubernetes Persistent Volumes (StorageClasses, CSI drivers, PVC issues) during incidents.
- Optimize storage performance and cost using multi-tier strategies, hot/cold data separation, and lifecycle policies.
- Secure and scale object storage platforms (MinIO/S3-compatible) for high-throughput data pipelines.
- Manage block and shared storage (EBS/io2/gp3, EFS, NFS) for performance, resiliency, and cost balance.
- Champion GitOps and CI/CD best practices using ArgoCD, Flux, and GitHub Actions; automate infrastructure provisioning and upgrades with Terraform, Helm, and Kubernetes Operators.
- Lead monitoring and alerting stack operations (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch) and participate in incident response, root cause analysis, and postmortems.
- Implement security best practices including IAM policies, RBAC, secret management, network ACLs/firewalls, and secure image supply chain enforcement.
- Collaborate with application, platform, and data teams to optimize performance, cost, and operational efficiency of cloud and on-premises infrastructure.
- Introduce cost visibility dashboards and continuous improvement initiatives to ensure efficient resource usage.
Experience & skills
- 8+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Deep Kubernetes expertise with multi-cluster management, Helm chart development, and advanced networking.
- Proficiency with GitOps workflows using ArgoCD or Flux.
- Hands-on experience with cloud infrastructure (AWS preferred, Azure/GCP acceptable) and Infrastructure-as-Code tools (Terraform, Pulumi, CloudFormation).
- Strong knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis).
- Advanced scripting and automation skills in Python, Bash, or Go.
- Experience with observability and monitoring tools (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics).
- Experience with CI/CD pipelines, progressive delivery strategies, and production incident management.
- Familiarity with streaming/messaging platforms (Kafka, RabbitMQ, or similar) is a plus.
- Strong problem-solving skills, teamwork, and effective communication across engineering, DevOps, security, and product teams.