
Lucidya
Site Reliability Engineer
- Permanent
- Riyadh, Saudi Arabia
- Experience: 2 - 5 yrs
Job expiry date: 08/05/2026
Job overview
Date posted: 24/03/2026
Location: Riyadh, Saudi Arabia
Salary: SAR 15,000 - 20,000 per month
Compensation: Salary only
Experience: 2 - 5 yrs
Seniority: Experienced
Qualification: Bachelor's degree
Expiration date: 08/05/2026
Job description
As a Site Reliability Engineer at Lucidya in Riyadh, you will ensure the reliability, performance, and scalability of our AI-native customer experience platform. You will design, implement, and maintain highly available, fault-tolerant infrastructure across cloud environments, proactively identify potential failures, and build automation to eliminate manual operational work. You will manage Kubernetes clusters, optimize cloud resources, improve CI/CD pipelines, and implement monitoring and observability systems to prevent downtime. The role requires close collaboration with engineering and DevOps teams, responding to incidents, performing root cause analyses, and driving improvements to make our systems robust and scalable.
Required skills
Key responsibilities
- Design and maintain infrastructure that is highly available, fault-tolerant, and scalable
- Proactively identify and eliminate single points of failure to prevent incidents
- Manage cloud workloads across AWS, GCP, or Azure using Infrastructure as Code (Terraform)
- Operate and scale Kubernetes clusters, troubleshoot issues, and ensure smooth deployments
- Implement and refine monitoring and alerting systems (Prometheus, Grafana, Datadog, ELK)
- Respond to incidents, lead root cause analysis, and implement preventive measures
- Automate workflows and infrastructure management to eliminate repetitive manual tasks
- Optimize cloud resource usage to balance cost and performance
- Collaborate with DevOps and engineering teams to solve performance bottlenecks
- Contribute to CI/CD improvements and deployment reliability
- Document infrastructure, processes, and incidents to support knowledge sharing
- Identify opportunities to improve system reliability, scalability, and operational efficiency
Experience & skills
- 3+ years of experience in SRE, DevOps, or infrastructure engineering
- Hands-on experience with cloud platforms (AWS, GCP, or Azure) and distributed systems
- Proficient with Kubernetes and Docker in production environments
- Experience with Infrastructure as Code (Terraform or similar)
- Strong scripting skills in Python, Bash, or similar languages
- Understanding of CI/CD pipelines and automation
- Knowledge of networking, load balancing, and high-availability design
- Experience implementing monitoring and observability tools (Prometheus, Grafana, Datadog, ELK)
- Ability to troubleshoot complex issues and perform root cause analysis
- Calm under pressure and methodical in incident response
- Excellent communication and collaboration skills
- Ownership mindset and proactive approach to reliability challenges
- Cloud or Linux certifications are a plus
- Experience with RabbitMQ or Redis in production environments is a plus
- Familiarity with Ansible or AWX is advantageous
- Exposure to multi-cloud or hybrid environments is a plus