
G42
Lead Site Reliability Engineer - Azure & Hybrid Platforms
- Permanent
- Abu Dhabi, United Arab Emirates
- Experience 2 - 5 yrs
- Urgent
Job expiry date: 18/04/2026
Job overview
Date posted: 04/03/2026
Location: Abu Dhabi, United Arab Emirates
Salary: Undisclosed
Job description
The Lead Site Reliability Engineer - Azure & Hybrid Platforms at Inception, a G42 company based in Abu Dhabi, owns the reliability, observability, and automation of Azure and hybrid environments, including Azure Stack and on-prem platforms. The role leads Site Reliability Engineering practices for AI, data, and business-critical application services, drives a cloud-agnostic DevSecOps toolchain, and ensures platforms are secure, scalable, resilient, and cost-efficient.
The position requires deep expertise in Microsoft Azure at scale and in Azure Data and AI services such as Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services. It also requires strong hands-on experience with containers and Kubernetes, including Azure Kubernetes Service (AKS), autoscaling, upgrades, and production operations.
The engineer builds and maintains end-to-end observability using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms, implementing metrics, logs, traces, dashboards, and low-noise alerts across Azure and on-prem environments. The role leads Infrastructure-as-Code and automation initiatives using Terraform, Bicep, Ansible, and scripting in Python, PowerShell, Bash, and Go, driving self-healing systems, runbook-driven operations, AI-assisted orchestration, and autonomous agents that automate security compliance and infrastructure management.
Responsibilities include owning SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation, and GPU/accelerator workload optimisation, and ensuring hybrid networking and secure connectivity using ExpressRoute/VPN, private endpoints, Azure AD, and key management. The Lead SRE manages P0/P1 incidents, on-call rotations, blameless post-mortems, and long-term reliability improvements using ITSM and DevSecOps tools such as ServiceNow, Jira, ManageEngine, cloud-agnostic CI/CD, security scanning, and policy-as-code, while operating within Agile/Scrum and ITIL processes and supporting ISO 27001 compliance and external audits.
Success is defined by 99.9%+ availability, MTTD under 5 minutes, MTTR within 15–30 minutes for P0 incidents, an approximately 50% reduction in manual toil through automation and self-service, and documented, tested DR/BCP for AI, data, and application platforms.
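As a rough illustration of how the availability and MTTR targets relate, the sketch below converts an availability SLO into an error budget. The 30-day window, the simplification that each P0 incident is a full outage lasting exactly the MTTR, and the function names are assumptions made for illustration; they are not part of the role definition.

```python
# Illustrative sketch only: window length and function names are assumptions.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability permitted by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def incidents_within_budget(slo: float, mttr_minutes: float, window_days: int = 30) -> int:
    """Number of full-outage incidents of length mttr_minutes that fit in the budget."""
    return int(error_budget_minutes(slo, window_days) // mttr_minutes)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Error budget at 99.9% over 30 days: {budget:.1f} minutes")
    print("P0 incidents that fit at 30 min MTTR:", incidents_within_budget(0.999, 30))
    print("P0 incidents that fit at 15 min MTTR:", incidents_within_budget(0.999, 15))
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, so only one or two P0 incidents resolved within the stated MTTR range fit inside the budget.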
Required skills
Key responsibilities
- Own SLOs/SLIs, error budgets, and overall reliability for Azure and on-prem platforms supporting data, AI/ML, and business-critical applications, ensuring 99.9%+ availability and measurable service performance outcomes.
- Plan and optimise capacity, performance, and cost across compute, storage, networking, and GPU/accelerator workloads so that Azure and hybrid environments scale efficiently.
- Build and maintain end-to-end observability using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms, implementing metrics, logs, traces, dashboards, and meaningful, low-noise alerts.
- Lead Infrastructure-as-Code and automation initiatives using Terraform, Bicep, Ansible, and scripting in Python, PowerShell, Bash and Go, driving self-healing systems, runbook-driven operations, AI-assisted orchestration, and autonomous agents to reduce manual toil by approximately 50%.
- Operate and manage Kubernetes clusters across Azure Kubernetes Service (AKS), Azure Stack, and on-prem environments, covering autoscaling, upgrades, and production operations, and ensure secure, resilient hybrid connectivity using ExpressRoute/VPN, private endpoints, Azure AD, and key management.
- Lead P0/P1 incident response, on-call rotations, stakeholder communication, and blameless post-mortems, and implement long-term fixes to achieve MTTD under 5 minutes and MTTR within 15–30 minutes for critical incidents.
- Use ITSM and DevSecOps tools including ServiceNow, Jira, ManageEngine, cloud-agnostic CI/CD, security scanning, and policy-as-code to manage change, incidents, and compliance, support ISO 27001 alignment and external audits, and ensure DR/BCP is documented and tested for AI, data, and application platforms.
- Provide technical leadership and mentoring to SREs and platform engineers, collaborate with data, AI/ML, application, and security teams, and embed reliability and security-by-design principles from initial architecture through production operations.
Experience & skills
- Demonstrate 10+ years of experience in SRE, DevOps, or platform engineering roles, including 5+ years designing and operating workloads on Microsoft Azure at scale across enterprise environments.
- Exhibit strong hands-on expertise with Azure Data and AI services including Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services in production environments.
- Show deep proficiency with containers and Kubernetes including Azure Kubernetes Service (AKS), covering autoscaling, cluster upgrades, production operations, and resilient hybrid deployments.
- Apply advanced Infrastructure-as-Code capabilities using Terraform, Bicep, and Ansible, along with scripting and programming in Python and/or PowerShell, with additional exposure to Go and Bash for automation and operational tooling.
- Implement robust observability practices using Azure Monitor, Log Analytics, Application Insights, Prometheus, and Grafana, and design monitoring, alerting, and tracing frameworks in production systems.
- Show proven experience implementing SRE practices including SLOs/SLIs, error budgets, capacity planning, and cost/performance optimisation, with measurable impact on availability and operational efficiency.
- Demonstrate familiarity with hybrid networking, identity, and security including ExpressRoute/VPN, private endpoints, Azure AD, key management, security scanning, and policy-as-code within DevSecOps environments.
- Operate within Agile/Scrum and ITIL processes, contribute to ISO 27001 compliance and external audits, and leverage AI-assisted orchestration and autonomous agents to streamline infrastructure management and automate security compliance.