Lead Site Reliability Engineer - Azure & Hybrid Platforms
by G42 in Artificial Intelligence
The Lead Site Reliability Engineer - Azure & Hybrid Platforms at Inception, a G42 company based in Abu Dhabi, owns the reliability, observability, and automation of Azure and hybrid environments, including Azure Stack and on-premises platforms. The role leads Site Reliability Engineering practices for AI, data, and business-critical application services, drives a cloud-agnostic DevSecOps toolchain, and ensures platforms are secure, scalable, resilient, and cost-efficient.
The position requires deep expertise in Microsoft Azure at scale and in Azure Data and AI services such as Azure Synapse Analytics, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Machine Learning, Azure OpenAI Service, and Azure Cognitive Services. It also requires strong hands-on experience with containers and Kubernetes, including Azure Kubernetes Service (AKS), autoscaling, upgrades, and production operations.
The engineer builds and maintains end-to-end observability across Azure and on-premises environments using Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, and central log platforms, implementing metrics, logs, traces, dashboards, and low-noise alerting.
The role leads Infrastructure-as-Code and automation initiatives using Terraform, Bicep, Ansible, and scripting in Python, PowerShell, Bash, and Go. These initiatives cover self-healing systems, runbook-driven operations, AI-assisted orchestration, and autonomous agents that automate security compliance and infrastructure management.
Responsibilities include owning SLOs/SLIs, error budgets, capacity planning, cost/performance optimisation, and GPU/accelerator workload optimisation, as well as ensuring hybrid networking and secure connectivity using ExpressRoute/VPN, private endpoints, Azure AD, and key management.
The Lead SRE manages P0/P1 incidents, on-call rotations, blameless post-mortems, and long-term reliability improvements using ITSM and DevSecOps tools such as ServiceNow, Jira, ManageEngine, cloud-agnostic CI/CD, security scanning, and policy-as-code. The role operates within Agile/Scrum and ITIL processes and supports ISO 27001 compliance and external audits.
Success is defined by 99.9%+ availability, MTTD < 5 minutes, MTTR < 15–30 minutes for P0 incidents, an approximately 50% reduction in manual toil through automation and self-service, and documented, tested DR/BCP plans for AI, data, and application platforms.
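To make the availability target concrete, the following is a minimal sketch of how a 99.9% SLO translates into an error budget. The 30-day measurement window and the function names `error_budget` and `budget_remaining` are illustrative assumptions; the posting does not specify a window or any tooling.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime allowed within `window` for an availability SLO.

    `slo` is a fraction (0.999 for 99.9%). The window length is an
    assumption made by the caller, not something the posting defines.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget for the window is exhausted.
    """
    budget = error_budget(slo, window)
    return 1.0 - (downtime / budget)


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    budget = error_budget(0.999, window)
    print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")

    # One P0 incident resolved at the upper end of the MTTR target.
    incident = timedelta(minutes=30)
    left = budget_remaining(0.999, window, incident)
    print(f"Budget remaining after one 30-minute incident: {left:.0%}")
```

Under these assumptions, 99.9% availability allows roughly 43 minutes of downtime per 30 days, so a single P0 incident that takes the full 30 minutes to resolve would consume about 69% of the budget. This is why the MTTD and MTTR targets matter alongside the availability figure.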