
G42
Lead Data Engineer – AI/LLM Data Platforms & RAG Systems
- Permanent
- Abu Dhabi, United Arab Emirates
- Experience: 5-10 yrs
- Urgent
Job expiry date: 16/05/2026
Job overview
Date posted
01/04/2026
Location
Abu Dhabi, United Arab Emirates
Salary
AED 30,000 - 40,000 per month
Job description
Inception, a G42 company and regional innovator of AI-powered domain-specific and industry-agnostic products built on research and development, is seeking a Lead Data Engineer to architect and build scalable, cloud-native data and AI pipelines supporting enterprise LLM, RAG, and retrieval systems.

The role involves designing scalable data pipelines for AI and LLM workloads, including vectorization and embedding processing; developing ETL/ELT workflows for structured, unstructured, and streaming data; and creating vector database indexing and similarity search pipelines using FAISS, Pinecone, Weaviate, Qdrant, and Chroma. The Lead Data Engineer will build retrieval systems supporting RAG, semantic search, and enterprise knowledge retrieval, while developing reusable orchestration pipelines using Apache Airflow, Apache Spark, and distributed processing frameworks.

The position requires architecting multi-cloud data pipelines with Azure as the primary cloud platform and with AWS and GCP integration, and optimizing storage and processing across SQL, NoSQL, and vector databases, including MongoDB, DynamoDB, and Cosmos DB. The role also includes contributing to event-driven architecture design, collaborating with AI teams to enable embedding generation and LLM integration, ensuring data quality, monitoring, reliability, and observability, and leading system design for large-scale distributed data platforms. Additional responsibilities include integrating embedding generation pipelines via Hugging Face, OpenAI, or other model providers; implementing containerized deployments using Docker; working in Kubernetes environments; and applying DevOps practices, CI/CD pipelines, version control, and observability best practices.
The role also requires experience building RAG pipelines in production, knowledge of graph databases and hybrid search systems, an understanding of model deployment, inference optimization, and caching techniques for LLM workloads, and the ability to apply data governance, IAM, and security patterns across cloud ecosystems to support enterprise-grade AI infrastructure.
Key responsibilities
- Design, architect, and optimize scalable cloud-native data pipelines for AI and LLM workloads including vectorization, embedding processing, and large-scale data ingestion while ensuring compatibility with enterprise retrieval systems, semantic search platforms, and RAG pipelines across distributed data infrastructure environments.
- Develop and maintain ETL/ELT workflows for structured, unstructured, and streaming data by implementing scalable data ingestion, transformation, and processing frameworks using Python, SQL, Apache Spark, and distributed processing tools while ensuring data quality, performance optimization, and reliability.
- Create and manage vector database indexing and similarity search pipelines using FAISS, Pinecone, Weaviate, Qdrant, and Chroma, ensuring efficient embedding storage, vectorization workflows, semantic search optimization, and retrieval performance for enterprise knowledge systems.
- Build and implement retrieval systems for RAG, semantic search, and enterprise knowledge retrieval by collaborating with AI engineering teams, enabling embedding generation pipelines, integrating LLM models, and optimizing data pipelines for production-grade AI infrastructure.
- Architect and manage multi-cloud data platforms with Azure as the primary cloud platform while supporting AWS and GCP integration, ensuring scalability, availability, reliability, and cost-efficient processing across distributed cloud-native environments.
- Develop robust orchestration and scheduling pipelines using Apache Airflow, Apache Spark, and equivalent distributed frameworks while implementing event-driven architectures, automation pipelines, and reusable data processing components.
- Ensure end-to-end data quality, monitoring, observability, and reliability by implementing logging, alerting, performance monitoring, and operational dashboards while applying DevOps, CI/CD pipelines, and version control best practices.
- Lead system design and architecture for large-scale distributed AI data platforms including model-serving pipelines, embedding pipelines, caching strategies, inference optimization, and integration of SQL, NoSQL, and vector database technologies.
Experience & skills
- Demonstrate 8+ years of progressive experience in data engineering, distributed systems, or AI/ML data infrastructure with hands-on expertise designing scalable data pipelines, cloud-native architectures, and enterprise-grade AI data platforms.
- Possess strong expertise in Python for data processing, APIs, automation, distributed workloads, and pipeline development alongside advanced SQL knowledge and experience working with NoSQL databases including MongoDB, DynamoDB, and Cosmos DB.
- Demonstrate hands-on experience working with vector databases such as FAISS, Pinecone, Weaviate, Qdrant, and Chroma, including vector indexing, embedding pipelines, similarity search, and semantic retrieval systems.
- Show strong understanding of AI infrastructure including vectorization, embeddings, similarity search techniques, LLM architectures, embedding models, and RAG pipeline concepts in enterprise production environments.
- Demonstrate cloud expertise with Azure as primary platform and familiarity with AWS and GCP, including experience deploying containerized workloads using Docker and working with Kubernetes orchestration environments.
- Possess experience with orchestration and big data technologies including Apache Airflow, Apache Spark, distributed processing frameworks, event-driven architecture, and modern data platform engineering.
- Demonstrate strong system design and architecture fundamentals including scalable distributed systems, DevOps practices, CI/CD pipelines, version control, observability, and reliability engineering best practices.
- Show experience building RAG pipelines in production, knowledge of graph databases or hybrid search systems, an understanding of model deployment, inference optimization, and caching strategies for LLM workloads, and familiarity with data governance, IAM, and cloud security patterns.