
G42
Lead Data Engineer – AI/LLM Data Platforms & RAG Systems
- Permanent
- Abu Dhabi, United Arab Emirates
- Experience: 5-10 yrs
- Urgent
Job expiry date: 16/05/2026
Job overview
Date posted
01/04/2026
Location
Abu Dhabi, United Arab Emirates
Salary
AED 30,000 - 40,000 per month
Job description
Inception, a G42 company and regional innovator of AI-powered domain-specific and industry-agnostic products built on research and development, is seeking a Lead Data Engineer to architect and build scalable, cloud-native data and AI pipelines supporting enterprise LLM, RAG, and retrieval systems.

The role involves designing scalable data pipelines for AI and LLM workloads, including vectorization and embedding processing; developing ETL/ELT workflows for structured, unstructured, and streaming data; and creating vector database indexing and similarity search pipelines using FAISS, Pinecone, Weaviate, Qdrant, and Chroma. The Lead Data Engineer will build retrieval systems supporting RAG, semantic search, and enterprise knowledge retrieval, while developing reusable orchestration pipelines using Apache Airflow, Apache Spark, and distributed processing frameworks.

The position requires architecting multi-cloud data pipelines with Azure as the primary cloud platform and with AWS and GCP integration, and optimizing storage and processing across SQL, NoSQL, and vector databases, including MongoDB, DynamoDB, and Cosmos DB. The role also includes contributing to event-driven architecture design, collaborating with AI teams to enable embedding generation and LLM integration, ensuring data quality, monitoring, reliability, and observability, and leading system design for large-scale distributed data platforms. Additional responsibilities include integrating embedding generation pipelines via Hugging Face, OpenAI, or other model providers; implementing containerized deployments using Docker; working in Kubernetes environments; and applying DevOps practices, CI/CD pipelines, version control, and observability best practices.
The role also requires experience building RAG pipelines in production, knowledge of graph databases and hybrid search systems, an understanding of model deployment, inference optimization, and caching techniques for LLM workloads, and the ability to apply data governance, IAM, and security patterns across cloud ecosystems to support enterprise-grade AI infrastructure.
Key responsibilities
- Design, architect, and optimize scalable cloud-native data pipelines for AI and LLM workloads including vectorization, embedding processing, and large-scale data ingestion while ensuring compatibility with enterprise retrieval systems, semantic search platforms, and RAG pipelines across distributed data infrastructure environments.
- Develop and maintain ETL/ELT workflows for structured, unstructured, and streaming data by implementing scalable data ingestion, transformation, and processing frameworks using Python, SQL, Apache Spark, and distributed processing tools while ensuring data quality, performance optimization, and reliability.
- Create and manage vector database indexing and similarity search pipelines using FAISS, Pinecone, Weaviate, Qdrant, and Chroma, ensuring efficient embedding storage, vectorization workflows, semantic search optimization, and retrieval performance for enterprise knowledge systems.
- Build and implement retrieval systems for RAG, semantic search, and enterprise knowledge retrieval by collaborating with AI engineering teams, enabling embedding generation pipelines, integrating LLM models, and optimizing data pipelines for production-grade AI infrastructure.
- Architect and manage multi-cloud data platforms with Azure as the primary cloud platform while supporting AWS and GCP integration, ensuring scalability, availability, reliability, and cost-efficient processing across distributed cloud-native environments.
- Develop robust orchestration and scheduling pipelines using Apache Airflow, Apache Spark, and equivalent distributed frameworks while implementing event-driven architectures, automation pipelines, and reusable data processing components.
- Ensure end-to-end data quality, monitoring, observability, and reliability by implementing logging, alerting, performance monitoring, and operational dashboards while applying DevOps, CI/CD pipelines, and version control best practices.
- Lead system design and architecture for large-scale distributed AI data platforms including model-serving pipelines, embedding pipelines, caching strategies, inference optimization, and integration of SQL, NoSQL, and vector database technologies.
Experience & skills
- Demonstrate 8+ years of progressive experience in data engineering, distributed systems, or AI/ML data infrastructure with hands-on expertise designing scalable data pipelines, cloud-native architectures, and enterprise-grade AI data platforms.
- Possess strong expertise in Python for data processing, APIs, automation, distributed workloads, and pipeline development alongside advanced SQL knowledge and experience working with NoSQL databases including MongoDB, DynamoDB, and Cosmos DB.
- Demonstrate hands-on experience working with vector databases such as FAISS, Pinecone, Weaviate, Qdrant, and Chroma, including vector indexing, embedding pipelines, similarity search, and semantic retrieval systems.
- Show strong understanding of AI infrastructure including vectorization, embeddings, similarity search techniques, LLM architectures, embedding models, and RAG pipeline concepts in enterprise production environments.
- Demonstrate cloud expertise with Azure as primary platform and familiarity with AWS and GCP, including experience deploying containerized workloads using Docker and working with Kubernetes orchestration environments.
- Possess experience with orchestration and big data technologies including Apache Airflow, Apache Spark, distributed processing frameworks, event-driven architecture, and modern data platform engineering.
- Demonstrate strong system design and architecture fundamentals including scalable distributed systems, DevOps practices, CI/CD pipelines, version control, observability, and reliability engineering best practices.
- Show experience building RAG pipelines in production, knowledge of graph databases or hybrid search systems, an understanding of model deployment, inference optimization, and caching strategies for LLM workloads, and familiarity with data governance, IAM, and cloud security patterns.