Lead Data Engineer – AI/LLM Data Platforms & RAG Systems
by G42 in Artificial Intelligence
Inception, a G42 company and a regional innovator in AI-powered products, both domain-specific and industry-agnostic, built on research and development, is seeking a Lead Data Engineer to architect and build scalable, cloud-native data and AI pipelines supporting enterprise LLM, RAG, and retrieval systems.
The role involves designing scalable data pipelines for AI and LLM workloads, including vectorization and embedding processing; developing ETL/ELT workflows for structured, unstructured, and streaming data; and creating vector database indexing and similarity search pipelines using FAISS, Pinecone, Weaviate, Qdrant, and Chroma. The Lead Data Engineer will build retrieval systems supporting RAG, semantic search, and enterprise knowledge retrieval, and will develop reusable orchestration pipelines using Apache Airflow, Apache Spark, and distributed processing frameworks.
The position requires architecting multi-cloud data pipelines with Azure as the primary cloud platform and integration with AWS and GCP, and optimizing storage and processing across SQL, NoSQL, and vector databases, including MongoDB, DynamoDB, and Cosmos DB. The role also includes contributing to event-driven architecture design; collaborating with AI teams to enable embedding generation and LLM integration; ensuring data quality, monitoring, reliability, and observability; and leading system design for large-scale distributed data platforms.
Additional responsibilities include integrating embedding generation pipelines via Hugging Face, OpenAI, or other model providers; implementing containerized deployments using Docker; working with Kubernetes environments; and applying DevOps practices, CI/CD pipelines, version control, and observability best practices.
The role also requires experience building RAG pipelines in production; knowledge of graph databases and hybrid search systems; an understanding of model deployment, inference optimization, and caching techniques for LLM workloads; and the ability to apply data governance, IAM, and security patterns across cloud ecosystems to support enterprise-grade AI infrastructure.