
Virtusa
Data Architect
- Permanent
- Dubai, United Arab Emirates
- Experience 2 - 5 yrs
- Urgent
Job overview
Date posted
21/04/2025
Location
Dubai, United Arab Emirates
Experience
2 - 5 yrs
Seniority
Senior & Lead
Qualification
Bachelor's degree
Expiration date
05/06/2025
Job description
As a PySpark Data Architect, you will design, develop, and maintain high-performance ETL pipelines on the Cloudera Data Platform (CDP). The role involves ingesting and transforming large datasets, optimizing pipeline performance, ensuring data integrity, and supporting business analytics requirements. You will collaborate with cross-functional teams, implement orchestration using Apache Oozie or Airflow, and perform routine maintenance and documentation to support scalable data infrastructure and analytical workflows.
Key responsibilities
- Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform.
- Implement and manage data ingestion processes from various sources to the data lake or warehouse on CDP.
- Use PySpark to process, cleanse, and transform large datasets into analytical formats (a minimal sketch follows this list).
- Conduct performance tuning of PySpark code and Cloudera components.
- Implement data quality checks, monitoring, and validation routines.
- Automate data workflows using Apache Oozie, Airflow, or similar orchestration tools.
- Monitor pipeline performance and troubleshoot issues.
- Perform routine maintenance on the Cloudera Data Platform.
- Collaborate with data engineers, analysts, and product managers to support data-driven initiatives.
- Maintain documentation of data engineering processes, code, and configurations.
Experience & skills
- Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or a related field.
- 3+ years of experience as a Data Engineer with a focus on PySpark and Cloudera Data Platform.
- Advanced proficiency in PySpark, including RDDs, DataFrames, and optimization techniques.
- Strong experience with Cloudera CDP components such as Cloudera Manager, Hive, Impala, HDFS, and HBase.
- Knowledge of data warehousing concepts and SQL-based tools.
- Familiarity with Hadoop, Kafka, and distributed computing tools.
- Experience with orchestration tools such as Apache Oozie or Airflow (an illustrative Airflow sketch follows this list).
- Strong scripting skills in Linux.