
Virtusa
Data Architect
- Permanent
- Dubai, United Arab Emirates
- Experience 2 - 5 yrs
- Urgent
Job overview
Date posted
21/04/2025
Location
Dubai, United Arab Emirates
Experience
2 - 5 yrs
Seniority
Senior & Lead
Qualification
Bachelor's degree
Expiration date
05/06/2025
Job description
As a PySpark Data Architect, you will design, develop, and maintain high-performance ETL pipelines on the Cloudera Data Platform (CDP). The role involves ingesting and transforming large datasets, optimizing pipeline performance, ensuring data integrity, and supporting business analytics requirements. You will collaborate with cross-functional teams, implement orchestration using Apache Oozie or Airflow, and perform routine maintenance and documentation to support scalable data infrastructure and analytical workflows.
Key responsibilities
- Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform.
- Implement and manage data ingestion processes from various sources to the data lake or warehouse on CDP.
- Use PySpark to process, cleanse, and transform large datasets into analytical formats (a minimal sketch follows this list).
- Conduct performance tuning of PySpark code and Cloudera components.
- Implement data quality checks, monitoring, and validation routines.
- Automate data workflows using Apache Oozie, Airflow, or similar orchestration tools.
- Monitor pipeline performance and troubleshoot issues.
- Perform routine maintenance on the Cloudera Data Platform.
- Collaborate with data engineers, analysts, and product managers to support data-driven initiatives.
- Maintain documentation of data engineering processes, code, and configurations.
Experience & skills
- Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or a related field.
- 3+ years of experience as a Data Engineer with a focus on PySpark and Cloudera Data Platform.
- Advanced proficiency in PySpark, including RDDs, DataFrames, and optimization techniques.
- Strong experience with Cloudera CDP components such as Cloudera Manager, Hive, Impala, HDFS, and HBase.
- Knowledge of data warehousing concepts and SQL-based tools.
- Familiarity with Hadoop, Kafka, and distributed computing tools.
- Experience with orchestration tools such as Apache Oozie or Airflow (an illustrative Airflow sketch follows this list).
- Strong scripting skills in Linux.