Rahul Dastidar is a seasoned Data Engineering Consultant with over 7 years of experience in cloud and distributed computing. He specializes in PySpark development, data pipelines, and Master Data Management (MDM) with Reltio. Rahul has built scalable pipelines on AWS EMR and Databricks, integrating enterprise data with Reltio's MDM platform. He is adept at API-driven integration and performance optimization of Spark workloads, ensuring enterprise-grade data governance and reliability.
Designed and implemented scalable PySpark pipelines on Databricks and AWS EMR.
Achieved up to 35% reduction in execution time and 20% cost savings through optimization.
Automated end-to-end CI/CD for PySpark pipelines using GitHub Actions and Jenkins.
Established data governance frameworks ensuring 99% data integrity.
Built and optimized PySpark pipelines handling 1B+ records with 40% faster processing.
Integrated enterprise datasets into Reltio MDM, enabling a unified Customer 360 view.
Improved data survivorship and match/merge accuracy by 25% through advanced rule configuration.
Key outcomes:
Designed and implemented PySpark pipelines for multi-terabyte datasets.
Optimized Spark jobs, reducing execution time by 35%.
Automated ingestion pipelines using AWS Lambda and Step Functions.
Key outcomes:
Led migration of customer master data into Reltio MDM.
Built PySpark ETL workflows for master data cleansing.
Ensured 99% data integrity through governance frameworks.
Key outcomes:
Architected big data pipelines on AWS EMR for real-time data ingestion.
Developed ETL jobs integrating customer data with MDM systems.
Tuned Spark jobs for 20% cost efficiency.
Key outcomes:
Designed data workflows with PySpark and Hive to improve reporting accuracy.
Configured ELK-based monitoring for Spark job failures.
Collaborated on data modeling and cleansing strategies.