
Data Engineer Big

Location:
Columbus, OH
Posted:
February 10, 2025


RAHUL

Sr. Data Engineer
Email: ******.******@*****.*** | PH: 513-***-****

Professional Summary

Highly experienced Data Engineer with 8+ years of expertise in developing end-to-end data pipelines, leveraging PySpark, Python, and AWS services, with a specialization in distributed systems architecture and parallel processing frameworks.

Proficient in crafting complex SQL queries and generating comprehensive reports and dashboards.

Proficient in implementing and deploying workloads on Azure VMs and in designing and deploying Enterprise Data Lakes for various use cases.

Extensive experience in the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Pig, Hive, Sqoop, Oozie, and other big data technologies.

In-depth knowledge of the Snowflake Database, Schema, and Table structures.

Skilled in cloud migration assessment using tools such as Azure Migrate and Cloudamize.

Proficient in creating data pipelines using Apache Airflow in GCP for ETL jobs and utilizing GCP Dataproc, Dataflow, GCS, Cloud Functions, and BigQuery.

Strong background in Airflow, automating daily data import tasks, and developing Spark applications in Databricks for data extraction, transformation, and aggregation.

Expertise in transferring on-premises ETLs to GCP using cloud-native tools such as Cloud Composer, BigQuery, Cloud Dataproc, and Google Cloud Storage.

Created and maintained reusable Databricks notebooks using SQL, Python, and PySpark to streamline the sourcing, transformation, and storage of data, focusing on simplification and operational efficiency.

Worked with various Spark technologies, including Spark Streaming, Core Spark API, and utilized Cloudera Manager for Hadoop cluster management.

Created insightful visualizations and reports using Power BI, enabling stakeholders to make data-driven decisions by providing real-time insights into key supply chain metrics and enterprise performance.

Developed Python scripts for parsing XML documents and loading data into databases.
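A minimal sketch of this pattern, assuming a hypothetical orders.xml layout and a local SQLite table; the file name, element names, and schema are illustrative, not taken from the original projects.

import sqlite3
import xml.etree.ElementTree as ET

# Parse the XML document and load selected fields into a relational table.
# File name, element names, and table schema are illustrative assumptions.
tree = ET.parse("orders.xml")
root = tree.getroot()

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)")
rows = [
    (order.findtext("id"), order.findtext("customer"), float(order.findtext("amount", "0")))
    for order in root.findall("order")
]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()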

Proficient in scripting technologies like Python and UNIX shell scripts.

Extensive knowledge of Snowflake Clone and Time Travel features.

Utilized Hive, SparkSQL, and PySpark for data loading and transformation.

In-depth interest in exploring the latest technologies added to the Google Cloud Platform (GCP).

Proficient in working with MapReduce programs in Apache Hadoop for Big Data processing.

Strong knowledge of Azure Active Directory, integrating Azure AD with Windows-based AD, and integrating applications with Azure AD.

Developed a Python Kafka consumer API for data ingestion from Kafka topics.

Worked with various IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.

Proficient in handling various data sources, including flat files, XML, JSON, CSV, Avro, Parquet files, and databases.

Experienced in managing Azure services and subscriptions through Azure portals and Azure Resource Manager.

Skilled in database design, entity relationships, database analysis, SQL programming, PL/SQL stored procedures, packages, and triggers in Oracle.

Proficient in Scrum/Agile framework and Waterfall project management methodologies.

Technical Skills:

Hadoop Ecosystem: HDFS, YARN, Pig, MapReduce, Hive, Sqoop, Spark, ZooKeeper, Oozie, Kafka.

Programming Languages: Python, PySpark, Java, Scala, Shell Scripting

Big Data Platforms: Hortonworks, Cloudera

Public Cloud: AWS, GCP, Azure (Databricks & Data Lake)

Databases: Netezza, MySQL, UDB, HBase, MongoDB, Cassandra, Snowflake

Data Visualization: Tableau, BO Reports, Splunk, Microsoft SQL Server, Power BI.

Professional Experience

Sr. Data Engineer

US Bank, Cincinnati, Ohio June 2023 to Present

Responsibilities:

Developed Spark applications in Python to handle data from various RDBMS and streaming sources.

Created scalable distributed data solutions with Hadoop.

Integrated Spring Circuit breaker pattern and Hystrix dashboard for monitoring Spring microservices.

Built data pipelines in GCP using Apache Airflow for ETL jobs with various airflow operators.
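As a rough illustration of the pattern rather than the actual pipeline, an Airflow DAG along these lines chains an extract task to a load task; the DAG id, schedule, and task callables are hypothetical, and only PythonOperator is shown here.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Placeholder extract step; in practice this would pull from an RDBMS or API.
    return "raw_records"

def load(**context):
    # Placeholder load step; in practice this would write to BigQuery or GCS.
    pass

# DAG id, schedule, and task names are illustrative assumptions.
with DAG(
    dag_id="example_etl_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task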

Migrated on-premises applications to AWS using various services.

Maintained a Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.

Developed Big Data pipelines for new subject areas using SQL, Hive, UNIX, and other Hadoop tools.

Configured Azure load balancers and worked with Hive for data analysis.

Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.

Utilized the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.

Implemented event-based triggering, cloud monitoring, and alerting in the GCP environment.

Loaded data into BigQuery using Python and GCP Cloud Functions.
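A hedged sketch of that loading step using the google-cloud-bigquery client as it might be invoked from a Cloud Storage-triggered Cloud Function; the bucket, dataset, and table names are placeholders.

from google.cloud import bigquery

def load_to_bigquery(event, context):
    # Cloud Storage-triggered background function; the event carries the object metadata.
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    # Dataset and table names below are illustrative placeholders.
    load_job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
    load_job.result()  # Wait for the load job to complete.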

Created a NiFi dataflow for data ingestion from Kafka, transformation, storage in HDFS, and running Spark Streaming jobs.

Worked with Spark RDD, Data Frame API, Data Set API, Data Source API, Spark SQL, and Spark Streaming.

Developed a Python Kafka consumer API for consuming data from Kafka topics.
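A minimal sketch of such a consumer using the kafka-python client; the topic name, broker address, and group id are assumptions for illustration.

import json
from kafka import KafkaConsumer

# Topic, brokers, and group id are illustrative placeholders.
consumer = KafkaConsumer(
    "events-topic",
    bootstrap_servers=["localhost:9092"],
    group_id="ingestion-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    record = message.value  # deserialized JSON payload
    # Downstream ingestion (e.g. write to HDFS or a database) would go here.
    print(message.topic, message.partition, message.offset, record)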

Configured Azure blob storage and Azure file servers.

Created pre-processing jobs to flatten JSON documents into a flat file using Spark Data Frames.
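A rough PySpark sketch of that flattening step; the input path, nested field names, and output location are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Input path and field names are illustrative assumptions.
raw = spark.read.json("s3://example-bucket/raw/events/")
flat = raw.select(
    col("id"),
    col("user.name").alias("user_name"),
    col("user.address.city").alias("user_city"),
    col("event.type").alias("event_type"),
)
# Write the flattened records out as a delimited flat file.
flat.write.mode("overwrite").option("header", True).csv("s3://example-bucket/flattened/events/")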

Designed a GCP Cloud Composer DAG to load data from on-premises CSV files into BigQuery tables.

Developed and deployed end-to-end data pipelines on Azure Data Lake (Gen 2), optimizing data ingestion, processing, and storage for both operational and analytical use cases in the supply chain domain.

Collaborated with Spark to enhance performance and optimize Hadoop algorithms.

Set up Snowpipe to pull data from Google Cloud Storage buckets into Snowflake tables.
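For illustration only, a Snowpipe definition of this kind can be issued through the Snowflake Python connector; the connection parameters, stage, pipe, and table names are assumptions, and a storage integration plus notification integration over the GCS bucket are assumed to exist already.

import snowflake.connector

# Connection parameters and object names below are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
# Assumes an external stage over the GCS bucket and a notification integration already exist.
conn.cursor().execute("""
    CREATE PIPE IF NOT EXISTS raw.orders_pipe
      AUTO_INGEST = TRUE
      INTEGRATION = 'GCS_NOTIFICATION_INT'
      AS
      COPY INTO raw.orders
      FROM @raw.gcs_orders_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")
conn.close()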

De-normalized data for Netezza transformation and loaded it into NoSQL databases and MySQL.

Evaluated competitive vendors in the Big Data Hadoop enterprise solution.

Designed and developed Microservices using Spring Boot.

Created ETL programs in Netezza to load data into the data warehouse.

Supported existing GCP Data Management Implementation.

Managed physical and logical data structures and metadata.

Implemented Microservices-based Cloud Architecture on AWS Platform.

Demonstrated knowledge of Cassandra architecture, replication strategy, gossip, snitches, and more.

Utilized Hive QL to work with partitioned and bucketed data and ran Hive queries on Parquet tables.
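An illustrative sketch of that pattern, expressed here through Spark SQL with Hive support to stay within the resume's Python/PySpark toolchain; the table, partition column, and bucketing scheme are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioned").enableHiveSupport().getOrCreate()

# Table name, partition column, and bucketing scheme are illustrative assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_parquet (
        order_id STRING,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS PARQUET
""")

# Query a single partition so only that slice of the Parquet data is scanned.
daily_totals = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_parquet
    WHERE order_date = '2024-01-01'
    GROUP BY customer_id
""")
daily_totals.show()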

Used Apache Kafka to collect web log data and make it available for downstream analysis.

Planned and executed Azure Storage migration, including Blob Storage and Table Storage.

Contributed to Kafka security implementation.

Environment: Spark, Spark Streaming, Spark SQL, GCP, AWS, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Unix, Linux, MySQL, Jenkins, Eclipse, Azure, Oracle, Git, Oozie, Tableau, SOAP, Agile Methodologies.

Sr. Data Engineer

General Motors, Detroit, Michigan December 2021 to May 2023

Responsibilities:

Installed and configured a multi-node cloud cluster on Amazon Web Services (AWS) using EC2.

Managed AWS Management Tools like CloudWatch and CloudTrail for log file storage in AWS S3.

Developed Spark applications extensively using Spark Data Frames and Spark SQL API.

Integrated Spring Boot microservices for processing messages within a Kafka cluster.

Designed and implemented GCP-driven data solutions for enterprise data warehouse and data lakes.

Created data pipelines in GCP using Apache Airflow with various operators.

Developed real-time data movement using Spark Structured Streaming and Kafka.
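A hedged sketch of a Structured Streaming job reading from Kafka and landing data on object storage; the broker address, topic, and sink paths are placeholders, and the job assumes the spark-sql-kafka connector package is on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Broker, topic, and sink paths are illustrative assumptions;
# requires the spark-sql-kafka-0-10 connector on the classpath.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events-topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string for downstream parsing.
events = stream.select(col("value").cast("string").alias("payload"))

query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-bucket/streaming/events/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()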

Automated real-time Tableau refresh with AWS Lambda functions.

Created a Java interface for automated EMS component generation.

Migrated REST APIs from AWS (Lambda, API Gateway) to a Microservices architecture using Docker and Kubernetes (GCP GKE).

Utilized Databricks to build scalable data pipelines, process large datasets with Apache Spark, and integrate Delta Lake for efficient data storage.

Stored data files in Google Cloud Storage buckets, utilizing Dataproc and BigQuery for GCP cloud-based solutions.

Facilitated data for Tableau dashboards based on Hive tables for business users.

Integrated on-premises DC with Azure DC.

Deployed Spark jobs on GCP DataProc clusters.

Worked with Cloudera Hadoop and the AWS stack for Hadoop.

Knowledge of Spark APIs, AWS Glue, and AWS CI/CD pipelines.

Utilized GitLab CI/CD for deployment, with exposure to Jenkins and Terraform.

Developed applications in Core Java 8, Spring Boot, Spring, Hibernate, and Kubernetes.

Created an AWS CloudFormation script for environment creation.

Used Hive scripts to create tables, load data, and analyze it.

Implemented PowerShell scripting for Azure service management.

Transferred data from AWS S3 to AWS Redshift using EMR, Glue, and Spark for analysis.
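One hedged way to express that hop with Spark (reading from S3 and writing to Redshift over JDBC); the bucket, cluster endpoint, table, and credentials are placeholders, and in practice the Glue/spark-redshift connector or a Redshift COPY from S3 is often used instead.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-redshift").getOrCreate()

# Source path, JDBC endpoint, table, and credentials are illustrative placeholders.
df = spark.read.parquet("s3://example-bucket/curated/orders/")

(
    df.write.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "...")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save()
)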

Worked on Dimensional and Relational Data Modelling with Star and Snowflake Schemas.

Utilized Apache Airflow to monitor multi-stage ML workflows with Amazon SageMaker tasks.

Created predictive analytics reports with Python and Tableau.

Environment: Python, PySpark, Spring Boot, Spark SQL, Flask, HDFS, Hive, GitHub, Oozie, Scala, HQL, Jenkins, SQL, AWS Cloud, Azure, GCP, S3, EC2.

Data Engineer

DaVita HealthCare, Denver, Colorado October 2019 to November 2021

Responsibilities:

Gathered data and business requirements, designing data migration solutions from Data Warehouse to Atlas Data Lake (Big Data).

Developed Azure ARM templates and deployed using VSTS for Azure infrastructure provisioning.

Analyzed massive data volumes and validated dataflow using SQL and HIVE scripts.

Applied Java design patterns such as Factory, MVC, and Singleton.

Built ETL pipelines and processed data using Big Data technologies.

Configured Spring Boot microservices and Docker containers on Amazon EC2.

Automated dynamic scans for Java and .NET applications using IBM AppScan.

Utilized AWS Step Functions to automate and orchestrate Amazon SageMaker tasks.

Integrated Apache Airflow with AWS for monitoring multi-stage ML workflows.

Developed PL/SQL statements, stored procedures, functions, triggers, and views.

Utilized Tableau, Power BI, and Cognos for data validation reports.

Created Azure Backup and recovery procedures.

Implemented PowerShell scripts in Azure Automation for issue resolution.

Performed statistical analysis using SQL, Python, R Programming, and Excel.

Worked with SQL, HIVE, and PIG to import, clean, filter, and analyze data.

Extracted, transformed, and loaded data from transaction systems using Python and SAS.

Installed, configured, and administered Azure IaaS resources and Azure AD.

Analyzed and recommended changes to improve data consistency and efficiency.

Designed and developed data mapping procedures for ETL.

Managed multiple Azure subscriptions using PowerShell and Azure Portal.

Implemented Pester for validating Azure Resources.

Created Azure Blob Storage and Azure File Servers.

Transferred data from AWS S3 to AWS Redshift, MongoDB, and SQL Server (T-SQL).

Built CI/CD pipelines for testing and production environments using Terraform.

Developed applications using Core Java 8, Spring Boot, Spring, Hibernate, Web Services, Kubernetes, Swagger, and Docker.

Created AWS CloudFormation scripts.

Used Hive scripts for table creation, data loading, and analysis.

Leveraged Spring Boot for cloud Microservices development.

Created predictive analytics reports with Python and Tableau.

Environment: AWS, Python, PySpark, Spring Boot, Tableau, R Programming, Pig, SQL, NumPy, Azure, Linux, HDFS, JSON, ETL, Snowflake, Power BI, Hive, Sqoop, Hub.

Data Engineer

Southwest Airlines, Dallas, Texas April 2018 to September 2019

Responsibilities:

Led requirements analysis, application design, coding, testing, maintenance, and support.

Developed stored procedures, functions, triggers, packages, and SQL scripts.

Introduced the Big Data Hadoop platform as a comprehensive solution.

Developed Enterprise JavaBeans (EJB), including session beans and entity beans.

Created complex SQL queries using views, subqueries, and correlated subqueries.

Performed AWS service architecture and implementation assessments.

Developed Oozie workflows to automate data loading and preprocessing.

Created Pig UDFs for data pre-processing.

Designed Hive tables, loaded data, and wrote Hive queries for map-reduce processing.

Worked on Zookeeper Cluster Coordination Services.

Built Oozie workflows and automated processes in the Cloudera environment.

Developed Python Kafka consumer API for data ingestion.

Exported analyzed data to relational databases using Sqoop.

Migrated SQL scripts from Redshift and Athena.

Created CI/CD pipelines for testing and production environments using Terraform.

Proficient with Docker, Kubernetes, and Terraform.

Contributed to the PySpark API for large dataset processing.

Analyzed data using Hive queries and presented it using Tableau dashboards.

Created scripts for automated Hive, Spark SQL, Pig, and Sqoop job scheduling.

Managed Tableau dashboards and reports.

Created database constraints, indexes, views, stored procedures, and triggers.

Collaborated with developers to fine-tune queries, run scripts, and migrate databases.

Developed shell scripts for SQL script invocation and scheduling.

Developed unit test cases and automated regression scripts using Python.

Environment: Hadoop, Python, PySpark, HDFS, Java, MapReduce, Hive, Sqoop, Spark SQL, HQL, Oozie, Git, Oracle, Pig, Cloudera, Agile, T-SQL, AWS Redshift.

Hadoop SQL Developer

Brio Technologies Private Limited, Hyderabad, India June 2016 to December 2017

Responsibilities:

Installed and configured SQL Server 2005, working on the development and optimization of a new Loans database.

Built scalable distributed data solutions using Hadoop.

Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.

Developed and designed ETL jobs to load data from multiple source systems to Teradata Database.

Worked on data extraction, aggregation, and analysis in HDFS using PySpark and stored the data in Hive.

Created and deployed SSIS 2005 packages and reports using SSRS 2005.

Worked extensively on enrichment/ETL in real-time stream jobs using Spark Streaming, Spark SQL, and loaded data into HBase.

Developed and tested Spark code using Scala for Spark Streaming/Spark SQL for faster data processing.

Developed Microservices (REST APIs) using Java and Spring Boot to support Citi NGA cloud framework and deployed the Microservices on Pivotal Cloud Foundry.

Developed frontend and backend modules using Python on Django Web Framework.

Configured database maintenance plans for backup and database optimization.

Created users, user groups, and access permissions.

Wrote and executed various MySQL database queries from Python using the MySQL Connector and MySQLdb packages.
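A short sketch of that access pattern with mysql-connector-python; the connection details, table, and query are illustrative.

import mysql.connector

# Connection parameters and the query below are illustrative placeholders.
conn = mysql.connector.connect(
    host="localhost",
    user="app_user",
    password="...",
    database="loans",
)
cursor = conn.cursor()
cursor.execute(
    "SELECT loan_id, balance FROM loans WHERE status = %s AND balance > %s",
    ("ACTIVE", 10000),
)
for loan_id, balance in cursor.fetchall():
    print(loan_id, balance)
cursor.close()
conn.close()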

Used performance monitor, SQL profiler, and DBCC to tune performance.

Automated the ETL process from different data sources to SQL Server using SQL Server Agent.

Created constraints, indexes, views, stored procedures, and triggers.

Collaborated with developers to fine-tune queries, run scripts, and migrate databases.

Developed simple to complex Map/Reduce Jobs using Hive and Pig.

Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.

Created backup and recovery procedures for databases on production, test, and development servers and assisted in the resolution of production issues.

Environment: Hadoop, Hive, MapReduce, Windows 2000 Advanced Server/Server 2003, MS SQL Server 2005 and 2000, T-SQL, ETL, SQL

Education Details:

ICFAI Foundation for Higher Education, Computer Science Engineering

July 2012 – June 2016

Hyderabad, Telangana, India


