VASUDEVARAO G
Data Engineer
Email: **********@*****.***
Contact: 380-***-****
Professional Summary:
Overall 6+ years of experience as a Data Engineer with cloud platforms like AWS, Azure, GCP, Snowflake, and Databricks. Experienced in big data architecture and ecosystem components, using Spark for data cleansing, data analysis, structured and unstructured data transformations, data modeling, data warehousing, and data visualization. Proficient in deploying data pipelines, data products, and solutions on cloud technologies using PySpark, Spark SQL, Python, SQL, Airflow, Kafka, SSIS, Sqoop, Oozie, Hive, Tableau, and Power BI.
Experienced in building robust Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, transforming the data to uncover business insights.
Developed PySpark scripts in Spark Streaming to process high-volume data from sources such as Kafka, S3, and Kinesis, and constructed ETL pipelines to feed databases and dashboards.
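A minimal sketch of that streaming pattern in PySpark Structured Streaming; the broker, topic, schema, and S3 paths are illustrative placeholders, and the job assumes the spark-sql-kafka connector is on the classpath:

# Hypothetical Kafka -> transform -> S3 (Parquet) streaming job.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-etl").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "transactions")               # placeholder topic
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*")
          .filter(col("amount") > 0))  # simple cleansing rule

(parsed.writeStream
 .format("parquet")
 .option("path", "s3a://example-bucket/clean/transactions/")             # placeholder path
 .option("checkpointLocation", "s3a://example-bucket/chk/transactions/")
 .outputMode("append")
 .start()
 .awaitTermination())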
Developed robust ETL pipelines in Azure Data Factory to move data from on-premises systems to Azure SQL Data Warehouse.
Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS; expertise in using Spark SQL with various data sources like JSON, Parquet, and Hive.
Experienced in optimizing Spark jobs using techniques like persist/cache, broadcast variables, and efficient join strategies.
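For illustration, a small PySpark snippet showing two of those techniques, caching a reused DataFrame and broadcasting a small lookup table; the paths and column names are made up:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-tuning-demo").getOrCreate()

facts = spark.read.parquet("s3a://example-bucket/facts/")  # large table (placeholder)
dims = spark.read.parquet("s3a://example-bucket/dims/")    # small lookup (placeholder)

facts.cache()  # reused twice below, so keep it in memory after the first action

# Broadcasting the small side avoids shuffling the large table.
enriched = facts.join(broadcast(dims), "dim_id")

enriched.groupBy("event_date").count().show()  # first action materializes the cache
facts.groupBy("dim_id").count().show()         # second action reuses the cached data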
Well versed with Hadoop distribution and ecosystem components like HDFS, YARN, MapReduce, Spark, Sqoop, Hive, and Kafka.
Expertise in dealing with various big data tools such as Hive: building tables, partitioning and bucketing data, and developing and optimizing HiveQL queries.
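As one illustration of that pattern, a partitioned and bucketed Hive table created through Spark SQL; the table and column names are invented, and a Hive metastore is assumed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl-demo").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id STRING,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)           -- prunes scans to the dates queried
    CLUSTERED BY (customer_id) INTO 32 BUCKETS   -- speeds up joins on customer_id
    STORED AS ORC
""")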
Extensively used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data, with expertise in using Spark SQL with diverse data sources like JSON, Parquet, and Avro.
Experienced in optimizing virtual warehouses and SQL queries for cost in Snowflake.
Experienced in creating and scheduling Hadoop ETL workflows using Apache Oozie.
Proficient in building various Airflow DAGs to execute the end-to-end tasks of an ETL job.
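A bare-bones sketch of such a DAG (Airflow 2.x); the task bodies and schedule are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull from source")      # placeholder task logic

def transform():
    print("apply business rules")  # placeholder task logic

def load():
    print("write to warehouse")    # placeholder task logic

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # "schedule_interval" on Airflow < 2.4
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3  # run extract, then transform, then load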
Extensive experience in working with Python packages like Pandas and NumPy for data wrangling and numerical computations.
Experienced in working with AWS EMR, selecting cluster configurations and EC2 instance types based on workload requirements.
Proficient in working with different file formats like ORC, Parquet, Avro, JSON, and flat files.
Experience in modeling data using Star schema, Snowflake schema, transactional modeling.
Experience in working with and integrating NoSQL databases: DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
Expertise in AWS services including S3, EC2, SNS, SQS, RDS, EMR, Kinesis, Lambda, Step Functions, Glue, Redshift, DynamoDB, Elasticsearch, Service Catalog, CloudWatch and IAM.
Expertise in Azure services including Blob Storage, Virtual Machines, Azure Storage Queues, Azure Database, Azure Data Lake Analytics, Azure Event Hubs, Azure Functions Serverless Compute, Azure Data Factory, Azure Synapse Analytics, Azure Cosmos DB, Azure Monitor, Azure Active Directory.
Experience with Azure Stream Analytics for real-time data processing and analysis, and Azure Data Lake Storage for large-scale data storage and processing. Implemented various Azure resources using the Azure portal and PowerShell with Azure Resource Manager deployment models.
Expert in designing custom reports using data extraction and reporting tools like Tableau and Power BI, and in developing algorithms based on business cases.
Excellent knowledge of the Software Development Life Cycle (SDLC), with a thorough understanding of its phases: requirements, analysis/design, development, testing, and deployment.
Experienced with version control systems like Git, GitHub, GitLab, SVN, Bamboo, and Bitbucket to keep code versions and configurations organized. Responsible for defect reporting and tracking in JIRA.
Technical Skills:
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, PySpark, Zookeeper, Cloudera Manager, Kafka, Flume.
Cloud Services (AWS): EMR, EC2, S3, Redshift, Athena, Lambda Functions, Step Functions, DynamoDB, CloudWatch, CloudTrail, SNS, SQS, Kinesis, QuickSight.
Cloud Services (Azure): HDInsight, Databricks (ADBX), Data Lake (ADLS), Cosmos DB, DevOps, Blob Storage, Data Factory, Azure Functions, Azure SQL Data Warehouse, Azure SQL Database, Azure Active Directory (Azure AD), Azure Monitor, Azure Stream Analytics, Azure Event Hub.
NoSQL Databases: HBase, Cassandra, Redis, Cosmos DB, DynamoDB, MongoDB.
Databases: Oracle, MySQL, Teradata, PostgreSQL, DB2.
Hadoop Distributions: Hortonworks, Cloudera.
ETL Tools: Informatica, Vertica, Pentaho, DataStage, SSIS.
Programming & Scripting: Python, Scala, PySpark, R, SQL, PowerShell, Shell Scripting.
IDEs: PyCharm, Visual Studio Code, SSMS, Data Studio, IntelliJ.
Monitoring and Reporting: Tableau, Power BI.
Version Control & CI/CD: Git, GitHub, GitLab, Bamboo, Jenkins, Maven, SVN.
Operating Systems: Linux, Unix, macOS, Windows.
Professional Experience:
JP Morgan Chase, Columbus, Ohio
Sr. Data Engineer
Feb 2024 - Present
Responsibilities:
Implemented scalable data pipelines in AWS using EMR, EC2, and Glue to process multi-terabyte data sets, achieving significant reductions in processing time.
Designed and developed data transformations using PySpark on AWS Databricks, enhancing the analytics capabilities for financial datasets.
Created a real-time data ingestion system with AWS Lambda and S3, which facilitated efficient data storage and processing.
Developed Python scripts for automated data quality checks, integrating with AWS services to ensure compliance and data integrity.
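One possible shape of such a check, reading a staged CSV from S3 with pandas and alerting through SNS on failure; the bucket, key, column names, and topic ARN are hypothetical:

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
sns = boto3.client("sns")

obj = s3.get_object(Bucket="example-bucket", Key="staging/customers.csv")  # placeholder
df = pd.read_csv(io.BytesIO(obj["Body"].read()))

problems = []
if df["customer_id"].isnull().any():             # required key must be populated
    problems.append("null customer_id values")
if df.duplicated(subset=["customer_id"]).any():  # primary key must be unique
    problems.append("duplicate customer_id values")

if problems:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:dq-alerts",  # placeholder ARN
        Subject="Data quality check failed",
        Message="; ".join(problems),
    )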
Designed data lake solutions on AWS to support compliance and reporting requirements, using PySpark and Scala for data aggregation.
Developed data visualization tools using Scala and AWS QuickSight to provide actionable insights into credit risk management.
Developed an automated monitoring system using AWS CloudWatch and Lambda, which proactively manages and scales data processing resources based on real-time analytics workload, ensuring optimal performance and cost efficiency.
Designed and executed complex SQL queries on AWS Redshift to perform data analysis and reporting, which supported strategic decision-making by providing deep insights into customer behaviors and market trends.
Developed an automated testing framework using Python to validate data integrity and accuracy across multiple data pipelines, enhancing the reliability of data transformations and load processes.
Sam’s Club, Bedford, Ohio
Data Engineer
May 2022 - Aug 2023
Responsibilities:
Experience in working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Synapse, SQL DB, SQL DWH, and Data Storage Explorer).
Involved in building an Enterprise Data Lake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
Used Azure Data Factory with the SQL API and Mongo API to integrate data from MongoDB, MS SQL, and cloud sources (Blob Storage, Azure SQL DB).
Developed PySpark scripts for mining data and performed transformations on large datasets to provide real-time insights and reports.
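A representative (made-up) transformation in that style, aggregating events per customer per day with a running total via a window function; the ADLS path and columns are illustrative:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, sum as sum_, to_date

spark = SparkSession.builder.appName("insights-demo").getOrCreate()

# Placeholder ADLS Gen2 path and schema (customer_id, event_time, amount).
events = spark.read.parquet("abfss://lake@account.dfs.core.windows.net/events/")

daily = (events
         .withColumn("day", to_date(col("event_time")))
         .groupBy("customer_id", "day")
         .agg(sum_("amount").alias("daily_amount")))

w = (Window.partitionBy("customer_id")
     .orderBy("day")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

report = daily.withColumn("running_total", sum_("daily_amount").over(w))
report.write.mode("overwrite").parquet("abfss://lake@account.dfs.core.windows.net/reports/")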
Supported the analytical platform, handled data quality, and improved performance using Python's higher-order functions, lambda expressions, pattern matching, and collections.
Performed data cleansing and applied transformations using Databricks and Spark data analysis. Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
Designed and automated custom-built input adapters using Spark, Sqoop, and Airflow to ingest and analyze data from RDBMS sources into Azure Data Lake.
Reduced access time by refactoring data models, optimizing queries, and implementing a Redis cache to support Snowflake.
Developed automated workflows for daily incremental loads, moving data from RDBMS sources to the Data Lake.
Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from MS SQL to Cosmos DB and improved query performance.
Created Automated ETL jobs in Talend and pushed the data to the Snowflake data warehouse.
Managed resources and scheduling across the cluster using Azure Kubernetes Service.
Used Azure DevOps for CI/CD, debugging, and monitoring jobs and applications. Used Azure Active Directory and Ranger for security.
Worked with the data science team on preprocessing and feature engineering, and supported machine learning algorithms running in production.
Fine-tuned Spark NLP application parameters such as batch interval time, level of parallelism, and memory allocation to improve processing time and efficiency.
Facilitated data for interactive Power BI dashboards and reporting purposes.
Nationwide, Columbus, Ohio
AWS Data Engineer
June 2018 – April 2022
Responsibilities:
Designed and implemented a multi-tier data architecture on AWS, leveraging S3, Redshift, and RDS for high-volume data analytics.
Created ETL frameworks using Python, integrated with AWS Lambda for automated data handling from various data sources.
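A skeletal example of that Lambda-triggered pattern; the filter rule and destination bucket are invented for illustration:

# Hypothetical S3-triggered AWS Lambda handler: light transform, then re-land the file.
import csv
import io

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # S3 put events carry the bucket and key under Records[].s3.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [r for r in csv.DictReader(io.StringIO(body)) if r.get("status") == "active"]
    if not rows:
        return {"rows_kept": 0}

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

    s3.put_object(Bucket="example-curated-bucket",  # placeholder destination
                  Key=f"curated/{key}", Body=out.getvalue())
    return {"rows_kept": len(rows)}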
Developed advanced analytics models on AWS EMR using Spark, providing insights into transportation patterns and customer behavior.
Used AWS Databricks for data aggregation, improving quality, and ML preparation.
Designed real-time data ingestion systems with AWS Kinesis and Lambda, optimizing data flows for immediate analysis.
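For reference, a minimal Lambda consumer for such a Kinesis stream (attached via an event-source mapping); the payload fields are illustrative:

import base64
import json

def lambda_handler(event, context):
    latest = {}
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded under record["kinesis"]["data"].
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        latest[payload["ride_id"]] = payload["eta_seconds"]  # hypothetical fields
    # A downstream write (e.g., to DynamoDB) would go here.
    return {"processed": len(event["Records"])}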
Developed a Scala-based real-time recommendation engine, leveraging AWS technologies to optimize ridesharing matches.
Implemented a data reconciliation framework using AWS Glue and Python, which ensured data accuracy and consistency across different storage platforms, significantly reducing discrepancies in reporting and analytics.
Created automation scripts in Python to streamline the ETL processes, reducing the time required for data extraction, transformation, and loading by 30% while ensuring data consistency.
Utilized SQL for complex data querying and management tasks, optimizing database performance and enabling more efficient data analysis and reporting across cloud-based and on-premises environments.
Education:
Master's in Information Technology Sept 2023 - Jan 2025
Franklin University
Major in Data Analytics