
Data Engineer

Location:
Cincinnati, OH
Posted:
February 10, 2025


SURYA J

Email: **************@*****.*** PH: 513-***-****

www.linkedin.com/in/iamjsurya

Sr. Data Engineer

Professional Summary:

Over 10 years of IT experience in Azure Cloud, specializing in administering and designing complex applications and developing data solutions, business intelligence reporting, ETL development, testing, and documentation.

Extensive experience in deploying cloud-based apps using Azure services like Blob Storage, Data Factory, Data Lake, as well as AWS services including EC2, S3, CloudWatch, Glue, Lambda, and DynamoDB.

Experience in deploying web applications into AWS cloud, automating configurations using Terraform.

Proficient in Azure Data Factory (ADF), including performing incremental loads from Azure SQL DB.

Experience in implementing Azure Data Factory (v2) pipeline components such as linked services, datasets, and activities.

Expert in Snowflake for data warehousing, query optimization, performance tuning, and integrating data pipelines using Matillion, dbt, and Snowpipe. Experience in dataset registration in Exchange for S3, One Lake, and Snowflake.

Experienced ServiceNow Professional with deep knowledge of platform features, ITSM best practices, and ServiceNow's Common Service Data Model (CSDM).

Proficient ServiceNow Developer, skilled in scripting and automation using JavaScript, Python, and PowerShell to deliver customized solutions and integrations.

ITSM and CMDB Specialist, experienced in provisioning practices, configuration management, and asset tracking to improve IT operations.

Extensive experience in designing and implementing scalable data pipelines and integrating advanced DataOps platforms like Cognite Data Fusion for industrial data management.

Experience in creating Azure Blob, Azure Data Lake Storage, Azure SQL DB, and Azure Logic Apps.

Experience in migrating on-premises ETL processes to the cloud, ensuring high availability and performance.

Deployed ETL pipelines in AWS, integrating Bitbucket and AWS Elastic Beanstalk.

Working experience with Pipelines in ADF, using Linked Services/Datasets/Pipelines to extract and load data from Azure SQL, ADLS, Blob Storage, and Azure SQL Data Warehouse.

Extensive experience with Informatica tools (PowerCenter, IICS, Edge, EDC, IDQ) for complex ETL processes, data quality, and governance.

Experience in Azure Data Factory Control Flow transformations, including ForEach, Lookup, Until, Web Activity, Wait Activity, and If Condition.

Experience with Big Data/Hadoop Ecosystem: Spark, Hadoop, Hive, Kafka, Oozie, and Databricks.

Strong understanding of Spark Architecture, performing batch and real-time streaming operations using Spark Core, SQL, and Structured Streaming.

Experienced in handling large datasets using Spark in-memory processing, partitions, and broadcast variables.

Experienced in cloud provisioning tools such as Terraform and CloudFormation for infrastructure automation.

Proficient in Connect Direct, Dell Boomi, Filezilla, and MQFTE for secure and efficient large-scale data transfers.

Experience in using different data engineering frameworks across Cloudera, AWS, Azure, and GCP.

Experience with Google Cloud Platform (GCP), leveraging BigQuery, Cloud Storage, Dataflow, and Pub/Sub for large-scale data processing.

Performed Hive operations on large datasets, optimizing HiveQL queries using Partitioning, Bucketing, and Windowing.

Experience in real-time streaming, writing Apache Storm topologies to process Kafka events into Cassandra.

Expert in data modeling using Data Vault 2.0 and Kimball methodologies, optimizing architectures for analytics and reporting.

Skilled in Databricks and Delta Live Tables (DLT) to build scalable pipelines within Medallion Architecture.

Created ETL jobs using Matillion to load server data into the Snowflake Data Warehouse.

Extensive experience working with AWS services, with an in-depth understanding of cloud-based architectures.

Experience in moving data into and out of HDFS and RDBMS using Apache Sqoop.

Experience in building business intelligence reports using SQL Server Reporting Services (SSRS) and Crystal Reports.

Experience in performing Reverse Engineering of Physical Data Models from data and SQL scripts.

Working extensively on Forward Engineering processes, creating DDL scripts for implementing Data Modeling changes.

Experience with Software Development tools such as JIRA, Play, and Git for Agile project management.

Experienced in using Agile methodologies, including SCRUM, Extreme Programming, and Test-Driven Development (TDD).

Technical Skills

Programming Languages: Python, Scala, Java, C, C++, SQL, PySpark, Shell Script, R

Big Data Tools: Hadoop, MapReduce, Apache Spark, Hive, Kafka, Oozie, Databricks

Cloud Platforms: Azure Databricks, Azure Data Lake Storage, Synapse, HDInsight, AWS (S3, EC2, Glue, Lambda, Kinesis, EMR), Snowflake, Terraform

Databases: Oracle, MySQL, MongoDB, Cassandra

ETL Tools: Apache NiFi, Apache Airflow, dbt (data build tool), Informatica (IICS, PowerCenter, IDQ), Matillion, Fivetran, Talend, Meltano, AWS Glue, Azure Data Factory, Google Cloud Dataflow, Stitch, Airbyte, Kafka Connect, Pentaho, Dell Boomi

Scripting Languages: Python, Bash, PowerShell

Python Packages: Pandas, NumPy, Matplotlib, Sklearn

Frameworks: Django

Tools: Rest API, PyCharm, Visual Studio Code, Eclipse, ServiceNow

Visualization Tools: PowerBI, Tableau, Jupyter

Operating Systems: Windows, Linux, MacOS

Professional Experience

Bank of America, Charlotte, NC Sep 2023 – Present

Sr. Data Engineer

Responsibilities:

Orchestrated the design, build, and management of ELT data pipelines with Azure Data Factory, Azure Synapse Pipelines, and other automation tools, ensuring streamlined workflows and high-quality data delivery.

Led the migration of ETL applications from Hortonworks Hadoop to Azure, incorporating ServiceNow CMDB principles for enhanced tracking and provisioning.

Developed and deployed AWS Lambda functions in NodeJS to process and automate data workflows, integrating with SNS/SQS for event-driven architectures and reducing manual processing time by 25%.
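
For illustration, a minimal sketch of the SQS-triggered Lambda pattern described above, written in Python for consistency with the other examples here (the original functions were in NodeJS); the handler name and payload fields are hypothetical:

import json

def handler(event, context):
    """Process messages delivered by SQS using the standard Lambda event shape."""
    processed = 0
    for record in event.get("Records", []):
        payload = json.loads(record["body"])  # message body assumed to be a JSON string
        # Hypothetical downstream step, e.g. stage the payload to S3 or call an internal service.
        print(f"message {record.get('messageId')}: type={payload.get('type')}")
        processed += 1
    return {"processed": processed}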

Built and optimized CI/CD pipelines using Jenkins and AWS CodePipeline, achieving 95% deployment automation and reducing release cycles from weeks to days.

Worked with AWS services including S3 for data storage, RDS/DynamoDB for database management, CloudWatch for monitoring, and AWS CLI for automation, ensuring seamless cloud-based development.

Enhanced data processing performance by 35% through the integration of PySpark in Databricks and scripting in ServiceNow for workflow automation.

Implemented data validation and quality checks using Databricks Delta Live Tables, ensuring 99% data accuracy and system reliability.
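
A minimal Delta Live Tables expectation sketch of the validation pattern above; the table names and rules are hypothetical, and the dlt module is only available inside a Databricks DLT pipeline:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Transactions that passed basic quality checks")
@dlt.expect("has_account_id", "account_id IS NOT NULL")   # violations are tracked, rows are kept
@dlt.expect_or_drop("positive_amount", "amount > 0")      # violating rows are dropped
def validated_transactions():
    # Hypothetical upstream dataset defined earlier in the same pipeline.
    return dlt.read("raw_transactions").withColumn("validated_at", F.current_timestamp())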

Designed and managed distributed systems architecture, leveraging AWS best practices and implementing microservices for scalable solutions.

Worked extensively with big data file formats such as AVRO, Parquet, CSV, ORC, and JSON, leveraging best practices in IT service management (ITSM).

Developed robust solutions using Azure Databricks, Azure Logic Apps, and ServiceNow CSDM methodologies for data organization and compliance.

Successfully implemented Proof of Concept (POC) in development databases to validate requirements and benchmark automated workflows.

Engineered data quality and validation processes using Databricks Delta Live Tables (DLT) and ServiceNow workflows to ensure reliable performance and accurate data.

Designed and managed end-to-end pipelines for orchestrating data processing jobs while adhering to Agile development methodologies like Scrum and Kanban.

Enhanced existing systems by integrating PySpark in Databricks and ServiceNow scripting for optimized data processing and workflow automation.

Created reconciliation notebooks to validate data integrity between source and destination systems, adhering to CSDM principles.
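
An illustrative PySpark sketch of the kind of source-versus-destination check such a reconciliation notebook performs; the table and column names are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source copy and curated destination tables.
src = spark.table("landing.accounts")
dst = spark.table("curated.accounts")

checks = {
    "row_count": (src.count(), dst.count()),
    "sum_balance": (src.agg(F.sum("balance")).first()[0],
                    dst.agg(F.sum("balance")).first()[0]),
    "distinct_account_ids": (src.select("account_id").distinct().count(),
                             dst.select("account_id").distinct().count()),
}

for check, (source_value, destination_value) in checks.items():
    status = "OK" if source_value == destination_value else "MISMATCH"
    print(f"{check}: source={source_value} destination={destination_value} -> {status}")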

Built control flow structures in pipelines to handle datasets of varying sizes, using Python, PowerShell, and JavaScript scripting for enhanced flexibility.

Scheduled and monitored Azure Data Factory pipelines and Spark jobs, incorporating ITSM best practices for performance tracking.

Designed ELT/ETL pipelines leveraging Azure SQL Data Warehouse, ServiceNow platform features, and modern data processing frameworks for scalable, reusable solutions.

Assembled ELT/ETL data pipelines using Azure Data Factory, Azure Databricks (Spark, Scala, Python), and Azure SQL Data Warehouse.

Environment: AWS (Lambda, S3, SNS/SQS, RDS/DynamoDB, Neptune, CloudWatch, AWS CLI, CloudFormation), Azure (Data Factory, Synapse, Databricks, Logic Apps, Blob Storage, SQL Data Warehouse), ServiceNow, Spark, PySpark, Scala, Python, MySQL, Hadoop, Terraform

Centene, St. Louis, MO May 2021 – Aug 2023

Data Engineer

Responsibilities:

As a Data Engineer, responsible for the design, development, and implementation of data pipelines and for maintaining the ETL (extract, transform, and load) process in AWS.

Experienced in writing Hive queries for data analysis to meet business requirements, and in creating and managing Hive tables based on those requirements.

Ran SQL scripts in AWS RDS and created indexes and stored procedures for data analysis.

Experienced in working with Hadoop and Spark (Python and Scala).

Integrated Cognite Data Fusion (CDF) into existing data workflows to enhance industrial data management capabilities.

Developed data pipelines and contextualized data from various industrial sources using Cognite Data Fusion.

Collaborated with business stakeholders to leverage Cognite's real-time analytics for improving operational efficiency and decision-making.

Configured Cognite APIs for seamless data integration and visualization.

Prepared scripts in Python and Shell for automation of administration tasks.

Designed various ETL strategies from various heterogeneous sources (such as flat files, Excel, Access databases, and SQL Server) utilizing AWS.

Cleansed data using Spark jobs to make it suitable for ingestion into Hive tables for further analysis.

Experience with Snowflake Multi-Cluster Warehouses.

Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
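
A minimal sketch of the S3-to-Snowflake nested JSON load pattern using the snowflake-connector-python package; the stage, table, and connection parameters are hypothetical:

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="LOAD_WH", database="RAW", schema="CLAIMS",
)
cur = conn.cursor()

# Target table with a single VARIANT column for the nested JSON payload.
cur.execute("CREATE TABLE IF NOT EXISTS raw_claims (payload VARIANT)")

# Load JSON files from a pre-defined external S3 stage.
cur.execute("""
    COPY INTO raw_claims
    FROM @claims_s3_stage/daily/
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Flatten one nested array for downstream modeling.
cur.execute("""
    SELECT payload:claim_id::STRING, line.value:amount::NUMBER
    FROM raw_claims, LATERAL FLATTEN(input => payload:lines) line
""")
print(cur.fetchmany(5))
conn.close()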

Configured EC2 instances by writing Terraform templates.

Implemented Data Vault 2.0 methodology to manage enterprise data warehousing, ensuring flexibility, scalability, and historical data tracking.

Designed and maintained Hubs, Links, and Satellites in the Data Vault model to support business needs and ensure data integrity.

Used Terraform to write Infrastructure as Code (IaC) and created scripts for EC2 instances and S3 buckets.

Experience in using Snowflake Clone and Time Travel.

Utilized Hive tables and HQL queries to generate daily and weekly reports. Worked on complex data types in Hive like Structs and Maps.

Used Spark SQL and Scala on the Spark engine to develop end-to-end ETL pipelines.

Involved in code/design analysis, strategy development and project planning.

Imported all gathered data from different sources into Spark RDD for further data transformations and analysis.

Worked on building scalable distributed data solution using Hadoop.

Experienced in working with vendors to onboard external data into target S3 buckets.

Utilized Oozie workflows to run multiple Hive jobs independently based on time and data availability.

Monitored and controlled Local disk storage and Log files using AWS CloudWatch.

Used MongoDB to create collections for loading large sets of incoming structured, semi-structured, and unstructured data.

Created reports for the BI team, using Sqoop to import data into HDFS and Hive.

Used D3.js and Tableau to explain and communicate data insights, significant features, model scores, and the performance of the new recommendation system to both technical and business teams.

Spearheaded the design, development, and implementation of data pipelines using Azure technologies.

Played a pivotal role in project architecture and application design utilizing cloud and big data solutions on Azure.

Specialized in building enterprise data warehouses and data mart applications from scratch using Azure Synapse Analytics and SQL Server.

Contributed to requirement analysis, gathering, and understanding business needs.

Environment: Python, Hadoop, Spark, Sqoop, Spark SQL, AWS, Cognite Data Fusion (CDF), Hive, Scala, Pig, NoSQL, Oozie, MySQL, Tableau.

Amazon Jul 2019 – Apr 2021

Sr. Data Engineer

Responsibilities:

Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark on YARN.

Involved in file movements between HDFS and AWS S3, worked extensively with S3 buckets in AWS, and converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.

Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations; imported data from different sources into Spark RDDs for processing, developed custom aggregate functions using Spark SQL, and performed interactive querying.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and in using Sqoop for importing and exporting data between RDBMS and HDFS.

Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
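
A minimal Structured Streaming sketch of the near-real-time S3-to-HDFS pattern above; the bucket path, schema, and aggregation are hypothetical:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("learner_stream").getOrCreate()

# Hypothetical schema for incoming learner events.
schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("score", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Read newly arriving JSON files from a hypothetical S3 prefix.
events = spark.readStream.schema(schema).json("s3a://example-bucket/incoming/")

# Aggregate on the fly with a watermark so append output is valid.
model = (events
         .withWatermark("event_ts", "10 minutes")
         .groupBy("learner_id", F.window("event_ts", "5 minutes"))
         .agg(F.avg("score").alias("avg_score")))

# Persist results to HDFS with a checkpoint for fault tolerance.
query = (model.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs:///data/learner_model/")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_model/")
         .start())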

Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements and involved in managing S3 data layers and databases including Redshift and Postgres.

Processed the web server logs by developing multi-hop flume agents by using Avro Sink and loaded into MongoDB for further analysis and worked on MongoDB NoSQL data modeling, tuning, disaster recovery and backup.

Developed a Python Script to load the CSV files into the S3 buckets and created AWS S3 buckets, performed folder management in each bucket, managed logs and objects within each bucket.
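
An illustrative boto3 sketch of the CSV-to-S3 load script described above; the bucket name and local directory are hypothetical:

import os
import boto3

s3 = boto3.client("s3")
bucket = "example-landing-bucket"   # hypothetical bucket name
local_dir = "/data/exports"         # hypothetical local export directory

# Upload every CSV in the directory under an "incoming/" prefix (simple folder management).
for name in os.listdir(local_dir):
    if name.endswith(".csv"):
        key = f"incoming/{name}"
        s3.upload_file(os.path.join(local_dir, name), bucket, key)
        print(f"uploaded {name} -> s3://{bucket}/{key}")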

Worked with different file formats like JSON, Avro, and Parquet and compression techniques like Snappy; developed Python code for task dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
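
A minimal Airflow 2.x DAG sketch of the task-dependency, SLA, and time-sensor pattern described above; the DAG id, schedule, and callable are hypothetical:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_delta import TimeDeltaSensor

def load_partition(**_):
    # Hypothetical load step, e.g. copy the day's files and register a partition.
    print("loading daily partition")

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),          # SLA misses are reported by the scheduler
}

with DAG(
    dag_id="daily_ingest_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Wait a fixed offset after the schedule before running downstream tasks.
    wait_for_upstream = TimeDeltaSensor(task_id="wait_for_upstream", delta=timedelta(hours=1))
    load = PythonOperator(task_id="load_partition", python_callable=load_partition)

    wait_for_upstream >> load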

Developed shell scripts for dynamic partitions adding to hive stage table, verifying JSON schema change of source files, and verifying duplicate files in source location.

Worked with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).

Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive structured and unstructured data.

Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis, and imported and cleansed high-volume data from sources like DB2, Oracle, and flat files onto SQL Server. Handled container management using Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.

Worked extensively with importing metadata into Hive, migrated existing tables and applications to work on Hive and the AWS cloud, and made the data available in Athena and Snowflake.

Extensively used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.

Environment: Spark, Spark SQL, Spark Streaming, AWS (EC2, EMR, S3, Glue), Hive, SQL Workbench, Tableau, Kibana, Sqoop, Scala, Python, Hadoop (Cloudera Stack), Informatica, Jenkins, Docker, Hue, Netezza, Kafka, HBase, HDFS, Pig, Oracle, ETL, Git, Grafana.

Truist Bank, Charlotte, NC Mar 2017 – Jun 2019

Data Engineer

Responsibilities:

Running Spark SQL operations on JSON, converting the data into a tabular format with data frames, then saving and publishing the data to Hive and HDFS.
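
A minimal PySpark sketch of the JSON-to-DataFrame-to-Hive flow above; the paths, field names, and target table are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json_to_hive")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical landing path; Spark infers the schema from the JSON files.
raw = spark.read.json("hdfs:///landing/events/")

# Project nested fields into a flat, tabular layout (field names are illustrative).
flat = raw.selectExpr(
    "event_id",
    "payload.account_id AS account_id",
    "payload.amount AS amount",
    "event_ts",
)

# Persist as a Hive table and also as Parquet files on HDFS.
flat.write.mode("overwrite").saveAsTable("analytics.events_flat")
flat.write.mode("overwrite").parquet("hdfs:///warehouse/events_flat/")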

Developing and refining shell scripts for data input and validation with various parameters, as well as developing custom shell scripts to execute Spark jobs.

Creating Spark tasks by building RDDs in Python and DataFrames in Spark SQL to analyze data and store it in S3 buckets.

Working with JSON files, parsing them, saving data in external tables, and altering and improving data for future use.

Taking part in design, code, and test inspections to discover problems throughout the life cycle, and explaining technical considerations and upgrades to clients at appropriate meetings.

Creating data processing pipelines by building spark jobs in Scala for data transformation and analysis.

Working with structured and semi-structured data to process data for ingestion, transformation, and analysis of data behavior for storage.

Using the Agile/Scrum approach for application analysis, design, implementation, and improvement as stated by the standards.

Creating Hive tables and dynamically loading data into EDW and historical metrics tables utilizing partitioning and bucketing.
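
An illustrative sketch of loading a partitioned Hive EDW table with dynamic partitions via Spark SQL; the table names are hypothetical and bucketing is omitted for brevity:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical EDW table, partitioned by load date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS edw.daily_metrics (
        metric_name  STRING,
        metric_value DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Allow partition values to come from the query itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT INTO TABLE edw.daily_metrics PARTITION (load_date)
    SELECT metric_name, metric_value, load_date
    FROM staging.daily_metrics
""")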

Performed Linux operations on the HDFS server for data lookups, job changes when commits were disabled, and data storage rescheduling.

Using SQL queries to test and validate database tables in relational databases, as well as to execute data validation and data integration.

Collaborating with SAs and Product Owners to gather and analyze requirements and document them as JIRA user stories for technical and business teams.

Documenting tool and technology procedures and workflows in Confluence for future usage, improvements, and upgrades.

Migrating code to version control using Git commands for future usage and to guarantee a seamless development workflow.

Environment: MobaXterm, Linux, Shell, JIRA, Confluence, Jupyter, SQL, HDFS, Spark, Hive 2.0, Python, AWS, CDH, PuTTY.

Grapesoft Solutions, Hyderabad, India Nov 2014 – Dec 2016

Data Analyst

Responsibilities:

Attended and participated in information and requirements gathering sessions and translated business requirements into working logical and physical data models for Data Warehouse, Data marts and OLAP applications.

Performed extensive Data Analysis and Data Validation on Teradata and designed Star and Snowflake Data Models for Enterprise Data Warehouse using ERWIN.

Created and maintained the Logical Data Model (LDM) for the project, including documentation of all entities, attributes, data relationships, primary and foreign key structures, allowed values, codes, business rules, glossary terms, etc.

Integrated data from various data sources such as MS SQL Server, DB2, Oracle, Netezza, and Teradata using Informatica to perform extraction, transformation, and loading (ETL). Worked on ETL development and data migration using SSIS, SQL*Loader, and PL/SQL.

Designed and developed logical and physical data models and metadata to support the requirements using ERWIN.

Used the ETL tool Informatica to populate the database and to transform data from the old database to the new Oracle database.

Involved in dimensional modeling (Star Schema methodologies), building and designing the logical data model into dimensional models, and in performance query tuning along with index maintenance.

Involved in the creation and maintenance of the Data Warehouse and repositories containing metadata, and wrote and executed unit, system, integration, and UAT scripts in Data Warehouse projects.

Wrote and executed SQL queries to verify that data has been moved from transactional system to DSS, Data Warehouse, and data mart reporting system in accordance with requirements.

Responsible for Creating and Modifying T-SQL stored procedures/triggers for validating the integrity of the data.

Worked on Data Warehouse concepts and dimensional data modelling using Ralph Kimball methodology.

Created a number of standard and complex reports to analyze data using Slice & Dice, Drill Down, and Drill Through in SSRS.

Developed separate test cases for ETL process (Inbound & Outbound) and reporting.

Technology: Oracle 9i/10g, MS Visio, PL/SQL, T-SQL, Microsoft SQL Server, SSRS, Rational Rose, Data Warehouse, OLTP, OLAP, ERWIN, Informatica 9.x, SQL, Talend Data Quality, Flat Files, Windows.

Education

National Institute of Technology Trichy, India

Bachelor of Technology


