Name: SUBHANI, FNU
Senior Data Engineer
Phone: 401-***-****
Email: **********@*****.***
LinkedIn: www.linkedin.com/in/fnu-subhani-94b26026
PROFESSIONAL SUMMARY
Over 8 years of experience in the software industry, specializing in AWS cloud services.
Proficient in Big Data technologies, including Spark, MapReduce, Hive, YARN, and HDFS.
Skilled in programming languages such as Scala and Python.
Knowledgeable in Data Warehousing, with extensive experience in managing AWS Redshift and designing data lakes.
Hands-on expertise with AWS components, including AWS Data Pipeline, AWS Glue, AWS Lambda, Amazon S3, Amazon Redshift, and AWS DevOps tools.
Demonstrated ability to efficiently migrate data across AWS services, showcasing competence in AWS data migration and storage solutions.
Strong background in Data Load/Integration using AWS services. Experience in building ETL pipelines leveraging AWS Glue and PySpark.
Experience developing pipelines in Spark using Scala and PySpark. Experience working with AWS Step Functions for process automation.
Hands-on experience across the complete Software Development Life Cycle (SDLC) using Agile and hybrid methodologies.
Experience analyzing data using Big Data ecosystems, including HDFS, Hive, HBase, ZooKeeper, Pig, Sqoop, and Flume.
Knowledge and working experience with big data tools like Hadoop, AWS EMR, and Apache Airflow.
Proficient in utilizing Snowflake to design scalable cloud-based data warehousing solutions, improving data query performance and system scalability.
Experienced in implementing data security and compliance measures within Snowflake, ensuring robust data governance and privacy standards are met.
Developed and optimized data pipelines using Snowflake, enhancing data ingestion, storage, and retrieval processes to support real-time analytics and business intelligence.
Experience in workflow scheduling with Airflow and AWS Data Pipeline.
Experience in migrating SQL databases to AWS RDS, AWS Redshift, and managing data access and security in cloud environments.
Good understanding of Hadoop and YARN architecture, along with core Hadoop components such as JobTracker, TaskTracker, NameNode, and DataNode.
Experience applying text analytics and data mining to various business problems, and generating data visualizations using Python and Amazon QuickSight.
Strong knowledge in developing Spark applications using Spark-SQL in AWS EMR for data extraction, transformation, and aggregation from multiple file formats.
Experience in AWS cloud services, machine learning, and full-stack system integration. Engaged in ongoing AWS training, focusing on security and machine learning enhancements.
Good understanding of Spark Architecture, including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors, and Tasks.
Experienced in real-time streaming with Kafka and in using AWS Kinesis for data pipelines.
Deployed AWS Lake Formation to enable data lake solutions with fine-grained access control and transaction support, ensuring data consistency and reliability.
Experience with Apache Kafka for messaging and streaming applications and integrating with AWS services using Scala.
Well-versed in using ETL methodology for supporting solutions across the enterprise using AWS Glue and other integration tools.
Worked on data serialization formats for converting complex data objects using formats like Parquet, ORC, AVRO, JSON, and CSV.
Optimized Spark jobs and workflows by tuning Spark configurations, partitioning, and memory allocation in AWS environments.
Extensive experience in developing, maintaining, and implementing EDW, Data Marts, ODS, and Data Warehouse architectures using AWS technologies.
Hands-on experience with GitHub to manage and maintain code versions effectively.
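As an illustrative aside on the serialization formats listed above (Parquet, ORC, Avro, JSON, CSV), here is a minimal pure-Python sketch of a JSON-to-CSV conversion; the record schema is hypothetical, and columnar formats such as Parquet or ORC would require additional libraries:

```python
import csv
import io
import json

def json_records_to_csv(json_lines: str) -> str:
    """Convert newline-delimited JSON records to CSV (hypothetical schema)."""
    records = [json.loads(line) for line in json_lines.strip().splitlines()]
    # Union of all keys, sorted, becomes the CSV header.
    fieldnames = sorted({key for rec in records for key in rec})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

ndjson = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'
print(json_records_to_csv(ndjson))
```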
TECHNICAL SKILLS:
Big Data Technologies: MapReduce, Hive, Python, PySpark, Scala, Kafka, Spark Streaming, Oozie, Sqoop, ZooKeeper
Hadoop Distributions: Cloudera, Hortonworks
AWS Services: Amazon S3, Amazon Redshift, AWS EMR, AWS SNS, AWS SQS, AWS Athena, AWS Glue, AWS CloudWatch, AWS Kinesis, AWS Route 53, AWS IAM, AWS EC2, AWS Lambda, Amazon QuickSight, Amazon Macie, Amazon MSK
Languages: Java, SQL, PL/SQL, Python, HiveQL, Scala
Web Technologies: HTML, CSS, JavaScript, XML, JSP, RESTful, SOAP
Operating Systems: Windows, UNIX, Linux, Ubuntu, CentOS
Build Automation Tools: Ant, Maven
Version Control: Git, GitHub
IDEs & Build Tools: Eclipse, Visual Studio
Databases: MS SQL Server 2016/2014/2012, AWS RDS, AWS DynamoDB, Oracle 11g/12c, Cosmos DB, MS Excel, MS Access
WORK EXPERIENCE:
Client: DTCC, TX June 2023 – Present
Role: Sr. Data Engineer
Responsibilities:
Managed all phases of software engineering including requirements analysis, application design, coding, and testing, ensuring adherence to best practices in cloud architecture.
Developed and maintained robust ETL data pipelines using AWS Data Pipeline and AWS Glue, working with extensive datasets stored in Amazon S3.
Utilized knowledge of the AWS big data ecosystem to create scalable solutions using AWS EMR, leveraging technologies such as Spark, Scala, Python, and Hive.
Facilitated the onboarding of applications from various platforms by creating stubs for producers, consumers, and consumer groups, enhancing integration capabilities.
Supported data transformation processes including the management of data structures, metadata, dependencies, and workload across various cloud-based platforms.
Built and optimized ELT/ETL pipelines for moving data to and from AWS Redshift, employing Python and SQL techniques for efficient data storage and retrieval.
Designed and implemented ETL workflows using custom scripts and AWS services to integrate data from diverse sources into consolidated data warehouses.
Wrote complex SQL queries for data extraction and manipulation in Amazon RDS and Redshift, ensuring optimal performance of database operations.
Implemented and maintained large-scale data warehouses using Snowflake, optimizing data storage and retrieval processes to support advanced analytics.
Designed and executed data migration strategies from legacy systems to Snowflake, ensuring seamless data integration and minimal downtime.
Developed complex SQL queries and data transformation scripts within Snowflake, enhancing data manipulation and reporting capabilities.
Configured Snowflake's data sharing features to securely share real-time insights across different departments, facilitating improved decision-making and operational efficiency.
Created and managed database schemas, tables, views, indexes, and stored procedures using SQL, maintaining data integrity and accessibility.
Developed data transformations and validations using AWS Lambda and AWS Glue, ensuring high data quality and compliance with business rules.
Utilized Amazon SageMaker for deploying machine learning models to drive predictive analytics projects. Developed full-stack applications using AWS Amplify and API Gateway. Managed containerized applications using AWS ECS and EKS.
Collaborated with AWS architects to monitor and troubleshoot issues related to process automation and data pipeline efficiency.
Coded and optimized AWS Lambda functions for data extraction, transformation, and loading processes, handling data from diverse sources like databases and APIs.
Designed and maintained data integration solutions across Hadoop and RDBMS within AWS, ensuring seamless data flow and system interoperability.
Implemented continuous integration and deployment pipelines using AWS CodePipeline and Jenkins, enhancing development operations and project agility.
Deployed and managed Delta Lake on AWS to ensure data consistency and support ACID transactions, particularly for high-stakes projects.
Maintained and enhanced data pipelines using Delta Lake on AWS, improving data reliability and operational efficiency significantly.
Worked closely with DevOps to develop and maintain automated CI/CD pipelines tailored to project specifications, enhancing workflow efficiency.
Gained hands-on experience in programming with Python and Scala, applying advanced coding skills to solve complex data engineering challenges.
Actively managed Hive scripts and Spark SQL tasks to maintain data integrity and ensure stability across all ETL operations.
Utilized JIRA to manage project deliverables, track issues, and coordinate tasks across development, QA, and partner validation phases.
Engaged in the full breadth of Agile practices, from daily stand-ups to internationally coordinated PI planning, ensuring alignment with project goals and timelines.
Environment: AWS EMR, AWS Data Pipeline, AWS Lambda, Amazon S3, Amazon Redshift, AWS Glue, Amazon RDS, Amazon DynamoDB, HDFS, MapReduce, YARN, Spark, Hive, SQL, Python, Scala, PySpark, Shell scripting, Git, JIRA, Jenkins, Apache Kafka, AWS Step Functions, Amazon QuickSight.
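A minimal sketch of the kind of Lambda-based transformation described in this role, shaped as a Kinesis Data Firehose record-transform handler; the event layout follows the Firehose contract, while the `status` field and the transform itself are hypothetical:

```python
import base64
import json

def lambda_handler(event, context):
    """Decode Firehose records, normalize a hypothetical 'status' field,
    and re-encode each record for delivery."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["status"] = payload.get("status", "").upper()  # example transform
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # Firehose expects Ok / Dropped / ProcessingFailed
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```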
Client: Verisk, Boston, Massachusetts Dec 2022 - June 2023
Role: Data Engineer
Responsibilities:
Developed AWS-based scripting solutions to automate data pipelines, ETL processes, and data transformations, utilizing AWS Glue and Lambda.
Designed and implemented comprehensive data ingestion and storage solutions using AWS S3, Redshift, and Glue.
Engineered ETL workflows using AWS Glue to extract, transform, and load data from multiple sources into Redshift, enhancing data integration and analytics capabilities.
Integrated AWS SNS and SQS for real-time event processing and messaging, improving communication and process efficiency.
Extensively utilized AWS Step Functions for orchestrating complex workflows and monitoring them to ensure operational integrity.
Implemented AWS Athena to perform ad-hoc data analysis directly on data stored in S3, providing flexible and powerful querying capabilities.
Employed AWS CloudWatch for resource monitoring, setting up alarms, and collecting metrics to maintain system health and performance.
Designed and executed real-time data streaming solutions using AWS Kinesis, facilitating immediate data processing needs.
Managed DNS configurations and routing effectively using AWS Route53, ensuring optimal application and service deployment.
Leveraged Spark JDBC connections to efficiently extract data from diverse sources, including relational databases and CSV files stored in Amazon S3 buckets.
Utilized Spark SQL to build and manipulate complex data frames, performing sophisticated SQL operations to support advanced data analysis.
Conducted operations on data frames, including schema management, advanced aggregations, data type conversions, and complex joins, enhancing data usability.
Established topics in Amazon SNS for effective notifications to subscribers, and implemented cross-account messaging with Amazon SQS.
Employed Apache Kafka integrated with AWS services for real-time error capture and developed Spark Streaming applications for processing and storing streaming data using Scala.
Interfaced with various databases, including PostgreSQL, for data retrieval and execution of data extraction tasks.
Orchestrated Docker containers to streamline application deployment and management, ensuring consistent environments across development stages.
Managed code, data, and configurations across multiple environments (Development, QA, Production) using Jenkins and AWS CodePipeline, ensuring continuous integration and delivery.
Utilized AWS Glue extensively to deploy and manage Spark jobs within AWS EMR clusters, optimizing big data processing tasks.
Developed AWS Lambda functions to automate server management tasks and execute essential code snippets efficiently within the AWS cloud environment.
Engaged in comprehensive version control practices using Bitbucket to enhance team collaboration and code quality.
Worked with JSONB data formats for data conversion and storage solutions, utilizing AWS technologies for optimal performance.
Utilized Terraform scripts for provisioning and managing AWS resources effectively, ensuring the scalable deployment of EMR Spark jobs.
Led the deployment of Spark jobs on EMR clusters, significantly contributing to distributed data processing and analytics across the organization.
Actively participated in Agile/Scrum development processes, including sprint planning, daily stand-ups, and iterative development cycles, to ensure alignment with project goals.
Environment: AWS S3, Redshift, Glue, AWS SNS, AWS SQS, AWS Athena, AWS CloudWatch, AWS Kinesis, AWS Route53, Amazon RDS, AWS EMR, Spark, Hive, MapReduce, PostgreSQL, Oracle, SQL Server, Terraform, Docker, Jenkins, Git, Bitbucket, Apache Kafka, AWS Lambda, AWS Step Functions, Boto3.
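Real-time Kinesis pipelines like those above must batch records under the PutRecords service limits (500 records and 5 MB per request); a pure-Python sketch of that batching logic, with the boto3 call itself omitted:

```python
MAX_RECORDS_PER_BATCH = 500          # Kinesis PutRecords record-count limit
MAX_BATCH_BYTES = 5 * 1024 * 1024    # 5 MB per PutRecords request

def batch_records(records):
    """Group encoded records into batches that respect PutRecords limits."""
    batches, current, current_bytes = [], [], 0
    for rec in records:
        size = len(rec)
        # Flush the current batch if adding this record would exceed a limit.
        if current and (len(current) >= MAX_RECORDS_PER_BATCH
                        or current_bytes + size > MAX_BATCH_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(rec)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be sent with a single `put_records` call in the real pipeline.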
Client: Axis Bank, Hyderabad, India Sep 2019 - Mar 2021
Role: Big Data Engineer
Responsibilities:
Utilized AWS Data Pipeline and AWS Glue to automate data ingestion from MySQL to Amazon S3, transitioning from traditional Sqoop usage.
Performed data aggregations using Apache Spark and Scala within AWS EMR, storing results in AWS Glue Data Catalog for further analysis.
Managed data lakes using AWS Lake Formation and worked within AWS big data ecosystems, integrating with EMR and replacing traditional Hadoop environments like Hortonworks and Cloudera.
Developed HiveQL queries within AWS EMR to perform complex data analysis to meet business requirements.
Created and managed HBase tables on AWS, utilizing integration with Hive for analytics capabilities.
Processed streaming data using Kafka and AWS Kinesis to enhance real-time data analytics.
Developed data pipelines with AWS Glue and used Kinesis Data Firehose for ingesting behavioral data into AWS S3 for analysis.
Analyzed data clusters using tools available in AWS EMR, leveraging Spark, Hive, and custom MapReduce jobs.
Integrated Kafka, Spark, and Hive on AWS EMR to construct data pipelines that ingest, transform, and analyze large datasets.
Wrote UNIX shell and YAML scripts within AWS environments to define workflows and automate deployment processes using AWS CloudFormation.
Migrated large datasets from Oracle RDBMS to AWS using AWS Database Migration Service, improving data processing workflows.
Used PySpark and Spark SQL within AWS EMR for advanced data processing and testing, significantly reducing processing times.
Configured AWS Kinesis for handling and processing streaming data into batches, optimizing batch processing techniques.
Employed AWS services to coordinate and synchronize operations across server clusters, effectively replacing ZooKeeper.
Managed and scheduled jobs using AWS Step Functions and Lambda, improving efficiency and scalability over the Oozie workflow engine.
Maintained code repositories using Git, integrated with AWS CodeCommit to enhance version control and team collaboration.
Environment: AWS EMR, AWS Data Pipeline, AWS Glue, Amazon S3, AWS Lake Formation, AWS Kinesis, Apache Spark, Hive, AWS RDS, Kafka, AWS Database Migration Service, AWS CloudFormation, AWS Step Functions, AWS Lambda, MySQL, Python, PySpark, Shell scripting, Git, AWS CodeCommit, JIRA.
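Streaming aggregation of the sort described above (Kafka/Kinesis events processed in micro-batches) often reduces to tumbling-window counts; a simplified pure-Python sketch with hypothetical event tuples:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Aggregate (timestamp, key) events into fixed, non-overlapping windows,
    similar in spirit to a Spark Streaming micro-batch count."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical click-stream events: (epoch seconds, event type)
events = [(3, "click"), (61, "click"), (65, "view"), (119, "click")]
print(tumbling_window_counts(events))
```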
Client: Careator Technologies, Hyderabad, India Mar 2016 - Aug 2019
Role: Data Warehouse / Big Data Developer
Responsibilities:
●Worked as a SQL Server Analyst/Developer/DBA on SQL Server 2012, 2014, and 2016.
●Created jobs, alerts, and Database Mail notifications with SQL Server Agent, and scheduled DTS/SSIS packages.
●Managed and updated Erwin models (logical/physical data modeling) for the Consolidated Data Store (CDS), Actuarial Data Mart (ADM), and Reference DB according to user requirements.
●Exported current data models from Erwin to PDF and published them to SharePoint for various users.
●Wrote triggers, stored procedures, and functions in Transact-SQL (T-SQL), and created and maintained physical structures.
●Maintained source code in Git and GitHub repositories.
●Built an ETL framework using Sqoop, Pig, and Hive to regularly ingest data from source systems and make it available for consumption.
●Developed complex stored procedures, efficient triggers, and required functions, and created indexes and indexed views for performance.
●Monitored and tuned SQL Server performance.
●Designed ETL data flows in SSIS, creating mappings/workflows to extract data from SQL Server and to migrate and transform data from Access/Excel sheets.
●Performed dimensional data modeling for data mart design, identifying facts and dimensions and developing fact and dimension tables using Slowly Changing Dimensions (SCD).
●Handled errors and events in SSIS using precedence constraints, breakpoints, checkpoints, and logging.
●Built cubes and dimensions with different architectures and data sources for business intelligence, and wrote MDX scripts.
●Developed SSAS cubes, aggregations, KPIs, measures, cube partitions, and data mining models, and deployed and processed SSAS objects.
●Created ad hoc reports and reports with complex formulas, querying the database for business intelligence.
●Developed parameterized, chart, graph, linked, dashboard, and scorecard reports on SSAS cubes using drill-down, drill-through, and cascading reports in SSRS.
●Extracted the data from MySQL into HDFS using Sqoop.
●Automated deployments using YAML scripts for large-scale builds and releases, working with Apache Hive, Apache Pig, HBase, Apache Spark, ZooKeeper, Flume, Kafka, and Sqoop.
●Implemented Data classification algorithms using MapReduce design patterns.
●Extensively worked on combiners, custom partitioning, and the distributed cache to improve the performance of MapReduce jobs.
Environment: SQL Server 2008/2012 Enterprise Edition, SSRS, SSIS, T-SQL, Shell scripting, Windows Server 2003, PerformancePoint Server 2007, Oracle 10g, Hadoop, Hive, Spark, PySpark, Sqoop, Spark SQL, Cassandra, YAML, ETL.
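The Slowly Changing Dimension (Type 2) handling mentioned above expires the current dimension row and inserts a new version when a tracked attribute changes; a simplified pure-Python sketch with hypothetical column names:

```python
from datetime import date

def scd2_apply(dimension, incoming, today):
    """Apply SCD Type 2: on attribute change, close out the current row
    (set valid_to, clear is_current) and append a new current version."""
    current = {row["key"]: row for row in dimension if row["is_current"]}
    for rec in incoming:
        existing = current.get(rec["key"])
        if existing is None:
            # Brand-new key: insert as the first current version.
            dimension.append({**rec, "valid_from": today,
                              "valid_to": None, "is_current": True})
        elif existing["attr"] != rec["attr"]:
            # Changed attribute: expire old row, insert new version.
            existing["valid_to"] = today
            existing["is_current"] = False
            dimension.append({**rec, "valid_from": today,
                              "valid_to": None, "is_current": True})
    return dimension
```

In a warehouse this same logic would typically run as a `MERGE` statement or an SSIS SCD transform rather than row-by-row Python.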