Big Data Engineering

Location: Bellevue, WA
Posted: July 24, 2024

Sai Swetha Gannamani

ad7hve@r.postjobfree.com +1-980-***-****

www.linkedin.com/in/sai-swetha-67a744210

PROFESSIONAL SUMMARY

•8+ years of experience in the data engineering domain, building applications that empower data-driven decision-making using Big Data frameworks and data warehousing solutions, leveraging cloud services from AWS and GCP.

•Skilled in using orchestration tools such as Apache Airflow, Apache Oozie, and AWS Step Functions to automate ETL data pipeline jobs.

•Developed ETL pipelines using Apache Spark, Apache Airflow, and AWS Glue to streamline data extraction, transformation, and loading processes across diverse data sources and destinations.

•Skilled in integrating data from diverse sources using RESTful APIs and ensuring seamless data flow across systems using tools like Apache Nifi, Talend, and Informatica.

•Proficient in designing and optimizing data warehouses with platforms like Snowflake, BigQuery, ensuring efficient data storage and analysis.

•Proficient in messaging systems such as Apache Kafka, RabbitMQ, ActiveMQ, Amazon Kinesis, and Google Cloud Pub/Sub for both batch and real-time data processing workflows.

•Solid grasp of RDD (Resilient Distributed Datasets) operations in Apache Spark, including Transformations, Actions, Persistence (Caching), Accumulators, Broadcast Variables, and Optimizing Broadcasts.
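
A minimal PySpark sketch of those RDD mechanics, using made-up data and a made-up lookup table purely for illustration:

  from pyspark import SparkContext

  sc = SparkContext(appName="rdd-demo")

  lookup = sc.broadcast({"A": "Alpha", "B": "Beta"})        # broadcast variable shared with executors
  bad_rows = sc.accumulator(0)                              # accumulator updated on executors

  def expand(code):
      if code not in lookup.value:
          bad_rows.add(1)                                   # count unknown codes
          return None
      return lookup.value[code]

  rdd = sc.parallelize(["A", "B", "C", "A"])                # source RDD
  mapped = rdd.map(expand).filter(lambda x: x is not None)  # transformations are lazy
  mapped.persist()                                          # cache before reusing the RDD
  print(mapped.count(), bad_rows.value)                     # the action triggers the DAG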

•In-depth understanding of Apache Spark job execution components such as DAG, Lineage Graph, DAG Scheduler, Task Scheduler, Stages, and Tasks.

•Experienced in data modeling techniques, including designing schemas for OLAP and OLTP systems.

•Expertise in using Jenkins for Continuous Integration (CI) and Continuous Delivery (CD) to automate and optimize the software development lifecycle.

•Worked with tools within the Hadoop ecosystem, including HDFS, MapReduce, Hive, and Pig.

•Developed dynamic, interactive dashboards and reports with tools such as Tableau, Power BI, and Looker.

•Dedicated to ensuring data accuracy and security, implementing procedures to optimize performance and enforce access controls.

•Experienced in crafting SQL queries, stored procedures, functions, packages, tables, views, and triggers across a variety of relational databases, including Oracle, MySQL, and PostgreSQL.

•Good understanding and knowledge of NoSQL databases like MongoDB, HBase, and Cassandra.

•Hands-on experience with Amazon Web Services such as Glue, S3, Lambda, Auto Scaling, Redshift, DynamoDB, RDS, EMR, and QuickSight.

•Leveraged Docker and Kubernetes for containerization and orchestration, enhancing application deployment and scalability.

•Thorough understanding and hands-on experience with Software Development methodologies such as Agile and Waterfall.

•Proficient in using version control systems like Git for code management and collaboration.

SKILLS

Big Data Technologies: Apache Spark, Spark SQL, Spark Streaming, Hadoop (MapReduce, HDFS), Apache Kafka, Apache Airflow, Hive, Pig, Oozie, Flume

Programming Languages: Python, Java, Scala, Shell Scripting, SQL, PySpark

Databases: Oracle, MySQL, MongoDB

Operating Systems: Windows, Linux, MacOS

Cloud Technologies: AWS, GCP, Azure

Data Visualization and Reporting: Power BI, Tableau

Messaging Systems: Kafka, ActiveMQ, RabbitMQ

Methodologies: Agile, Waterfall

Containerization: Docker, Kubernetes

PROFESSIONAL EXPERIENCE

Amazon, Seattle, WA

Senior Data Engineer

August 2022 – Present

•Automated the ingestion of sales and customer data from various sources into S3 buckets using AWS Glue ETL jobs.

•Conducted real-time data processing via Kinesis Data Streams, orchestrated data retrieval from S3 through a Lambda trigger, and used Amazon QuickSight for visualizations.
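
A hedged sketch of that trigger pattern: a Lambda handler fired by an S3 put event reads the new object and forwards its records to a Kinesis stream; the stream name and record layout are placeholders, not details from this role.

  import boto3

  s3 = boto3.client("s3")
  kinesis = boto3.client("kinesis")

  def handler(event, context):
      for rec in event["Records"]:                    # one entry per S3 put event
          bucket = rec["s3"]["bucket"]["name"]
          key = rec["s3"]["object"]["key"]
          body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
          for line in body.splitlines():
              kinesis.put_record(
                  StreamName="sales-events",          # placeholder stream name
                  Data=line,
                  PartitionKey=key,
              )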

•Created and managed Hadoop clusters on Amazon EMR to process large volumes of sales data, optimizing performance, ensuring high availability and fault tolerance for continuous data analysis.

•Configured CloudWatch alarms to promptly notify of any issues detected in ETL jobs and S3 buckets.

•Enhanced data extraction and processing times by transitioning from PostgreSQL to DynamoDB using AWS Data Migration Service, achieving a 30% improvement in data retrieval speed.

•Designed and deployed AWS Lambda functions to orchestrate complex workflows, coordinating data ingestion, transformation, and loading processes.

•Developed Spark jobs on an AWS EMR cluster to process sales data retrieved from AWS S3 buckets and store results in DynamoDB using PySpark scripts, ensuring efficient data handling and quick access for sales performance tracking.
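
A hedged sketch of that pattern: a PySpark job on EMR reads sales data from S3, aggregates it, and writes each partition to DynamoDB with boto3; the bucket, table, and column names are placeholders.

  import boto3
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("sales-to-dynamodb").getOrCreate()

  sales = spark.read.parquet("s3://example-bucket/sales/")       # placeholder path
  daily = sales.groupBy("sale_date", "store_id").sum("amount")

  def write_partition(rows):
      table = boto3.resource("dynamodb").Table("sales_daily")    # placeholder table
      with table.batch_writer() as batch:
          for r in rows:
              batch.put_item(Item={
                  "sale_date": r["sale_date"],
                  "store_id": r["store_id"],
                  "total": str(r["sum(amount)"]),                # sent as a string to avoid Decimal handling
              })

  daily.foreachPartition(write_partition)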

•Documented processes, workflows, and best practices to promote knowledge sharing and enhance team collaboration.

•Boosted performance of PySpark SQL scripts through strategic use of caching, persistence, query profiling, and broadcast variables, reducing latency by 20% and optimizing resource utilization.
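
A minimal sketch of two of those techniques at the DataFrame level, caching a reused table and hinting a broadcast join; paths and column names are illustrative only.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.getOrCreate()

  orders = spark.read.parquet("s3://example-bucket/orders/")     # placeholder paths
  stores = spark.read.parquet("s3://example-bucket/stores/")     # small dimension table

  orders.cache()                                       # reused by several downstream queries
  joined = orders.join(F.broadcast(stores), "store_id")           # broadcast join avoids a shuffle
  joined.explain()                                     # inspect the physical plan while profiling
  daily = joined.groupBy("sale_date").agg(F.sum("amount").alias("revenue"))
  daily.show()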

•Optimized ETL workflows to reduce processing time by 40%, leveraging AWS Glue job bookmarks and partitioning data in Amazon S3.
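
A hedged AWS Glue script sketch showing the two levers mentioned, job bookmarks (via transformation_ctx and job.commit) and partitioned S3 output; the database, table, and path names are assumptions.

  import sys
  from awsglue.utils import getResolvedOptions
  from awsglue.context import GlueContext
  from awsglue.job import Job
  from pyspark.context import SparkContext

  args = getResolvedOptions(sys.argv, ["JOB_NAME"])
  glue_context = GlueContext(SparkContext.getOrCreate())
  job = Job(glue_context)
  job.init(args["JOB_NAME"], args)                  # bookmark state is tracked per job run

  src = glue_context.create_dynamic_frame.from_catalog(
      database="sales_db", table_name="raw_sales",  # placeholder catalog entries
      transformation_ctx="src")                     # bookmarks key on this context name

  glue_context.write_dynamic_frame.from_options(
      frame=src, connection_type="s3",
      connection_options={"path": "s3://example-bucket/curated/",
                          "partitionKeys": ["sale_date"]},
      format="parquet")

  job.commit()                                      # persists the bookmark for the next run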

•Conducted root cause analysis and resolved data-related issues promptly using SQL, Python, and AWS CloudWatch to minimize downtime, ensuring continuous and reliable data flow for sales performance assessment.

Environment & Tools: AWS Glue, Amazon EMR, Amazon Redshift, PySpark, S3, CloudWatch, AWS Lambda, JDBC, Amazon QuickSight, Hadoop

DXC Technology, Hyderabad

Data Engineer

October 2019 – July 2021

•Engineered and maintained ETL pipelines that extracted and loaded data into Azure Synapse Analytics, while integrating Apache Spark and Spark Streaming in Azure Databricks (ADB) for enhanced real-time data processing.
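
A hedged Structured Streaming sketch of that pipeline shape; the Kafka source, the JDBC write toward Synapse, and all connection details are assumptions standing in for the actual connectors used.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  events = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
            .option("subscribe", "orders")                      # placeholder topic
            .load()
            .selectExpr("CAST(value AS STRING) AS payload"))

  def to_synapse(batch_df, batch_id):
      # micro-batch write; URL, table, and credentials are placeholders
      (batch_df.write
       .format("jdbc")
       .option("url", "jdbc:sqlserver://example.sql.azuresynapse.net;database=dw")
       .option("dbtable", "staging.orders")
       .option("user", "etl_user").option("password", "***")
       .mode("append")
       .save())

  (events.writeStream
   .foreachBatch(to_synapse)
   .option("checkpointLocation", "/tmp/checkpoints/orders")
   .start())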

•Configured and optimized high concurrency Spark clusters using Azure Databricks to accelerate the preprocessing of high-quality data.

•Created data aggregation pipelines with Apache Spark and Apache Airflow to consolidate and organize diverse datasets.

•Optimized data models for dynamic, real-time usage across diverse applications, catering to OLAP and OLTP requirements.

•Used Databricks notebooks to explore, analyze, and identify trends, patterns, and anomalies in data to inform strategic business decisions.

•Scripted multiple MapReduce programs in Python for data extraction, transformation, and aggregation from various file formats, including XML, JSON, CSV, and compressed formats.
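
A hedged Hadoop Streaming style example of that kind of script: a mapper and reducer that total an amount per key from JSON lines; the field names are made up.

  # mapper.py: emit "store_id<TAB>amount" for each valid JSON record on stdin
  import sys, json

  for line in sys.stdin:
      try:
          rec = json.loads(line)
          print(f"{rec['store_id']}\t{rec['amount']}")
      except (ValueError, KeyError):
          continue                                   # skip malformed records

  # reducer.py: input arrives sorted by key, so sum until the key changes
  import sys

  current, total = None, 0.0
  for line in sys.stdin:
      key, value = line.rstrip("\n").split("\t")
      if key != current:
          if current is not None:
              print(f"{current}\t{total}")
          current, total = key, 0.0
      total += float(value)
  if current is not None:
      print(f"{current}\t{total}")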

•Implemented unit tests to verify that transformed data conforms to the expected schema.
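
A minimal sketch of such a schema check, written here as a pytest-style test; the expected schema and the transform under test are stand-ins.

  from pyspark.sql import SparkSession
  from pyspark.sql.types import StructType, StructField, StringType, DoubleType

  EXPECTED = StructType([
      StructField("store_id", StringType(), True),
      StructField("revenue", DoubleType(), True),
  ])

  def transform(df):
      # stand-in for the real transformation under test
      return df.select("store_id", "revenue")

  def test_transform_matches_expected_schema():
      spark = SparkSession.builder.master("local[1]").getOrCreate()
      source = spark.createDataFrame([("s1", 10.0)], schema=EXPECTED)
      assert transform(source).schema == EXPECTED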

•Orchestrated ETL pipelines using Apache Airflow to automate and streamline scheduled execution.
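
A minimal Airflow DAG sketch of that orchestration, assuming Airflow 2.x imports; the DAG id, schedule, and task callables are illustrative.

  from datetime import datetime
  from airflow import DAG
  from airflow.operators.python import PythonOperator

  def extract():
      pass          # placeholder callables

  def transform():
      pass

  def load():
      pass

  with DAG(dag_id="daily_etl", start_date=datetime(2021, 1, 1),
           schedule_interval="@daily", catchup=False) as dag:
      t_extract = PythonOperator(task_id="extract", python_callable=extract)
      t_transform = PythonOperator(task_id="transform", python_callable=transform)
      t_load = PythonOperator(task_id="load", python_callable=load)
      t_extract >> t_transform >> t_load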

•Developed JSON scripts to deploy pipelines in Azure Data Factory utilizing SQL Activity.

•Identified and troubleshot data inconsistencies in a high-volume transactional database, improving data accuracy by 40%.

•Visualized and created interactive dashboards to provide trends and insights using Tableau.

•Collaborated with data analysts to gather data requirements for various analytics and reporting needs.

Environment & Tools: Azure Databricks, Azure Synapse Analytics, Apache Spark, Spark Streaming, Apache Airflow, Apache Kafka, Azure Data Factory, Tableau

Envision, Hyderabad, TG

GCP Data Engineer

January 2017 – September 2019

•Engineered a real-time streaming data pipeline with Pub/Sub and Dataflow, integrated with CloudSQL to manage processed data within GCP for optimized supply chain operations.
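
A hedged Apache Beam sketch of that Pub/Sub-to-Dataflow shape; the subscription path, message format, and the CloudSQL write step (stubbed here) are assumptions.

  import json
  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  class WriteToCloudSQL(beam.DoFn):
      def process(self, row):
          # the real job would insert via a SQL client; stubbed for illustration
          yield row

  options = PipelineOptions(streaming=True)
  with beam.Pipeline(options=options) as p:
      (p
       | "Read"  >> beam.io.ReadFromPubSub(subscription="projects/demo/subscriptions/orders")
       | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
       | "Write" >> beam.ParDo(WriteToCloudSQL()))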

•Orchestrated data extraction and integration from diverse sources into a unified dataset using Google Data Fusion.

•Developed Pig scripts to perform data transformations and preprocessing tasks within the Hadoop ecosystem, enhancing data quality and consistency.

•Developed Oozie workflows and coordinators to schedule and execute MapReduce jobs on a weekly basis.

•Contributed to the migration of on-premises Hadoop clusters to Google Cloud Platform (GCP), leveraging services like Cloud Storage, Dataproc, Cloud Composer, and BigQuery for scalability, performance, and cost-efficiency.

•Collaborated closely with financial analysts and data scientists to align data infrastructure capabilities with analytical needs.

•Utilized Docker containers to ensure consistent deployment of database applications across multiple environments.

•Implemented robust procedures to enhance data security and enforce access controls within the Hadoop ecosystem.

•Utilized Data Studio and Looker for data visualization and analytics.

•Enhanced database performance through the analysis and optimization of SQL queries, achieving a 30% reduction in query response times and a 20% decrease in server load.

•Applied data modeling best practices, including star schemas, normalized schemas, and data profiling, to enhance data structures and uphold data integrity.

Environment & Tools: BigQuery, Dataflow, Google Cloud Storage, Cloud Dataproc, Cloud Composer, Pub/Sub, CloudSQL, Data Studio, Looker, Docker, Hadoop (HDFS, MapReduce), Oozie

Sify, Bangalore, KA

Data Engineer

September 2015 – January 2017

•Utilized Informatica PowerCenter to design, develop, and maintain ETL pipelines, extracting, transforming, and loading data from SQL Server and flat files into Snowflake enterprise data warehouse to support analytics needs.

•Developed shell scripts to automate data ingestion tasks, enabling seamless and efficient transfer of data between systems.

•Utilized Jenkins to schedule and orchestrate the execution of shell scripts, enhancing automation and ensuring timely data processing.

•Executed test cases and conducted thorough regression testing to validate ETL processes and identify potential data anomalies or discrepancies.

•Designed dynamic dashboards utilizing Tableau, enabling enhanced decision-making by presenting visual trends and key performance indicators.

•Performed statistical analysis by leveraging SQL for querying databases, Python for data manipulation and statistical modeling.
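
A minimal sketch of that workflow, pulling a result set with SQL and summarizing it in Python; the connection string, table, and columns are placeholders.

  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("mssql+pyodbc://user:pass@dsn")          # placeholder connection
  df = pd.read_sql("SELECT region, amount FROM sales", engine)

  summary = df.groupby("region")["amount"].agg(["mean", "std", "count"])
  print(summary)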

•Collaborated with business stakeholders, accountants, and programmers as necessary to address reporting and data analysis requirements aligned with business objectives.

Environment & Tools: Informatica PowerCenter, SQL, Python, Tableau, SQL Server, Shell, Snowflake

EDUCATION

Master's in Computer Science, Stony Brook University, August 2021 - July 2022

Bachelor's in Computer Science and Engineering, July 2011 - July 2015


