Ranadheer Reddy Garlapati
Senior Data Engineer
Email Id: ****************@*****.***
Phone: 469-***-****
LinkedIn: Ranadheer Reddy Garlapati
PROFESSIONAL SUMMARY:
Accomplished Data Engineer with 8 years of IT experience, adept at designing and deploying cloud ETL solutions on the Microsoft Azure platform using Azure Data Factory, Azure Databricks, and Azure Data Lake Storage.
6 years of dedicated experience in Azure Cloud Solutions and Big Data technologies, designing and establishing optimal cloud solutions for efficient data migration and processing.
Comprehensive expertise in architecting enterprise-scale data solutions using Azure Data Factory, implementing sophisticated ETL/ELT workflows and optimization strategies while ensuring robust data governance and compliance with industry standards across cloud environments.
Advanced proficiency in Azure Data Lake Storage Gen2 implementation, specializing in hierarchical namespace management, access control optimization, and efficient data organization strategies for large-scale enterprise deployments across multiple business domains.
Deep understanding of Azure Databricks ecosystem, leveraging PySpark and SparkSQL for complex data transformations while implementing sophisticated performance tuning, automated scaling, and cost optimization strategies across development environments.
Expertise in Azure Synapse Analytics architecture, focusing on workload management, materialized views implementation, and establishing efficient data warehouse patterns while maintaining optimal query performance and resource utilization.
Strong capabilities in Azure Event Hubs and Stream Analytics configuration, implementing real-time data processing architectures with robust error handling, automated recovery mechanisms, and comprehensive monitoring solutions.
Advanced expertise in Azure Monitor and Application Insights implementation, developing comprehensive monitoring solutions with custom metrics, sophisticated alerting systems, and detailed performance optimization dashboards across environments.
Extensive knowledge in Azure Security frameworks, leveraging Azure Key Vault and Azure Active Directory for implementing robust security architectures, encryption standards, and compliance-focused access control solutions.
Proven proficiency in Azure Logic Apps development, specializing in workflow automation and integration patterns while implementing custom connectors, error handling mechanisms, and automated notification systems.
Strong understanding of Azure Data Factory mapping data flows, implementing complex transformation patterns with built-in data validation rules, quality checks, and sophisticated error handling mechanisms.
Advanced capabilities in Azure DevOps practices, implementing automated CI/CD pipelines and Infrastructure as Code (IaC) methodologies while ensuring consistent deployment across development and production environments.
Strong expertise in Delta Lake architecture within Azure Databricks, implementing ACID transactions, version control mechanisms, and time travel capabilities while optimizing storage costs through compression techniques.
Deep knowledge in Azure Functions development, implementing serverless architectures and event-driven solutions for automated data processing while ensuring optimal performance and cost-effective resource utilization.
Demonstrated proficiency in Azure data movement optimization, implementing parallel processing strategies and compression techniques while monitoring and optimizing data transfer costs across cloud environments.
Advanced understanding of Azure Log Analytics implementation, developing comprehensive audit frameworks and monitoring solutions while ensuring compliance with data governance requirements and regulatory standards.
Extensive experience in incremental loading patterns using Azure technologies, implementing changed data capture mechanisms and maintaining historical data integrity while ensuring optimal performance and accuracy.
Strong capabilities in Azure Purview implementation, developing automated data discovery systems, classification frameworks, and comprehensive data governance solutions while maintaining detailed lineage across enterprise environments.
Comprehensive expertise in Azure cost optimization strategies, implementing resource utilization monitoring, automated scaling solutions, and cost management frameworks while maintaining optimal performance across cloud services.
Advanced proficiency in hybrid architecture design, implementing secure integration patterns between on-premises systems and Azure services while ensuring robust connectivity and efficient data synchronization mechanisms.
Demonstrated knowledge in Azure data quality frameworks, implementing automated validation mechanisms, comprehensive testing strategies, and detailed reporting systems while maintaining high data accuracy standards.
Strong understanding of Azure-based metadata management, implementing robust cataloging systems, maintaining detailed data lineage, and ensuring comprehensive documentation across enterprise data solutions and platforms.
Developed streamlined data ingestion and integration using Apache Tez for large-scale big data ETL tasks.
Configured and implemented Zookeeper to ensure efficient coordination and synchronization of distributed data processing systems.
Demonstrated expertise in implementing advanced serialization techniques to optimize data storage, transfer, and deserialization processes.
Optimized performance tuning for OLAP/OLTP in Azure environments, enhancing query execution and data retrieval efficiency.
Scheduled and monitored data workflows with Control-M and Apache Airflow to coordinate the execution of complex tasks.
Exceptional command of diverse file formats such as Parquet, CSV, JSON, Avro, and ORC for efficient storage and exchange within data pipelines.
Facilitated adoption of DevOps practices, implemented version control (Git, GitHub, Repo), and supported the setup of automated CI/CD pipelines for faster software delivery across multiple development environments.
Experienced in working within the Agile Scrum methodology, participating extensively in sprint planning, daily Scrum updates, and retrospective meetings.
TECHNICAL SKILLS:
Azure Technologies: Azure Data Factory (ADF), Azure Synapse Analytics, Azure Event Hubs, PolyBase, Azure SQL Server, Azure Stream Analytics, Azure Data Lake Storage Gen2, Azure Logic Apps, Azure Function Apps, Azure Blob Storage, Azure Databricks, Azure Virtual Machines, Azure Data Lake Analytics, Azure Active Directory, Azure Data Catalog, Azure Monitor, Microsoft Purview, Data Governance, Data Management, Azure Cosmos DB, Azure Key Vault, Azure DevOps, Azure HDInsight, Azure Log Analytics, Azure Service Bus, Azure Resource Manager Templates, Azure Application Insights, Azure CLI.
Big Data Technologies: Hadoop, HDFS, MapReduce, YARN, Hive, Spark, Sqoop, Pig, HBase, Flume, Kafka, Oozie, Zookeeper, Performance Tuning
Databases: Oracle, MySQL, Microsoft SQL Server, MongoDB, Cassandra, Sybase, PostgreSQL
Programming Languages: Python, PySpark, Scala, R, SQL, PL/SQL, Shell scripting
Automation Technologies: Jenkins, Terraform, Docker, Kubernetes, Apache Airflow, GitLab CI/CD
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, SQL Navigator, SQL Server Management Studio.
Version Control: SVN, Git, GitHub, Maven, Bitbucket, GitLab
Visualization/Reporting: Tableau, Power BI
File Formats: ORC, Parquet, Avro, JSON, Delimited (CSV), XML, TXT, Excel
EDUCATION
Bachelor's in Computer Science and Engineering, JNTU-H, India, 2016
CERTIFICATIONS
AZ-900 Microsoft Azure Fundamentals
DP-203 Microsoft Azure Data Engineer Associate
WORK EXPERIENCE
Client: Premera Blue Cross, Mountlake Terrace, WA Nov 2023 – Present
Role: Senior Data Engineer
Responsibilities:
Architected a medallion architecture in Azure Data Lake, implementing bronze/silver/gold zones with retention policies, utilizing Azure Data Factory tumbling window triggers connecting to Azure Databricks Auto Loader pipelines with PySpark (see the sketch at the end of this section).
Optimized Azure Databricks Delta Live Tables and Unity Catalog implementation leveraging PySpark RDDs on auto-scaling clusters with SQL stored procedures processing data in Azure Data Lake hierarchical namespace zones.
Engineered enterprise-scale Azure Data Factory metadata-driven pipelines with mapping dataflows and custom activities, orchestrating Azure Databricks notebook workflows and job clusters connected to Azure Synapse serverless pools through Python functions.
Developed Azure Synapse dedicated SQL pools with materialized views and column store indexes, utilizing Python stored procedures and dynamic SQL queries, integrated with Azure Data Factory parallel copy activities.
Created CI/CD-enabled Azure Data Factory self-hosted integration runtime with custom activities, orchestrating Azure Synapse workload management through Python monitoring and automated Git pull requests.
Enhanced Azure Databricks photon engine configurations with advanced PySpark optimization techniques, processing Azure Data Lake snapshots through Azure Data Factory dependency chaining and incremental loads.
Designed Azure Synapse dynamic PolyBase external tables utilizing Azure Data Factory sequential activities, connecting to Azure Databricks structured streaming jobs with PySpark vectorized UDFs.
Developed real-time streaming solutions using Azure Data Factory event-driven triggers connecting Azure Databricks checkpointing mechanisms with PySpark windowing functions writing to Azure Data Lake immutable storage.
Optimized resource utilization in Azure Synapse through workload importance classification and Azure Databricks cluster policies, managed by Azure Data Factory concurrent pipeline execution with Python monitoring.
Developed comprehensive data profiling in Azure Databricks using Ganglia metrics with PySpark profiling functions and Python analysis of Azure Data Lake storage tiers through Azure Data Factory monitoring.
Implemented row-level security in Azure Synapse with dynamic data masking and Azure Databricks table ACLs, orchestrating secure data movement through Azure Data Factory managed identities with Python encryption.
Created self-service analytics combining Azure Synapse serverless pools and Azure Databricks SQL warehouses, processing Azure Data Lake lifecycle-managed data through Azure Data Factory metadata-driven pipelines.
Implemented Azure Databricks workflow orchestration using spot instances and job clusters, processing data through PySpark broadcast variables and Python UDFs, integrated with Azure Synapse serverless pools.
Created MLflow tracking enabled Azure Databricks machine learning pipelines using PySpark ML libraries and Python scikit-learn models, processing Azure Data Lake mounted data through Azure Data Factory sequential orchestration.
Implemented dynamic pipeline framework in Azure Data Factory with custom parameters, integrating Azure Databricks secrets scopes connecting to Azure Data Lake managed identities through Python SDK automation.
Developed and optimized SQL queries for data extraction, transformation, and loading (ETL) processes across various platforms, ensuring data accuracy and efficiency.
Managed and optimized data storage and processing using various file formats such as Parquet, JSON, and CSV, ensuring compatibility and efficiency across multiple systems.
Utilized specific file formats such as Parquet and ORC for efficient storage and querying in big data environments, improving data retrieval speeds.
Automated Azure resource management using PowerShell scripts, reducing manual configuration time by 40% and improving deployment efficiency.
Developed and optimized complex SnowSQL queries for efficient data extraction, transformation, and loading (ETL) operations within Snowflake.
Designed and implemented Snowflake Schema for optimized data warehousing, enabling efficient data storage and retrieval.
Developed Star Schema data models for OLAP systems, ensuring efficient query performance and simplified data navigation.
Developed interactive Power BI dashboards to visualize key business metrics, improving decision-making processes and enabling real-time insights for stakeholders.
Integrated Power BI with Azure Synapse Analytics for real-time data visualization, reducing the time to insight by 40%.
Worked in a fast-paced Agile environment, collaborating with cross-functional teams to deliver high-quality data solutions on time and within budget.
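Representative sketch (illustrative only, not client code): a minimal PySpark Auto Loader job of the kind described in the medallion-architecture bullet above, reading raw files from a bronze zone and writing a deduplicated Delta table to silver. Storage account, container paths, and the claim_id key are hypothetical placeholders, and spark is assumed to be the Databricks-provided session.

from pyspark.sql import functions as F

# Auto Loader stream over the bronze landing zone (paths are illustrative).
bronze_stream = (
    spark.readStream
        .format("cloudFiles")                      # Databricks Auto Loader source
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "abfss://bronze@lakedemo.dfs.core.windows.net/_schemas/claims")
        .load("abfss://bronze@lakedemo.dfs.core.windows.net/claims/")
)

# Deduplicate and land the records as a Delta table in the silver zone.
(
    bronze_stream
        .withColumn("ingest_ts", F.current_timestamp())   # audit column for lineage
        .dropDuplicates(["claim_id"])
        .writeStream
        .format("delta")
        .option("checkpointLocation", "abfss://silver@lakedemo.dfs.core.windows.net/_checkpoints/claims")
        .trigger(availableNow=True)                        # batch-style run, suited to an ADF-scheduled job
        .start("abfss://silver@lakedemo.dfs.core.windows.net/claims/")
)

The availableNow trigger lets a scheduled Azure Data Factory trigger invoke the job while Auto Loader tracks which files have already been processed.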
Client: Square, San Francisco, California Feb 2022 – Nov 2023
Role: Senior Data Engineer
Responsibilities:
Engineered payment processing pipeline managing 1M+ daily merchant transactions using Azure Synapse Analytics with PolyBase, implementing Python validation frameworks and PySpark DataFrames through Azure Databricks for robust data processing.
Developed Point of Sale analytics utilizing Azure Data Factory with mapping dataflows and runtime services, monitoring real-time inventory across Data Lake Gen2 zones using Python UDFs and advanced SQL optimization patterns.
Built payment platform leveraging Azure Databricks Delta Live Tables and Unity Catalog with MLflow, orchestrating complex Python and PySpark ML pipelines through Azure Synapse serverless pools and message queues.
Led enterprise migration to Azure Synapse dedicated pools with column-store indexes, implementing advanced Python Pandas optimizations and PySpark broadcast variables across Data Lake zones for enhanced performance.
Implemented CDC pipeline handling 2M+ records using Azure Data Factory tumbling window triggers, integrating with Azure Databricks Auto Loader and SQL materialized views for efficient data synchronization (see the first sketch at the end of this section).
Developed comprehensive data governance utilizing Azure Data Lake RBAC with ACLs, implementing Azure Databricks Unity Catalog and secret scopes with Python encryption for enterprise-grade security standards.
Created analytics platform processing 100TB+ using Azure Synapse with dynamic PolyBase, orchestrating Azure Data Factory incremental loads with PySpark broadcast joins and advanced aggregation frameworks.
Engineered robust streaming architecture leveraging Azure Databricks structured streaming with checkpointing, processing Kafka events through Python scikit-learn models into optimized Delta Lake storage layers (see the second sketch at the end of this section).
Implemented enterprise MLOps solutions utilizing Azure Data Factory activities to orchestrate Azure Databricks MLflow experiments with PySpark ML models and integrated SQL feature store systems.
Developed financial data warehouse in Azure Synapse, implementing row-level security patterns and managing complex Azure Data Factory metadata through Python Airflow monitoring and alerting frameworks.
Built enterprise IoT analytics platform using Azure Databricks job clusters and spot instances, implementing sophisticated PySpark window functions into Azure Data Lake medallion architecture for scalable data processing.
Created a comprehensive reporting platform utilizing Azure Synapse serverless pools with advanced statistics and partitioning strategies, orchestrating optimized Azure Data Factory pipelines for enterprise data delivery.
Implemented robust data quality frameworks using Azure Databricks Delta Live Tables and Great Expectations, leveraging Python assertions and PySpark schema enforcement across multiple production environments.
Engineered scalable metadata-driven ETL using Azure Data Factory self-hosted runtime with parallel copies, connecting Azure Synapse through optimized PySpark UDFs and broadcast variables for performance.
Developed high-performance real-time solutions utilizing Azure Databricks photon engine and Ganglia metrics with Azure Synapse Link, implementing advanced Python vectorization for data processing.
Implemented comprehensive best practices for data partitioning and indexing strategies in Azure Synapse, optimizing query performance and SQL Server database operations for enterprise workloads.
Secured critical data pipelines by implementing Azure Key Vault integration for managing secrets and certificates, ensuring robust security compliance across multiple production environments and data zones.
Managed optimized data storage implementing ORC, Parquet, Avro, JSON, and CSV file formats, ensuring maximum compatibility and efficiency across enterprise data processing pipelines.
Developed scalable reporting solutions through collaboration with data analysts, implementing advanced data processing frameworks to deliver actionable insights for business decision-making.
Enabled enterprise-wide data visualization through Power BI integration with Azure Synapse, implementing optimized refresh patterns and incremental loading for real-time business intelligence dashboards.
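Representative sketch (illustrative only): a Delta Lake MERGE of the kind a CDC pipeline such as the one above might apply to synchronize a curated table. Table paths, the transaction_id key, and the op change-type column are assumed placeholders, and spark is the Databricks-provided session.

from delta.tables import DeltaTable

# Change records staged by the ingestion step (path and schema are hypothetical).
changes_df = spark.read.format("delta").load("abfss://staging@lakedemo.dfs.core.windows.net/cdc/transactions")

target = DeltaTable.forPath(spark, "abfss://curated@lakedemo.dfs.core.windows.net/transactions")

(
    target.alias("t")
        .merge(changes_df.alias("s"), "t.transaction_id = s.transaction_id")
        .whenMatchedDelete(condition="s.op = 'D'")        # propagate source deletes
        .whenMatchedUpdateAll(condition="s.op = 'U'")     # apply updates to existing rows
        .whenNotMatchedInsertAll()                        # insert new transactions
        .execute()
)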
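Representative sketch (illustrative only): structured streaming with checkpointing into Delta, in the spirit of the streaming-architecture bullet above. The Kafka broker, topic, event schema, and mount paths are hypothetical, and spark is the Databricks-provided session.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

event_schema = StructType([
    StructField("merchant_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Parse payment events from Kafka (broker and topic names are placeholders).
events = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "payments")
        .load()
        .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
)

# Windowed per-merchant totals written to Delta with a checkpoint for fault tolerance.
(
    events
        .withWatermark("event_ts", "10 minutes")
        .groupBy(F.window("event_ts", "5 minutes"), "merchant_id")
        .agg(F.sum("amount").alias("total_amount"))
        .writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/mnt/checkpoints/payments_agg")
        .start("/mnt/delta/payments_agg")
)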
Client: Virtusa, India Oct 2017 – Dec 2021
Role: Data Engineer
Responsibilities:
Developed end-to-end data pipelines using Azure Data Factory, Azure Synapse Analytics, and Azure Databricks to streamline data ingestion, transformation, and storage for multiple business units, enhancing data accessibility by 40%.
Engineered and implemented ETL jobs using Spark-Scala to facilitate data migration from Oracle to new MySQL tables, ensuring data integrity and improved performance.
Architected and optimized Azure Databricks ETL pipelines using Delta Lake for migrating data between Azure SQL Database and Azure Data Lake Storage Gen2, implementing robust monitoring and validation frameworks.
Developed and maintained production-grade Azure Databricks notebooks utilizing Spark-Scala and PySpark for large-scale data transformations, achieving a 40% improvement in processing time across enterprise workloads.
Engineered enterprise data pipelines in Azure Databricks using Delta Lake and Spark Structured Streaming, ensuring data quality and reliability while processing over 500TB of daily data volumes.
Implemented and orchestrated Azure Data Factory pipelines integrated with Azure Databricks notebooks for complex ETL workflows, incorporating parallel processing and comprehensive error-handling mechanisms.
Created robust real-time streaming solutions using Azure Event Hubs with Azure Databricks, processing over 1 million events per second while maintaining sub-second latency and fault tolerance.
Leveraged Azure Blob Storage and Azure SQL Database to architect highly efficient and scalable data storage solutions, reducing data processing costs by 30% through optimized resource allocation.
Managed data access control in Azure Databricks workspaces using Azure Active Directory integration, implementing Secret Scopes and Access Control Lists for enterprise security compliance.
Designed and implemented automated CI/CD pipelines using Azure DevOps for Azure Databricks notebook deployment, establishing comprehensive testing protocols and version control management.
Optimized Azure Databricks cluster configurations and Spark SQL queries through performance tuning and resource management, achieving a 35% reduction in compute costs and processing time.
Developed comprehensive monitoring solutions using Azure Monitor and Log Analytics for Azure Databricks clusters, implementing automated alerting and custom dashboard creation for performance tracking.
Implemented robust security frameworks through Azure Active Directory integration with Azure Databricks, establishing RBAC policies and managing access controls across multiple production environments.
Built scalable data validation framework using Azure Databricks and Delta Lake, incorporating automated quality checks, schema validation, and performance optimization for enterprise data pipelines.
Developed ETL data pipelines using Python and PySpark in Azure Databricks, enabling batch transformations for 50TB daily data while reducing processing time by 20% through code optimization.
Implemented Python solutions for data processing within Azure Functions and Event Hubs, creating fault-tolerant architectures that handled 200K+ events per second with consistent performance.
Engineered production-grade Python and PySpark workflows in Azure Databricks, leveraging Delta Lake and implementing automated testing frameworks that improved pipeline stability by 25%.
Orchestrated complex data ingestion pipelines using Azure Data Factory and Azure Databricks, implementing efficient incremental loading patterns and comprehensive logging for data lineage tracking (see the sketch at the end of this section).
Seamlessly integrated Azure Synapse with Azure Data Lake Storage and Spark pools, creating unified data environments that supported scalable big data processing and machine learning model training.
Created robust data models within Synapse to support ad-hoc reporting and business intelligence needs, resulting in a 25% reduction in time spent on data analysis tasks.
Provided production support by executing sessions, diagnosing issues, and adjusting mappings as needed for evolving business logic, ensuring uninterrupted flow of data and smooth operation of the ETL workflows.
Performed Unit testing and Integration testing on mappings and workflows to validate their functionality and reliability, ensuring the accuracy and integrity of data throughout the ETL process.
Actively engaged in daily status calls with internal teams and delivered detailed weekly updates to clients through comprehensive reports, promoting effective communication, transparency, and project alignment.
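Representative sketch (illustrative only): a watermark-based incremental load of the kind referenced in the ingestion bullet above. The control table, JDBC source, secret scope, and column names are hypothetical, and spark/dbutils are the Databricks-provided objects.

from pyspark.sql import functions as F

# Read the last successfully loaded high-water mark from a small control table.
last_wm = (
    spark.read.format("delta").load("/mnt/control/watermarks")
        .filter(F.col("table_name") == "orders")
        .agg(F.max("high_watermark"))
        .collect()[0][0]
)

# Pull only rows modified since the last run from the Azure SQL source
# (the JDBC URL, including credentials, is assumed to come from a secret scope).
incremental_df = (
    spark.read.format("jdbc")
        .option("url", dbutils.secrets.get(scope="etl", key="sql-jdbc-url"))
        .option("dbtable", "dbo.orders")
        .load()
        .filter(F.col("modified_date") > F.lit(last_wm))
)

# Append the delta to the silver zone; the watermark table would be updated afterwards.
incremental_df.write.format("delta").mode("append").save("/mnt/silver/orders")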
Client: Virtusa, India Aug 2016 – Sep 2017
Role: Big Data Engineer
Responsibilities:
Spearheaded the migration from Informatica (version 10.x) and SSIS to a Hadoop-based ecosystem, successfully transforming legacy ETL processes to support large-scale data processing and reducing data processing time by 40%.
Led the development of high-performance MapReduce programs for processing 10TB+ datasets, implementing custom Combiners and Partitioners, resulting in 30% faster processing time and improved resource utilization.
Configured and optimized Apache Kafka (v1.x) for real-time data ingestion, enabling the integration of high-velocity data streams and reducing latency in data availability for analysis.
Architected and maintained HDFS infrastructure supporting petabyte-scale data operations, implementing replication policies and namespace management while ensuring 99.99% data availability across distributed clusters.
Engineered optimized HiveQL scripts with advanced features like partitioning, bucketing, and ORC file formats, reducing query execution times by 25% and implementing automated data quality checks (see the sketch at the end of this section).
Designed and implemented enterprise Apache Oozie workflows orchestrating complex data pipelines across 100+ nodes, achieving 40% reduction in job failures and streamlining dependency management.
Tuned Hadoop cluster performance and implemented YARN resource management, maximizing cluster utilization and reducing job run times by 20%.
Integrated Apache Kafka with the Hadoop ecosystem for real-time data streaming, processing 1M+ events per second with fault tolerance, resulting in 40% faster data ingestion.
Developed and optimized Sqoop jobs for bi-directional data transfer between Hadoop and enterprise databases (MS SQL, MySQL), implementing incremental loads and parallel processing.
Architected scalable data processing solutions using PySpark and Scala, implementing custom transformations and optimized shuffling operations, achieving 20% improved processing efficiency.
Optimized Hadoop and Spark cluster performance through advanced configuration tuning, resource allocation, and memory management, reducing job execution times by 40% across production workloads.
Implemented distributed Apache Cassandra database architecture supporting 50TB+ data, ensuring high availability through proper partitioning strategies and consistency level configurations.
Deployed Apache Zookeeper for distributed coordination across multiple Hadoop clusters, implementing leader election and configuration management while maintaining 99.9% system uptime.
Engineered robust data ingestion workflows using Apache Flume, implementing custom Source-Channel-Sink configurations for processing 100TB+ daily log data with guaranteed delivery.
Optimized HBase data modeling with efficient row-key design and column family organization, implementing bloom filters and compression, reducing query latency by 45%.
Engineered streaming data solutions using Python with Apache Kafka, handling real-time data processing at scale.
Implemented enterprise job scheduling using Control-M, orchestrating 1000+ daily batch processes with advanced dependency management, reducing failed jobs by 60% through proactive monitoring.
Developed complex Pig scripts for ETL operations, incorporating UDFs and custom load functions, processing 5TB+ daily data while ensuring data quality and transformation accuracy.
Worked with stakeholders to gather requirements, define KPIs, and deliver reports and dashboards using tools like Microsoft SQL Server Reporting Services (SSRS) and Tableau.
Participated in Agile development methodologies including sprint planning, daily stand-ups, and retrospectives.
Set up and managed Linux environments for hosting Hadoop clusters and other data processing systems.
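Representative sketch (illustrative only): a PySpark job writing a partitioned, compressed ORC table into the Hive metastore, in the spirit of the partitioning/ORC bullet above. Paths, database, and column names are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
        .appName("log-etl")
        .enableHiveSupport()               # register the output table in the Hive metastore
        .getOrCreate()
)

# Raw application logs on HDFS (path and schema are hypothetical).
logs = (
    spark.read.json("hdfs:///data/raw/app_logs/")
        .withColumn("event_date", F.to_date("event_ts"))
)

# Partitioned, zlib-compressed ORC table enabling partition-pruned Hive queries.
(
    logs.write
        .partitionBy("event_date")
        .format("orc")
        .option("compression", "zlib")
        .mode("overwrite")
        .saveAsTable("analytics.app_logs")
)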