Sr Cloud Data Engineer

Location:
Garland, TX, 75044
Posted:
April 21, 2025

Karthik Kondapalli

Cloud Data Engineer

*******************@*****.***

+1-234-***-****

https://www.linkedin.com/in/karthik-kondapalli-2b1513164/

PROFESSIONAL SUMMARY:

Architected, developed, and optimized robust data solutions over 10 years as an AWS Data Engineer, specializing in scalable data pipelines and data migration.

Expertly leveraged a comprehensive suite of AWS services, including DynamoDB, S3, Athena, Glue, Lambda, ECS, Glue Data Quality, EventBridge, Redshift, Machine Learning, OpenSearch, and RDS for diverse data engineering needs.

Demonstrated strong proficiency in data modeling (Star and Snowflake schemas), distributed SQL concepts (Presto, Hive), and Big Data technologies like PySpark and Redshift to efficiently manage and process large datasets.

Designed and implemented fault-tolerant APIs using RESTful standards and GraphQL, utilizing Swagger/OpenAPI for effective API design and documentation.

Successfully implemented advanced anomaly detection systems and data quality frameworks, incorporating AI/ML methodologies and a practical understanding of Large Language Models (LLMs) to ensure data integrity.

Managed the end-to-end data lifecycle, from ETL development (including SSIS) and data warehousing to efficient data archiving in data lakes (S3).

Proficiently utilized monitoring and observability tools like Prometheus and Grafana to ensure system reliability, implement proactive alerting, and maintain system health.

Experienced with infrastructure automation tools such as AWS CloudFormation, Terraform, and Ansible for streamlined deployment and management of data solutions within Unix/Linux environments using Bash scripting.

Developed insightful data marts and visualizations using tools like QuickSight, Looker Studio, Tableau, and Power BI, with proven ability to work effectively across both the AWS and GCP cloud platforms.

Highly knowledgeable in developing reports, dashboards, data preparation, and data visualizations using QuickSight, Looker Studio, Tableau, and Power BI.

TECHNICAL SKILLS:

Cloud Environment: AWS, Google Cloud Platform (GCP)

ETL Tools: Alteryx, Informatica

Visualization Tools: Tableau, Power BI

Big Data Ecosystems: Hadoop, Spark, Hive, Kafka, Sqoop, Oozie, Flume, Airflow

Programming/Scripting Languages: Python, Java, SQL, Scala, Shell, Bash

NoSQL Database: MongoDB, Cassandra, Redis, Neo4j, Apache HBase

Database: Oracle, MySQL, PostgreSQL, MS SQL SERVER, Google Cloud SQL, Snowflake

Version Control: Git, Bitbucket

Application Server: Apache Tomcat 5.x/6.0, JBoss 4.0

Operating Systems: Windows, Linux, Unix

CERTIFICATION:

Google Cloud Certified Professional Data Engineer

AWS Certified Solutions Architect Associate

Oracle Certified Associate Oracle Database SQL

PROFESSIONAL EXPERIENCE:

CVS Health, Irving TX August 2023 – Present

Senior Data Engineer

Responsibilities:

Architected and optimized data pipelines using Python (PySpark) on AWS EMR, achieving a 50% improvement in data transformation efficiency for datasets exceeding 10TB weekly.
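
To illustrate this kind of EMR transformation job, a minimal PySpark sketch follows; the bucket paths, column names, and aggregation logic are hypothetical placeholders rather than the actual CVS Health pipelines.

# Illustrative PySpark job of the kind submitted to an EMR cluster; all names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-transform").getOrCreate()

# Read raw data from S3 (placeholder path)
raw = spark.read.parquet("s3://example-raw-bucket/claims/")

# Typical cleanup and aggregation step
daily = (
    raw.filter(F.col("status") == "PROCESSED")
       .withColumn("event_date", F.to_date("event_ts"))
       .groupBy("event_date", "plan_id")
       .agg(F.count("*").alias("claim_count"),
            F.sum("amount").alias("total_amount"))
)

# Write partitioned Parquet back to a curated zone (placeholder path)
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-curated-bucket/claims_daily/")

spark.stop()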

Reduced operational costs by 40% by developing and maintaining serverless applications with AWS Lambda and S3 for streamlined data processing workflows handling over 5 million daily events.
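
A minimal sketch of an S3-triggered Lambda handler of the sort described above; the bucket layout, record format, and output prefix are assumptions for illustration only.

# Hypothetical S3-triggered Lambda handler; bucket names and prefixes are placeholders.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Each record corresponds to one S3 object-created event
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch the raw object
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Placeholder transformation: parse JSON lines and keep rows with an event_id
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]
        valid = [r for r in rows if r.get("event_id")]

        # Write the cleaned output under an assumed "processed/" prefix
        s3.put_object(
            Bucket=bucket,
            Key=f"processed/{key}",
            Body="\n".join(json.dumps(r) for r in valid).encode("utf-8"),
        )

    return {"statusCode": 200, "processed": len(records)}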

Ensured 95% on-time data delivery for critical business requirements by orchestrating complex workflows with Apache Airflow, managing over 20 key data integrations.
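
A skeleton Airflow DAG showing the orchestration pattern referred to above; the DAG id, schedule, and task callables are placeholders, not the actual integrations.

# Skeleton Airflow DAG; dag_id, schedule, and tasks are illustrative assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**_):
    print("extract step")

def transform(**_):
    print("transform step")

def load(**_):
    print("load step")

with DAG(
    dag_id="example_daily_integration",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Linear dependency: extract -> transform -> load
    t_extract >> t_transform >> t_load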

Automated AWS infrastructure provisioning and management using Terraform and CloudFormation, cutting setup time by 40% and managing over 100 cloud resources.

Boosted data transformation speed by 35% by leveraging Databricks with Apache Spark for processing and analyzing over 500 GB of data daily.

Improved operational efficiency by 30% by implementing Kinesis for real-time streaming of over 1 million data records per hour for timely analysis and decision-making.
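
A minimal boto3 sketch of publishing records to a Kinesis data stream; the stream name, region, and payload shape are illustrative assumptions.

# Hypothetical Kinesis producer; stream name and payload fields are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish(record: dict, stream_name: str = "example-events-stream") -> None:
    # PartitionKey controls shard assignment; here we key on a user id
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record.get("user_id", "unknown")),
    )

if __name__ == "__main__":
    publish({"user_id": 123, "event": "page_view", "ts": "2024-01-01T00:00:00Z"})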

Enhanced trust in analytics by 40% by developing and implementing data quality frameworks that monitor over 99.9% data accuracy across key datasets.
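
A simplified sketch of a rule-based data quality check of the kind described, written with PySpark; the dataset, columns, and alerting behavior are hypothetical.

# Hypothetical rule-based data quality checks; dataset path and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://example-curated-bucket/claims_daily/")

total = df.count()
checks = {
    # Completeness: key column must not be null
    "null_plan_id": df.filter(F.col("plan_id").isNull()).count(),
    # Validity: amounts must be non-negative
    "negative_amount": df.filter(F.col("total_amount") < 0).count(),
    # Uniqueness: one row per (event_date, plan_id)
    "duplicate_keys": total - df.dropDuplicates(["event_date", "plan_id"]).count(),
}

failed = {name: n for name, n in checks.items() if n > 0}
if failed:
    # In a real pipeline this would raise an alert (e.g. SNS) rather than print
    print(f"Data quality failures: {failed}")
else:
    print(f"All checks passed on {total} rows")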

Streamlined development and deployment processes by 35% by automating infrastructure management with Ansible and implementing CI/CD pipelines with Jenkins, resulting in bi-weekly release cycles.

Environment: Python (PySpark), AWS, dbt, Apache Spark, Apache Hadoop (HDFS), Apache Airflow, Terraform, Ansible, Docker, Kubernetes, Snowflake, Databricks (Apache Spark SQL, DataFrames, Datasets), Kafka, Git, PyTorch, TensorFlow, Flink, MongoDB, Jenkins

North Texas Tollway Authority, Dallas, TX May 2022 – July 2023

Data Engineer

Responsibilities:

Designed, developed, and maintained over 15 scalable and efficient data pipelines on AWS, ensuring reliable and timely data delivery for critical business needs.

Leveraged Java and Scala in conjunction with Kafka to process and stream large datasets (exceeding 5 TB daily) with low latency for real-time analytics.
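
The production consumers here were written in Java/Scala; as a rough Python analogue, the sketch below uses kafka-python, with the topic, brokers, and message schema assumed purely for illustration.

# Rough Python analogue of a low-latency Kafka consumer (production code was Java/Scala).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "toll-transactions",                           # assumed topic name
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    group_id="realtime-analytics",
    auto_offset_reset="latest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    txn = message.value
    # Placeholder processing: flag high-value transactions for downstream analytics
    if txn.get("amount", 0) > 100:
        print(f"High-value transaction {txn.get('txn_id')} at plaza {txn.get('plaza_id')}")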

Efficiently managed and stored data utilizing a range of AWS services including S3, Redshift, RDS, and DynamoDB, optimizing for cost and performance across over 10 key data repositories.

Developed and optimized over 50 complex SQL queries for data retrieval and reporting on AWS Redshift and RDS, improving query performance by an average of 20%.

Collaborated effectively with a team of 5+ data scientists, analysts, and engineers to understand and support diverse business requirements with robust and scalable data solutions.

Implemented and enforced best practices for data governance, security (IAM, encryption), and compliance (e.g., GDPR, HIPAA where applicable) across all data pipelines and storage solutions on AWS.

Proactively troubleshot and resolved over 30 critical issues related to data pipelines and AWS infrastructure, minimizing downtime and ensuring business continuity with an average resolution time of under 2 hours.

Environment: Java, Scala, Kafka, AWS (S3, Redshift, RDS, DynamoDB, IAM, encryption), SQL.

Walmart, Sunnyvale, CA September 2020 – April 2022

Cloud Data Engineer

Responsibilities:

Independently led analytical development, extracting over 50 novel insights from Medical Data encompassing over 10 million patient records, directly informing 3 key strategic decisions communicated to senior management.

Ensured 99.9% completeness, critical evaluation, correctness, and integrity across over 200 analyses and datasets derived from diverse RWD sources (EHRs, claims, wearables), impacting the reliability of crucial research outcomes.

Established, maintained, and enforced data pipeline standards across 15 critical pipelines, managing the codebase of over 500 scripts using GitHub and Databricks notebooks, resulting in a 20% reduction in pipeline errors.

Designed and optimized data pipelines leveraging AWS services (S3 storing over 50 TB of RWD, ECS for containerized processing, EMR clusters processing over 1 TB daily, Lambda for event-driven tasks, Glue ETL jobs reducing processing time by 30%) and data technologies (Trino/Presto querying across petabytes, Parquet for efficient storage, Hudi/Iceberg for data lake management).

Developed and optimized over 100 complex SQL queries for data retrieval and reporting from AWS Redshift and RDS databases containing over 20 billion records, improving query performance by an average of 25%.

Proficiently utilized R (including Posit Connect for deploying 10+ data science applications) and Python libraries (PyTorch for training 3 deep learning models, Scikit-Learn for 15+ machine learning models, SciPy for statistical analysis on datasets exceeding 10GB) for analytical development and rapid prototyping of over 20 proofs-of-concept from RWD.
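
A minimal scikit-learn sketch of the kind of rapid model prototyping described; synthetic data stands in for the real-world datasets, which cannot be reproduced here.

# Hypothetical prototyping example with scikit-learn on synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))                         # stand-in features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # stand-in label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Hold-out AUC: {auc:.3f}")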

Managed vendor processes for 3 external data providers and the integration of their data into our Snowflake environment (housing over 500 million records), ensuring seamless data flow and resolving over 10 data reformatting issues monthly.

Collaborated closely with 7+ cross-functional team members (biostatistics, data science, cloud operations) to design and implement over 10 data initiatives based on their use cases, while establishing and implementing 5 key data governance practices impacting over 100 internal users.

Environment: AWS EMR, SparkSQL, PySpark, Spark MLlib, AWS Glue, Spark, AWS S3, AWS EC2, AWS Lambda, AWS Kinesis, Scala, Hive, HBase, Pig, Sqoop, Docker, Kubernetes, Ansible, CloudFormation, AWS SQS, AWS SNS, Amazon Redshift, Tableau, XML, Pandas, NumPy, Erwin, Agile, Scrum, JIRA, AWS IAM.

Cardinal Health, Dublin, OH January 2019 – August 2020

Data Engineer

Responsibilities:

Built and maintained over 10 critical data pipelines and ELT processes for efficient data ingestion and transformation within GCP, coordinating tasks among a team of 5 engineers.

Designed and implemented a multi-layered data lake architecture with a star schema for optimal querying in BigQuery, supporting analytics for over 5 key business domains.

Utilized Google Cloud Functions with Python to automate the loading of over 10,000 on-arrival CSV files daily from GCS buckets into BigQuery with near real-time latency.
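
A sketch of such a GCS-triggered Cloud Function; the project, dataset, and table ids are placeholder assumptions.

# Hypothetical Cloud Function (1st gen) triggered by object finalize events on a GCS bucket.
from google.cloud import bigquery

bq = bigquery.Client()

def load_csv_to_bq(event, context):
    """Loads an arriving CSV file into a BigQuery staging table."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Asynchronous load job; result() blocks until the load completes
    job = bq.load_table_from_uri(uri, "example-project.staging.arrivals", job_config=job_config)
    job.result()
    print(f"Loaded {uri} into staging.arrivals")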

Processed and loaded over 5 million bounded and unbounded data records daily from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python, ensuring timely data availability for downstream processing.
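
A minimal Apache Beam (Python) sketch of this streaming Pub/Sub-to-BigQuery pattern as run on Dataflow; the subscription, table, and schema are assumptions for illustration.

# Hypothetical streaming Beam pipeline; subscription, table, and schema are placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/example-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "ToRow" >> beam.Map(lambda d: {
            "event_id": d.get("event_id"),
            "event_ts": d.get("event_ts"),
            "payload": json.dumps(d),
        })
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:analytics.events",
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )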

Designed and implemented over 5 data pipelines with Apache Beam, KubeFlow, and Dataflow, orchestrating over 20 jobs in GCP for automated data processing workflows.

Developed and successfully demonstrated the POC for migrating a 50TB on-premise data warehouse workload to Google Cloud Platform using GCS, BigQuery, Cloud SQL, and Cloud DataProc, proving feasibility and scalability.

Designed, developed, and implemented high-performing ETL pipelines using the Python API (PySpark) of Apache Spark, processing over 2 TB of data daily.

Contributed to a GCP POC focused on migrating over 10 critical data sources and 5 key applications from on-premise infrastructure to Google Cloud, validating migration approaches and timelines.

Managed and implemented IAM roles in GCP for over 20 team members and service accounts, ensuring secure and controlled access to sensitive data and resources.

Created and managed over 15 firewall rules to securely access Google DataProc clusters from various machines, enabling efficient collaboration and development.

Processed and loaded over 3 million bounded and unbounded data records daily from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python, maintaining data freshness for real-time reporting.

Established and managed over 20 GCP firewall rules to control ingress and egress traffic to and from VM instances based on specific configurations. Implemented GCP Cloud CDN to deliver content from over 10 global cache locations, drastically improving user experience and reducing latency by an average of 30%.

Environment: GCP, Data pipelines, ELT, BigQuery, g-cloud functions, Python, GCS, Google Pub/Sub, Cloud Dataflow, Apache Beam, KubeFlow, Apache Spark, PySpark, Cloud SQL, Cloud DataProc, IAM roles, GCP Firewall rules, GCP Cloud CDN, SQL.

CGS, India. July 2017 – December 2018

Junior Data Engineer

Responsibilities:

Managed a comprehensive AWS environment, including S3 for storing over 50 TB of data, provisioned and maintained over 20 EC2 instances for compute, administered 10+ RDS databases ensuring 99.9% uptime, and deployed over 15 Lambda functions for serverless data processing.

Orchestrated over 10 complex data workflows using AWS Step Functions, ensuring timely and reliable execution of critical data pipelines for 5 key business processes.
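
A small boto3 sketch of starting one of these Step Functions workflows from Python; the state machine ARN and input payload are hypothetical.

# Hypothetical trigger for a Step Functions state machine; ARN and input are placeholders.
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="us-east-1")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example-daily-etl",
    input=json.dumps({"run_date": "2024-01-01"}),
)
print("Started execution:", response["executionArn"])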

Leveraged Amazon EMR clusters, comprising up to 30 nodes, for large-scale data processing and analytics on datasets exceeding 100 TB.

Utilized Kinesis Firehose to efficiently deliver over 1 billion processed data records daily to Amazon S3 and Redshift with near real-time latency.

Designed and implemented the User Interface and business logic for a customer registration and maintenance system that handled over 10,000 daily user interactions.

Integrated over 15 web services, facilitating seamless data exchange and processing across diverse server environments, improving data accessibility by 40%.

Designed and developed over 20 SOA services using Web Services, enabling interoperability and data sharing between disparate applications.

Created, developed, and maintained over 50 database objects (PL/SQL packages, functions, stored procedures, triggers, views, materialized views) to extract and transform data from over 10 different sources, supporting critical reporting and analytics.

Environment: AWS (S3, EC2, IAM, RDS, Step Functions, Data Pipeline, Glue, Kinesis Firehose, EMR, Lambda), PL/SQL, SQL*LOADER, UNIX Scripts, Informatica Power Center Designer, Oracle OLTP, BCP, MS Access, Excel, SSIS, Web Services, SOA.

Saven Technology Ltd, India. January 2016 – June 2017

Data Analyst

Responsibilities:

Implemented SQL-based data governance policies and controls to ensure data security, privacy, and compliance with industry regulations, reducing data breaches by 30%.

Assigned role-based access controls in Tableau Server, ensuring data security and compliance, decreasing unauthorized access incidents by 20%.

Implemented custom ETL processes using Alteryx for data preparation and cleansing, resulting in 50% faster data processing times.

Developed analyses and engaging dashboards using Tableau and Excel, providing relevant KPIs and trends within the company and the market to first-tier and second-tier managers.

Utilized Tableau to connect to SQL databases and cloud storage, analyzing and visualizing data for market research in real time.

Conducted data collection, quality assurance, data cleaning, and standardization across datasets from multiple integrated systems, enhancing the accuracy of subsequent market analysis and reporting.

Optimized SQL queries and fine-tuned tables, reducing query response time by 10 seconds per query and improving processing speed, enabling quicker market intelligence reports.

Environment: SQL, Tableau Server, Alteryx, Tableau, Excel, SQL databases, cloud storage.

EDUCATION:

Bachelor of Computer Science Engineering, JNTU, India.


