
SAITEJA KOYA

Data Engineer

**************@*****.*** | +1-510-***-**** | Hayward, CA | LinkedIn

PROFESSIONAL SUMMARY:

Overall 4+ years of experience as a Data Engineer in database development, data warehousing, and big data technologies. Demonstrated skill in programming languages such as Python, Scala, and T-SQL for versatile data processing and analysis. Experienced in big data ecosystems, including Hadoop, Apache Spark, Apache Kafka, HBase, Kinesis Firehose, DynamoDB, MapReduce, Oozie, Hive, HDFS, and Sqoop, to facilitate efficient management of large-scale data processing. Worked on dimensional data modeling using star and snowflake schemas, including slowly changing dimensions. Leveraged comprehensive knowledge of cloud platforms, particularly AWS (EC2, S3, Lambda, EMR, CloudWatch, Glue, Kinesis, Redshift) and Azure (Data Factory, Databricks), to guarantee seamless integration and scalability in cloud environments.

TECHNICAL SKILLS:

Languages: Python, SQL, PL/SQL, HTML, CSS, PySpark, Scala, C, C++, Shell Script

ETL and Database Management: Azure Data Factory (ADF), AWS Glue, Apache Airflow, Databricks (PySpark, Scala), SSIS, Apache Kafka, Informatica

Data Analysis and Visualization: Power BI, Python, Excel, Azure Data Factory, Databricks, SSMS, Toad, Tableau, SSRS

DevOps & CI/CD: Azure DevOps, GitHub Actions, Terraform

Databases: Oracle, MySQL, PostgreSQL, MongoDB, Snowflake, T-SQL, DynamoDB

Cloud Platforms: AWS (EC2, ECS, Lambda, S3, RDS, IAM), Azure (App Service, SQL Database, Storage, Synapse, ADF), GCP (BigQuery)

Big Data Ecosystems: Cloudera Distribution, HDFS, YARN, MapReduce, Pig, Sqoop, Kafka, HBase, Hive, Cassandra, Spark, Storm, Scala, Impala, Hadoop, BigQuery

Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow

Tools: PyCharm, Eclipse, Visual Studio, Postman, Kubernetes, Docker

Skills: Problem Solving, Analytics, Virtualization, Critical Thinking, Adaptability, Quality Assessment

WORK EXPERIENCE:

Client: Denver Health, CO, USA July 2024 – Present

Role: Data Engineer

Responsibilities:

Developed and streamlined ETL processes using Azure Data Factory (ADF), Databricks (PySpark, Scala), and Delta Lake, enabling quick ingestion and processing of clinical data and patient histories from various healthcare systems.

Designed real-time data streaming applications with Apache Kafka and Azure Event Hubs, reducing latency by 50% for live updates of laboratory results, medication orders, and patient admissions.
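
Illustrative sketch of such a streaming job: a PySpark Structured Streaming consumer reading an Event Hubs Kafka-compatible endpoint and landing events in Delta Lake. The topic, schema, endpoint, and paths are hypothetical placeholders, not the actual production configuration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("lab-results-stream").getOrCreate()

# Hypothetical schema for a laboratory-result event.
schema = StructType([
    StructField("patient_id", StringType()),
    StructField("test_code", StringType()),
    StructField("result_value", StringType()),
    StructField("observed_at", TimestampType()),
])

# Azure Event Hubs exposes a Kafka-compatible endpoint, so the stock Kafka
# source can consume it (SASL auth options omitted for brevity).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "example-ns.servicebus.windows.net:9093")
       .option("subscribe", "lab-results")
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Land the parsed stream in a Delta table for downstream clinical analytics.
(events.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/checkpoints/lab_results")
       .start("/mnt/delta/lab_results"))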

Implemented Snowflake and Azure Synapse Analytics as the core data warehouse, optimizing performance with advanced partitioning and clustering techniques for efficient patient data analysis.

Automated data orchestration tasks with Apache Airflow and ADF, ensuring consistent synchronization of EHR updates, medical claims, and clinical reports with 99.9% reliability.
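
A minimal Airflow sketch of this kind of orchestration; the DAG id, schedule, and task callables are illustrative stand-ins, not the actual hospital jobs.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def sync_ehr_updates():
    # Placeholder for the extract/merge logic against the EHR feed.
    print("syncing EHR updates")

def sync_claims():
    # Placeholder for the claims synchronization step.
    print("syncing medical claims")

with DAG(
    dag_id="healthcare_sync",          # hypothetical name
    start_date=datetime(2024, 7, 1),
    schedule="@hourly",                # 'schedule' requires Airflow 2.4+
    catchup=False,
) as dag:
    ehr = PythonOperator(task_id="sync_ehr_updates", python_callable=sync_ehr_updates)
    claims = PythonOperator(task_id="sync_claims", python_callable=sync_claims)
    ehr >> claims                      # claims sync runs after the EHR sync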

Built and maintained a secure healthcare data lake on Azure Data Lake Gen2, integrating patient care data, claims, and medical records to support unified clinical analytics.

Applied RBAC, column-level encryption, and HIPAA-compliant data masking to enhance data governance and ensure the protection of sensitive patient information across healthcare systems.

Integrated machine learning pipelines with Azure ML for predictive healthcare analytics, supporting proactive care initiatives such as patient readmission forecasting and no-show prediction.

Optimized cloud infrastructure with Terraform (IaC), automating deployment and scaling of healthcare data resources, achieving significant cost savings.

Developed API-based data services using FastAPI, enabling seamless integration of real-time patient data with clinical systems, improving care provider decision-making.
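
A minimal FastAPI sketch of such a data service; the route, model fields, and in-memory lookup are hypothetical placeholders for a real warehouse query.

from typing import Optional
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class PatientSummary(BaseModel):
    patient_id: str
    last_admission: Optional[str] = None
    open_orders: int = 0

# Stand-in for a real query against the warehouse or feature store.
FAKE_STORE = {"p-001": PatientSummary(patient_id="p-001", open_orders=2)}

@app.get("/patients/{patient_id}", response_model=PatientSummary)
def get_patient(patient_id: str):
    summary = FAKE_STORE.get(patient_id)
    if summary is None:
        raise HTTPException(status_code=404, detail="patient not found")
    return summary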

Monitored and optimized data pipeline performance with Azure Monitor, Prometheus, and the ELK Stack (Elasticsearch, Logstash, Kibana), improving data processing reliability.

Collaborated in an Agile environment, working closely with clinical IT and data analytics teams to meet hospital goals and deliver timely, data-driven solutions.

Delivered insightful business intelligence reports and visualizations using Power BI and Tableau, providing actionable insights to hospital administrators and clinical staff.

Systel Inc., India Nov 2020 – Dec 2022

Role: Data Engineer

Responsibilities:

Designed and developed modern data solutions enabling data visualization through AWS cloud resources. Assessed the impact of new releases on existing business activities by examining the condition of applications in production.

Constructed connections, datasets, and pipelines in AWS Glue to create ETL pipelines that extract, transform, and ingest data from different sources, including write-back tools, Amazon Redshift, S3, and RDS.
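
A hedged sketch of such a Glue job using the standard Glue PySpark skeleton; the catalog database, table, column mappings, and S3 path are hypothetical.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog (placeholder database/table).
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_orders")

# Simple column mapping as the transform step.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")])

# Write curated output back to S3 (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()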

Configured Spark Streaming to consume real-time data from Kafka and save it to Amazon S3 for downstream processing.

Created ETL jobs using PySpark, taking advantage of the DataFrame API and Spark SQL API. Utilized Spark for multiple transformations and operations, staging result data in S3 before loading it into the final database, Snowflake.
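
A sketch of a batch job in that shape, mixing the DataFrame API with Spark SQL, staging to S3, then loading Snowflake via the Spark-Snowflake connector; paths, table names, and connection options are placeholders (credentials omitted).

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/raw/orders/")
orders.createOrReplaceTempView("orders")

# Spark SQL for the aggregation, DataFrame API for the follow-up filter.
daily = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""").filter(col("total_amount") > 0)

# Stage the result to S3 before the final load.
daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_orders/")

# Load into Snowflake (connector must be on the classpath; auth omitted).
(daily.write
      .format("net.snowflake.spark.snowflake")
      .options(sfURL="account.snowflakecomputing.com",
               sfDatabase="ANALYTICS", sfSchema="PUBLIC",
               sfWarehouse="ETL_WH", dbtable="DAILY_ORDERS")
      .mode("overwrite")
      .save())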

Developed Spark jobs to ingest, manipulate, and consolidate data from multiple file formats utilizing PySpark and Spark SQL. Investigated and refactored these jobs to understand client usage patterns.

Responsible for monitoring and debugging EMR Spark clusters and forecasting their size based on workloads.

Implemented and scaled out data pipelines with Amazon Redshift and AWS Glue, efficiently processing large-scale manufacturing data sets and delivering real-time insights for improved decision-making and business operations.

Integrated real-time data streaming using Amazon Kinesis and Kafka to enable continuous tracking of production processes and machinery, helping forecast maintenance needs ahead of time and lowering downtime.

Built fast, centralized data workflows with Apache Airflow and AWS Glue Workflows to process data efficiently and on time, improving pipeline performance and availability.

Leveraged Amazon S3 to store and manage large volumes of unstructured manufacturing data from sensors and IoT devices and made it readily available for analysis, improving system performance.

Streamlined and designed Spark-based data transformation processes with PySpark and EMR, converting data from various sources like CSV, Parquet, and JSON, making it accessible for business intelligence teams to develop actionable insights.
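
A minimal sketch of that multi-format normalization pattern; all paths are placeholders, and the union assumes the three feeds share compatible column names and types.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-normalizer").getOrCreate()

# Ingest the same logical dataset from heterogeneous sources.
csv_df = spark.read.option("header", "true").csv("s3://example-bucket/raw/csv/")
json_df = spark.read.json("s3://example-bucket/raw/json/")
parquet_df = spark.read.parquet("s3://example-bucket/raw/parquet/")

# Align columns by name and union into one dataset (Spark 3.1+).
combined = (csv_df
            .unionByName(json_df, allowMissingColumns=True)
            .unionByName(parquet_df, allowMissingColumns=True))

# Publish a single columnar copy for BI consumption.
combined.write.mode("overwrite").parquet("s3://example-bucket/curated/unified/")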

Enhanced operational efficiency with real-time data analytics to predict machine breakdowns and plan production, reducing unplanned downtime and optimizing asset performance.

Used container platforms like Docker and Kubernetes (EKS) to deploy and run Spark jobs on AWS infrastructure for better resource utilization and scalability.

Implemented data governance processes using AWS Glue Data Catalog and AWS Lake Formation to monitor data lineage and meet compliance requirements, simplifying audit trails and maintaining data integrity.

HCL Technologies, India Sep 2019 – Nov 2020

Role: Jr. Data Engineer

Responsibilities:

Created ETL scripts for data acquisition and transformation using Informatica and Talend.

Utilized a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL to extract, transform, and load data from source systems into Azure data storage services, specifically Azure Data Lake Analytics. Ingested data into Azure services, including Azure Data Lake, Azure Storage, Azure SQL, and Azure Data Warehouse, and processed it in Azure Databricks.

Responsible for estimating cluster sizes while monitoring and troubleshooting Spark Databricks clusters.

Set up and configured Hive, Pig, Sqoop, and Oozie on a Hadoop cluster while benchmarking clusters for internal applications.

Employed Spark for preprocessing to eliminate missing data and create new features during data transformation.

Executed business logic in Hadoop through User Defined Functions (UDFs) in Hive to interact with Hadoop data in HBase.
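
Hive UDFs like these are typically written in Java; as an illustration of the same pattern in PySpark, the sketch below registers a Python UDF and applies it through Spark SQL against a Hive-managed table. Function, column, and table names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-demo")
         .enableHiveSupport()
         .getOrCreate())

def normalize_code(code):
    # Example business rule: trim and upper-case record codes.
    return code.strip().upper() if code else None

# Register the function so it can be called from SQL.
spark.udf.register("normalize_code", normalize_code, StringType())

# Apply the UDF in a query against a Hive table (placeholder name).
spark.sql("SELECT normalize_code(record_code) AS code FROM hive_records").show()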

Designed and developed proof-of-concepts (POCs) in Spark utilizing Scala, comparing performance metrics with Hive and Oracle SQL.

Helped create and tune ETL pipelines using Azure Data Factory and Apache Spark to move and process data into cloud storage such as Azure Data Lake, making data more accessible and usable for teams.

Worked with lead engineers to build SQL queries that validated data, ensuring it was correct and properly transformed when moved to other systems.

Assisted in monitoring Azure Databricks clusters to ensure that they were adequately sized and running well to support the load of our data pipelines.

Assisted with the installation and upkeep of data storage solutions such as Azure Data Lake and SQL databases, ensuring everything ran efficiently and securely.

Collaborated with the team on queries in Hive and HBase, improving how large data sets are retrieved and making data retrieval faster for the business.

Helped document data pipeline procedures and transformation rules so that the entire team could understand and work with the workflows efficiently.

EDUCATION:

California State University, East Bay — M.S. in Business Analytics Jan 2023 – Dec 2024

SRKR Engineering College, India — Bachelor of Technology Aug 2016 – Sept 2020

CERTIFICATIONS:

AWS Certified Data Engineer - Associate.


