
Senior Data Engineer

Location: San Jose, CA
Posted: June 27, 2024


Resume:

Yamini Ravuri
Senior Data Engineer

Phone: 312-***-****
Email: ad6sxp@r.postjobfree.com
LinkedIn: https://www.linkedin.com/in/ravuri-yamini-bb97322b8/

Professional Summary:

oI have more than 10 years of experience in analytics, inferential statistics, and data engineering, working with large and complex data sets and platforms across various domains and industries.

oCore competencies include business analytics, product analytics, marketing analytics, statistical modeling, governance, and data engineering.

oHands-on experience with technologies and tools such as Python, Scala, PySpark, SQL, Linux, HDFS, Hive, Sqoop, MapReduce, Kafka, Airflow, EMR, EC2, S3, Redshift, Athena, Glue, SNS, SQS, Lambda, Amazon Bedrock, Step Functions, Snowflake, Databricks, and Tableau.

oStrong experience in Python and Scala, enabling the development of efficient ETL workflows, implementation of data validation rules, and creation of custom data processing scripts.

oProficient in SQL; designed and optimized complex queries for data transformation, aggregation, and analysis.

oExperience in fine-tuning Spark jobs, implementing partitioning strategies, and optimizing Hive queries to deliver high-performance data solutions (an illustrative PySpark sketch follows this summary).

oCreated and maintained Erwin logical and physical data models for Teradata data warehouses across multiple companies.

oStrong familiarity with Teradata tools and utilities for ETL and OLAP application development.

oStrong experience in implementing batch and real-time processing solutions using PySpark and Scala Spark, enabling timely data analysis and insights generation.

oExperienced in optimizing Hive queries and MapReduce jobs for performance, fine-tuning configurations, and implementing caching strategies to improve job execution times.

oProficient in integrating Hive, HDFS, and MapReduce with other Hadoop ecosystem components, such as Pig, Sqoop, and Spark, to streamline data workflows and enable seamless data integration.

oExperience in leveraging EMR security features such as encryption, IAM roles, and VPC settings to protect sensitive data and ensure compliance with security and regulatory requirements.

oSkilled in designing and implementing data pipelines using AWS Glue, automating the extraction, transformation, and loading of data from various sources to target destinations.

oCapable of optimizing AWS Glue ETL jobs for performance and scalability, fine-tuning configurations, and implementing caching strategies to improve job execution times.

oExperienced in utilizing Amazon Simple Storage Service (S3) as a scalable, durable, and secure object storage solution for storing and managing data.

oProficient in performing data analysis and processing using Apache Spark on Databricks, harnessing the distributed computing capabilities for handling large-scale datasets with high performance and efficiency.

oCompetent in scheduling and automating data workflows and machine learning pipelines on Databricks using features such as job scheduling, job clusters, and REST APIs, ensuring timely execution and resource optimization.

oExperienced in leveraging Snowflake as a cloud-based data warehousing platform for storing, processing, and analyzing large volumes of data with high performance and scalability.

oFamiliar with implementing security controls and compliance measures in Snowflake, including encryption-at-rest and in-transit, role-based access control (RBAC), and audit logging, ensuring data security, privacy, and regulatory compliance.

oDelivered technical and analytics projects with a strong command of data warehousing concepts, using Informatica and StreamSets as ETL frameworks alongside Oracle, Teradata, Cassandra, SQL, UNIX scripting, and the Hadoop ecosystem.

oSkilled in using Git for collaborative software development, including branching, merging, and resolving conflicts, enabling multiple developers to work on the same codebase concurrently.

oSkilled in integrating Git with CI/CD pipelines for automated build, testing, and deployment processes, enabling rapid and reliable delivery of code changes to production environments.

oProficient in leveraging Docker for containerization, Kubernetes for container orchestration, and Jenkins for continuous integration and continuous delivery (CI/CD) pipelines, enabling scalable, automated, and reliable deployment of applications in cloud-native environments.

oExperienced in utilizing Tableau for data visualization and analytics, creating interactive dashboards, reports, and visualizations to derive insights and support data-driven decision-making.

oExperienced in utilizing Jira as a project management tool for Agile software development, facilitating sprint planning, task tracking, and progress reporting.
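
The snippet below is a minimal PySpark sketch of the Spark tuning and partitioning approach referenced in the summary above; the paths, column names, and configuration values are illustrative placeholders rather than actual project artifacts.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical example: tune shuffle parallelism and write partitioned output.
spark = (
    SparkSession.builder
    .appName("etl-partitioning-sketch")
    .config("spark.sql.shuffle.partitions", "200")   # tuned to data volume
    .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce small partitions
    .getOrCreate()
)

# Placeholder source path; schema inferred on read.
orders = spark.read.parquet("s3://example-bucket/raw/orders/")

# Basic validation and transformation, then cache for reuse across aggregations.
clean = (
    orders
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_date", F.to_date("order_ts"))
    .cache()
)

daily = clean.groupBy("order_date").agg(F.sum("amount").alias("daily_amount"))

# Partition output by date so downstream queries can prune partitions.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/daily_orders/"
)

Settings such as the shuffle-partition count are typically adjusted per workload rather than fixed globally.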

Technical Skills:

Big Data Ecosystem: HDFS, Hive, Spark, MapReduce, Sqoop, Hadoop, Pig

Languages: Python, Scala, PySpark, Scala Spark, SQL

AWS: EMR, Glue, Athena, Airflow, S3, Lambda, Step Functions, Databricks, Snowflake

Web Servers: WebLogic, WebSphere, Apache Tomcat

Scripting Languages: Shell scripting, UNIX

Containerization Tools: Kubernetes, Docker, OpenShift

Databases: Oracle, Microsoft SQL Server, MySQL, DB2, Snowflake, Teradata SQL, MongoDB, Cassandra, HBase, DynamoDB, HL7/EHR data

IDE & Build Tools: PyCharm, VS Code, IntelliJ, Eclipse, NetBeans, Ant, Maven

Version Control Systems: CVS, SVN, GitHub, Bitbucket

Platforms: Windows, Linux

Professional Experience:

Client: Globe Life – McKinney, TX Feb 2023 – Present

Role: Senior Data Engineer

oSenior Data Engineer working on data problems related to corporate risks, sanctions screening and fraud programs.

oLeveraged Python for various aspects of the project, including scripting, data manipulation, and automation tasks.

oLeveraged expertise in Python, Pandas, NumPy, and PySpark to build and deploy machine learning models and predictive analytics solutions.

oDesigned and developed data models and pipelines to integrate Foundry with external systems, ensuring seamless data flow and accessibility.

oUtilized Spark’s distributed computing capabilities to handle the substantial volume of data flowing between the client and its subsidiaries, ensuring high performance and scalability.

oConducted performance tuning and optimization (Partitioning, caching, and parallelism adjustments) of Spark and PySpark jobs to improve efficiency and reduce processing times.

oUtilized Spark SQL for querying and analyzing large datasets, optimizing query performance and resource utilization for efficient data processing.

oReviewed and approved pull requests, providing detailed feedback and guidance to the Predictive AI development team.

oCreated and managed data pipelines and workflows on Palantir Foundry, integrating various data sources and ensuring data integrity.

oUtilized EMR to run Spark and PySpark jobs at scale, offloading cluster management and allowing the team to focus on data processing rather than infrastructure.

oIntegrated with Snowflake for data warehousing and analytics, ensuring scalability and flexibility in managing project data pipelines and analytics workflows.

oDeveloped a new data schema for the data consumption store serving machine learning and AI models, speeding up processing using SQL, Hadoop, and cloud services.

oUtilized Databricks for collaborative, interactive Spark-based analytics, leveraging features such as notebook-based development, job scheduling, and integration with other cloud services.

oOptimized EMR cluster configurations and resource allocation for performance and cost-efficiency, leveraging instance types, instance fleets, and spot instances.

oUsed Data Flow and Python to build dynamic data workflow pipelines to serve various experiments; led a team of data scientists and engineers to deploy ML pipelines in production using Docker and Kubeflow, and wrote automated test cases to maintain and monitor them.

oUtilized SQL for data querying, transformations, and extracting insights from the data flowing through the AWS Kinesis and Kafka pipelines.

oDesigned and implemented data marts and data warehouse solutions, leveraging dimensional modelling techniques to support complex reporting and analytics requirements, enabling efficient data storage and retrieval for business users.

oDeveloped and maintained data catalogs and metadata repositories using Glue Data Catalog, enabling centralized metadata management and data discovery across AWS services.

oImplemented data quality checks and validation rules using Glue ETL jobs, ensuring data accuracy and consistency in data pipelines.

oImplemented automation solutions for data ingestion, transformation, and loading using ETL frameworks such as Apache Airflow and AWS Glue, streamlining data workflows and reducing manual intervention (an illustrative orchestration sketch follows this section).

oImplemented monitoring and alerting solutions to ensure the reliability and availability of data pipelines, utilizing AWS CloudWatch for proactive monitoring and troubleshooting.

oUtilized AWS Kinesis and Kafka for real-time data flow management between the client and its subsidiaries, ensuring secure and managed data transfer.

oProfound knowledge of Teradata database architecture and experience with Teradata unloading utilities such as FastExport.

oDeveloped Jenkins pipelines to automate deployment and rollback processes for AWS Glue jobs, reducing manual effort significantly.

oExperience in UNIX shell scripting for processing large volumes of data from varied sources and loading into databases like Teradata and Vertica.

oUtilized Jira as a primary project management tool for Agile software development, facilitating sprint planning, task tracking, and progress reporting.

oIntegrated CI/CD pipelines with Git to trigger automated builds and deployments upon code commits and pull requests, enabling continuous integration of changes.

oConducted training sessions to educate team members and stakeholders on machine learning and data engineering best practices.

Environment: Python, SQL, Spark, PySpark, Teradata, machine learning, Hive, Kafka, NumPy, EMR, Glue, S3, EC2, Kinesis, Lambda, Step Functions, Redshift, Pandas, Athena, Databricks, Snowflake, Airflow, Amazon Bedrock, GitHub, Jenkins, Docker, Jira
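
As a hedged illustration of the Airflow-to-Glue automation described in this role, the sketch below shows a daily DAG that triggers a Glue ETL job and waits for completion; the DAG id, job name, region, and script arguments are hypothetical, and the example assumes the apache-airflow-providers-amazon package is installed.

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Hypothetical DAG: nightly trigger of a Glue ETL job, waiting for completion.
with DAG(
    dag_id="nightly_glue_etl",                 # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_glue_job = GlueJobOperator(
        task_id="run_curation_job",
        job_name="curate-claims-data",         # assumed Glue job name
        region_name="us-east-1",               # assumed region
        script_args={"--ENV": "prod"},         # passed through to the Glue script
        wait_for_completion=True,              # fail the task if the Glue run fails
    )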

Client: Takeda Pharmaceutical - Easton, PA Sep 2022 – Feb 2023

Role: Senior Data Engineer

oDesigned and implemented robust data pipelines using Spark and Scala for ETL processes, ensuring scalability, performance, and reliability.

oUtilized Maven to build and package Scala projects into JAR files, automating the build process and managing project dependencies.

oDeveloped and deployed machine learning models using Python and AWS services, including Amazon SageMaker and Amazon EC2.

oOptimized Spark jobs for performance and scalability, fine-tuning configurations, and implementing caching strategies to improve job execution times.

oOrchestrated data workflows using Apache Airflow on Databricks clusters, automating and scheduling data pipeline executions for optimal resource utilization.

oLeveraged Databricks notebooks for data querying and validations, utilizing Python for writing validation scripts and ensuring data quality.

oUtilized Delta tables as staging areas for loading data, optimizing data loading processes and facilitating incremental updates for efficient data processing (an illustrative merge sketch follows this section).

oImplemented Snowflake tables for storing structured data, managing schemas, and optimizing resource utilization within Snowflake.

oLeveraged Kafka for real-time data streaming, enabling timely processing and analysis of streaming data from various sources.

oParticipated in agile development processes, providing rapid feedback and iterating on solutions to meet user requirements.

oConducted data analysis and modeling using SQL and PySpark, optimizing performance and ensuring data quality.

oLeveraged AWS Athena for querying data directly from S3 buckets using standard SQL syntax, enabling ad-hoc analysis and exploration of large datasets without the need for managing infrastructure.

oUtilized AWS S3 for storing staging delta tables, optimizing data storage and management with features like versioning and lifecycle policies.

oUtilized AWS DynamoDB for querying and accessing semi-structured data stored in key-value and document formats, using DynamoDB API and Query Language for efficient data retrieval and manipulation.

oInstalled and maintained PostgreSQL databases and handled their migration to AWS Aurora.

oUtilized Unity Catalog on Databricks for metadata management, enabling efficient data discovery, documentation, and governance.

oManaged Hive tables for specific data sets, ensuring compatibility and interoperability with existing systems and processes within the organization.

oExperience in AWS cloud database migrations, converting existing Oracle and MS SQL Server databases to PostgreSQL, MySQL, and Aurora.

oContainerized Apache Airflow using Docker, streamlining deployment processes, and enabling efficient management of Airflow instances across multiple environments, ensuring scalability of data workflows.

oImplemented data encryption and security measures using HashiCorp Vault, ensuring data confidentiality and compliance with security standards.

oImplemented Git hooks to automate pre-commit and post-commit actions such as code formatting and running unit tests, ensuring code quality and consistency across the codebase.

oCreated lightweight tags in Git to label specific commits, indicating significant changes, bug fixes, or feature enhancements introduced in the codebase.

oCollaborated with stakeholders to gather requirements, design data models, and define data ingestion and processing workflows, ensuring alignment with business objectives.

oCreated and maintained technical documentation in Confluence for projects, including architecture, table definitions, API specifications, and user guides.

oUtilized Scrum framework to plan and execute iterative development sprints, defining user stories, estimating effort, and prioritizing backlog items based on business value and customer feedback.

Environment: Python, Scala, SQL, PySpark, Scala Spark, Aurora, Databricks, Airflow, Snowflake, EMR, S3, EC2, Lambda, Athena, Oracle DB, DynamoDB, Kafka
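
The following is a minimal sketch of the Delta staging and incremental-update pattern mentioned in this role, expressed as a Delta Lake MERGE on Databricks; the staging path, target table, and key column are assumed names, not actual project objects.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-incremental-sketch").getOrCreate()

# Hypothetical staging load: new and changed records landed as a Delta staging table.
updates = spark.read.format("delta").load("s3://example-bucket/staging/customers/")

target = DeltaTable.forName(spark, "analytics.customers")  # assumed target table

# Incremental upsert: update matching keys, insert new ones.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

Keeping staging data in Delta lets the merge run transactionally, so reloads are idempotent rather than append-only.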

Client: Chubb Group of Insurance Co - India (IBM) Mar 2019 – Jul 2022

Role: Data Engineer

oImplemented a generic ETL framework with high availability for bringing related data for Hadoop & Cassandra from various sources using Spark.

oImplemented various data models for Cassandra.

oImplemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.

oImplemented Spark using PySpark and Spark SQL for faster testing and processing of data.

oImplemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access (an illustrative sketch follows this section).

oCreated and implemented various shell scripts for automating the jobs.

oExperience in extracting source data from Sequential files, XML files, and CSV files, transforming and loading it into the target data warehouse.

oImported data from MySQL to HDFS and vice versa using Sqoop, and configured the Hive metastore with MySQL to store the metadata for Hive tables.

oImplemented MapReduce jobs in Hive by querying the available data, and designed the ETL process by creating high-level design documents covering logical data flows, source data extraction, database staging, extract creation, source archival, job scheduling, and error handling.

oInvolved in the process of data acquisition, data pre-processing, and data exploration in Scala.

oUsed Hive UDFs to implement business logic in Hadoop and to read, write, and query Hadoop data stored in HBase.

oImported and exported streaming data into HDFS using stream processing platforms like Flume and Kafka messaging system.

oWrote SQL queries using joins and stored procedures, and used Maven to build and deploy applications to the JBoss application server within the software development lifecycle.

oUsed Cloudera Manager for continuous monitoring and managing of the Hadoop cluster for working updates, patches, and version upgrades as required.

oExperience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.

oUsed the Oozie workflow engine to run multiple Hive and Pig scripts, and used Kafka for real-time processing of data by loading log file data directly into HDFS.

oDeveloped data pipelines using Sqoop, Pig, and Hive to ingest customer data into HDFS to perform data analytics.

oImplemented Spark Core in Scala to process data in memory.

oInvolved in creating Spark applications in Scala using cache, map, and reduceByKey functions to process data.

Environment: Python, Scala, SQL, HDFS, Hive, Pig, Spark, PySpark, MapReduce, Sqoop, Data Lake, PowerShell, MongoDB, HBase, Oracle DB, MySQL, MS SQL, UNIX
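
Below is an illustrative PySpark sketch of the Hive partitioning and bucketing approach described in this role; the database, table, and column names are placeholders, and the bucket count would depend on data volume.

from pyspark.sql import SparkSession

# Hypothetical example of partitioning plus bucketing, using Spark's Hive support.
spark = (
    SparkSession.builder
    .appName("hive-partitioning-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

events = spark.read.parquet("/data/raw/events/")   # placeholder source path

# Allow dynamic partitions so each event_date value becomes its own partition.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

(
    events.write
    .partitionBy("event_date")       # partition pruning on date filters
    .bucketBy(32, "user_id")         # bucketing to speed up joins on user_id
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("analytics.events_bucketed")
)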

Client: Synopsis – India Jan 2017 – Mar 2019

Role: Data Engineer

oInstalled and configured Hadoop MapReduce and HDFS and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.

oUtilized Spark SQL for interactive querying and analysis of structured data, enabling quick decision-making and actionable insights for stakeholders (an illustrative sketch follows this section).

oWrote MapReduce jobs using Pig Latin.

oImported data using Sqoop to regularly load data from MySQL to HDFS.

oExperience writing Hive queries for data analysis to meet business requirements.

oExperience in managing and reviewing Hadoop log files.

oInvolved in loading data from the UNIX file system to HDFS and loaded and transformed large sets of structured, semi-structured, and unstructured data.

oImplemented various MapReduce jobs in custom environments and updated base tables by generating Hive queries.

oPerformed Hadoop cluster tasks such as adding and removing nodes without affecting running jobs or data.

oExperience installing, configuring, and using Hadoop ecosystem components like Hadoop, MapReduce, HDFS, HBase, Oozie, Hive, Kafka, Zookeeper, Spark, Sqoop, and Flume.

oResponsible for loading customer data and event logs into HBase using Java API.

oUsed Flume to collect log data from different sources and transferred it to Hive tables using different SerDes to store data in JSON, XML, and SequenceFile formats.

oUnderstanding of data integration tools such as Apache NiFi and Apache Kafka.

oExperience with data modeling, database design, and SQL database performance tuning for efficient data processing and analysis.

oExperience with NoSQL databases such as HBase, Cassandra, and MongoDB for Hadoop data storage and retrieval.

oDesigned and built specific databases for data collection, tracking, and reporting.

oDesigned, coded, tested, and debugged custom queries using Microsoft T-SQL and SQL Reporting Services.

Environment: Python, SQL, HDFS, Hive, MapReduce, Sqoop, Kafka, MySQL, Spark, PySpark, Impala, Pig, NiFi, Cassandra, HBase, MongoDB, Flume
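
The snippet below sketches the Spark SQL interactive-analysis pattern referenced in this role, registering semi-structured log data as a temporary view and querying it with standard SQL; the path, view name, and columns are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

logs = spark.read.json("hdfs:///data/logs/")       # semi-structured input (placeholder path)
logs.createOrReplaceTempView("web_logs")

# Interactive-style aggregation over the registered view.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    WHERE status = 200
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")

top_pages.show(truncate=False)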

Client: Logica - India Oct 2014 – Jan 2017

Role: SQL Developer

oGathered and analyzed project requirements, taking part in requirement-gathering sessions with the client.

oExperience in developing optimized SQL using best practices.

oExpertise in writing dynamic SQL, complex stored procedures, functions, and views.

oHands-on experience in performance tuning using execution plans, SQL Profiler, and DMVs.

oExperience in creating SSIS packages with error handling, and in enhancing and deploying SSIS packages from development to production servers.

oExperience in working with row, partially blocking and fully blocking transformations and adept at performance tuning of SSIS Packages.

oAdept at capturing the execution errors in SSIS for analysis purposes.

oExperience in building sync processes between two databases using SSIS, and in importing/exporting data between sources such as Oracle, XML, and Excel using the SSIS/DTS utility.

oExperienced in working with C#.NET in SSIS.

oExperience in deploying reports to the report server.

oCreated objects like tables and views and developed SSIS packages to load data.

oImplemented incremental loading into the target tables/views in SSIS (an illustrative sketch of the upsert logic follows this section).

oCreated SSIS packages for moving data between databases.

oMaintained error logging for the SSIS packages; data was moved across tables with one-to-many and many-to-one relationships.

oCreated Jobs, Alerts using SQL Server Mail Agent in SSIS.

Environment: T-SQL, C#.NET, VB.NET, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), Microsoft SQL Server 2008, Microsoft Team Foundation Server
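
The incremental loads in this role were built in SSIS; purely as an illustration of the underlying upsert logic, the sketch below expresses an equivalent incremental load as a T-SQL MERGE executed from Python via pyodbc, with hypothetical connection details and table and column names.

import pyodbc

# Illustrative only: connection string, tables, and columns are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlhost;DATABASE=SalesDW;Trusted_Connection=yes;"
)

merge_sql = """
MERGE dbo.DimCustomer AS target
USING staging.Customer AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED AND target.RowHash <> source.RowHash THEN
    UPDATE SET target.Name = source.Name,
               target.City = source.City,
               target.RowHash = source.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, City, RowHash)
    VALUES (source.CustomerID, source.Name, source.City, source.RowHash);
"""

with conn:
    conn.execute(merge_sql)   # pyodbc commits when the 'with' block exits cleanly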


