Navasahitha Inuganti
Big Data Engineer
Email: ************@*****.***
Phone: +1-314-***-****
https://www.linkedin.com/in/navasahitha-inuganti-4432ab25a/
PROFESSIONAL SUMMARY:
●8+ years of professional IT experience spanning project development, implementation, deployment, and maintenance, designing and implementing complete end-to-end Hadoop-based data analytics solutions with Big Data technologies.
●Hands-on experience with Unified Data Analytics on Databricks, the Databricks Workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL.
●Good understanding of Spark architecture with Databricks and Structured Streaming.
●As a Data Engineer, responsible for data cleaning, data modeling, data migration, design, and ETL pipeline preparation for both cloud and on-premises platforms.
●Set up Databricks on AWS and Microsoft Azure, including Databricks Workspaces for business analytics and cluster management in Databricks.
●Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
●Created Snowflake Schemas by normalizing the dimension tables as appropriate and creating a Sub Dimension named Demographic as a subset to the Customer Dimension.
●Hands-on experience with test-driven development (TDD), behavior-driven development (BDD), and acceptance test-driven development (ATDD) approaches.
●Managed databases and Azure data platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), as well as SQL Server, Oracle, and data warehouses; built multiple data lakes.
●Experience with Databricks as a platform in developing multiple Python, PySpark, Scala and SQL scripts to perform transformations/operations.
●Extensive experience in text analytics, generating data visualizations using SQL and Python, and creating dashboards with tools like Tableau and Power BI.
●Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy. Experience creating and running Docker images with multiple microservices.
●Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR), Hadoop, Python, and Spark, and effective use of Azure SQL Database, Hive, SQL, and PySpark to solve big data problems.
●Strong experience in Microsoft Azure Machine Learning Studio for data import/export, data preparation, exploratory data analysis, summary statistics, feature engineering, machine learning model development, and model deployment into server systems.
●Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
●Experienced in designing and deployment of Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Sqoop, Kafka, Spark with Cloudera distribution.
●Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka, Spark Streaming.
●Experience in data pipelines and all phases of ETL/ELT processing, converting big data/unstructured data sets (JSON, log data) into structured data sets.
●Experience in Spark-Scala programming with good knowledge of Spark architecture and its in-memory processing.
●Experience in designing and developing applications in Spark using Python to compare the performance of Spark with Hive.
●Worked in Kafka streaming environments, setting up configurations based on data volume, processing large data sets effectively, and reading/writing data from/to various databases.
●Developed and implemented database solutions using Azure SQL Data Warehouse and Azure SQL Database.
●Experienced in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python (see the sketch at the end of this summary).
●Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
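Illustrative sketch of the Hive/SQL-to-Spark conversion referenced above; the sales table and its region/amount columns are hypothetical examples, not from any specific project.

# Sketch: rewriting a Hive/SQL aggregation as an equivalent PySpark DataFrame transformation.
# The "sales" table and "region"/"amount" columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

# Original Hive/SQL form of the query.
sql_result = spark.sql("SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region")

# Equivalent DataFrame transformation.
df_result = (
    spark.table("sales")
         .groupBy("region")
         .agg(F.sum("amount").alias("total_amount"))
)

df_result.show()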
Technical Skills:
Cloud Technologies: AWS, Azure
Databases & Warehouses: Oracle 11g, MySQL, HBase, SQL Server, Teradata, MongoDB, Snowflake
Programming / Query Languages: Python, Spark, PySpark, Java, SQL, Scala, PL/SQL, Linux shell scripts
Data Engineer / Big Data / Cloud / Visualization / Other Tools: Databricks, Snowflake, Kafka, HDFS, Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, Oozie, Zookeeper, Azure Data Explorer, Azure HDInsight, NiFi, Linux, Tableau, Power BI, SAS, Crystal Reports
ETL Tools: Informatica, Talend
Data Orchestration Tools: Airflow, Oozie, Control-M
WORK EXPERIENCE:
Equifax, Atlanta, Georgia Jan 2021 - Present
Sr. Data Engineer
Project Description:
Setting up and optimizing data pipelines to collect and process large amounts of financial and credit data from various sources, such as trading systems, customer transactions, credit bureau data, and market data providers. Ensuring the security and integrity of sensitive financial and credit data, such as pay slips, by implementing data encryption and access controls. Helping to integrate credit data and analytics into the bank's systems and services, such as fraud detection, credit risk management, and customer onboarding processes.
Responsibilities:
●Configured, supported, and maintained all network, firewall, storage, load balancers, operating systems, and software in AWS EC2.
●Worked on Cloudera distribution and deployed on AWS EC2 Instances.
●Worked on integrating Apache Kafka with Spark Streaming process to consume data from external REST APIs and run custom functions.
●Built and configured AWS infrastructure using EC2, S3, DynamoDB, CloudWatch, EMR, and Auto Scaling.
●Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
●Developed Spark scripts using Scala and Python as per requirements.
●Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.
●Implemented Amazon EMR for big data processing across a Hadoop cluster of virtual servers backed by Amazon EC2 and S3.
●Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM, CloudFormation), focusing on high availability, fault tolerance, and auto scaling through AWS CloudFormation.
●Supported continuous storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances.
●Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
●Created S3 buckets and managed their policies, utilizing S3 and Glacier for storage and backup on AWS.
●Worked in an AWS-hosted Databricks environment and used Spark Structured Streaming to consume data from Kafka topics in real time and perform merge operations on Delta Lake tables (see the sketch at the end of this list).
●Used Airflow as the scheduling and orchestration tool for our data pipelines.
●Designed, developed, and implemented an ETL process to support Change Data Capture (CDC) on the Databricks platform.
●Used Snowflake extensively for ETL operations, importing data from Snowflake to S3 and from S3 to Snowflake.
●Automated data pipelines using Airflow and scheduled each pipeline to run at set times per business requirements.
●Managed AWS EC2 instances, utilizing S3 and Glacier for data archiving and long-term backup, and maintained UAT environments as well as infrastructure servers for Git.
●Involved in data processing using an ETL pipeline orchestrated by AWS Data Pipeline using Hive.
●Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it to add topics, partitions, etc.
●Wrote queries in SQL and R to extract, transform, and load (ETL) data from large datasets using data staging.
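Illustrative sketch of the Kafka-to-Delta Lake merge pattern described in the Structured Streaming bullet above; the broker, topic, table name, schema, and checkpoint path are hypothetical placeholders.

# Sketch: Spark Structured Streaming from Kafka with a Delta Lake MERGE per micro-batch.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("kafka-delta-merge").getOrCreate()

event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("balance", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
         .option("subscribe", "credit-events")                # hypothetical topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
         .select("e.*")
)

def upsert_to_delta(batch_df, batch_id):
    # Merge each micro-batch into the target Delta table on the business key.
    target = DeltaTable.forName(spark, "credit_accounts")     # hypothetical Delta table
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.account_id = s.account_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(events.writeStream
       .foreachBatch(upsert_to_delta)
       .option("checkpointLocation", "/tmp/checkpoints/credit_accounts")
       .start())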
Environment: SQL, Python, PySpark, Spark, Airflow, AWS, ETL, Git, Databricks, Snowflake, Hive, Kafka, CI/CD, Jenkins, EC2, S3, EMR, DynamoDB, Lambda, Data Analysis, Glue
Fidelity Investments - Boston, Massachusetts June 2018 – Dec 2020
Senior Data Engineer
Project Description:
To support project needs by designing, building, and maintaining data pipelines and ETL processes using Azure Data Factory and other big data analytic tools including Pig, Hive, HBase, Spark, and Sqoop. Implementing data governance and security policies to ensure data quality and protect sensitive information using Azure Policy and Azure Security Center. Building and maintaining data warehouses, data marts, and other data structures using MS SQL to support the company's reporting and analytics needs.
Responsibilities:
●Hands-on experience with Azure cloud services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, HDInsight, Azure Monitoring, Key Vault, and Azure Data Lake.
●Worked extensively on running Spark jobs in the Azure HDInsight environment.
●Used Spark as Data processing framework and have worked on performance tuning of the production jobs.
●Ingested data from MS SQL Server into Azure data storage.
●Worked on creating tabular models in Azure Analysis Services to meet business reporting requirements.
●Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
●Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
●Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
●Worked on analyzing the Hadoop cluster and different big data analytic tools including Pig, Hive, HBase, Spark, and Sqoop.
●Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
●Implemented auto balance and data reconciliation measures during data receipt, stage load and production load process.
●Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
●Performed validation and verified software across all testing phases, including Functional Testing, System Integration Testing, End-to-End Testing, Regression Testing, Sanity Testing, User Acceptance Testing, Smoke Testing, Disaster Recovery Testing, Production Acceptance Testing, and Pre-prod Testing.
●Developed Python scripts to parse JSON documents and load the data into the database.
●Generated various graphical capacity planning reports using Python packages such as NumPy and Matplotlib.
●Analyzed various generated logs and predicted/forecast the next occurrence of events using various Python libraries.
●Created Snowpipe for continuous data loading and loaded data from Azure Blob Storage.
●Developed Spark/PySpark/Python notebooks to transform, partition, and organize files in ADLS.
●Worked in Azure Databricks running Spark-Python notebooks to develop logic for various transformations.
●Used Databricks utilities (widgets) to pass parameters at run time from ADF to Databricks (see the sketch at the end of this list).
●Created Triggers, PowerShell scripts and the parameter JSON files for the deployments.
●Reviewed individual work on ingesting data into Azure Data Lake and provided feedback based on the reference architecture, naming conventions, guidelines, and best practices.
●Implemented end-to-end logging frameworks for Data Factory pipelines.
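Illustrative sketch of the widget-based parameter passing from ADF to a Databricks notebook mentioned above; the widget names, storage account, and paths are hypothetical placeholders (dbutils and spark are provided by the Databricks notebook runtime).

# Sketch (Databricks notebook cell): read parameters passed from an ADF Databricks activity via widgets.
from pyspark.sql.functions import lit

dbutils.widgets.text("run_date", "")           # populated by ADF at run time
dbutils.widgets.text("source_container", "")

run_date = dbutils.widgets.get("run_date")
source_container = dbutils.widgets.get("source_container")

# Build the input path from the parameters and organize the output by partition in ADLS.
input_path = f"abfss://{source_container}@storageacct.dfs.core.windows.net/raw/{run_date}/"
df = spark.read.json(input_path)

(df.withColumn("load_date", lit(run_date))
   .write.mode("overwrite")
   .partitionBy("load_date")
   .parquet("abfss://curated@storageacct.dfs.core.windows.net/events/"))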
Environment: SQL, Python, Spark, PySpark, ETL, Azure, Databricks, ADF, Data Lake, MS SQL Server, Teradata, Snowflake, HDFS, RDBMS, Sqoop, Hive
Chewy - Dania Beach, FL May 2016 – June 2018
Data Engineer
Responsibilities:
●Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
●Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
●Wrote MapReduce programs and Hive UDFs in Java; also developed Java MapReduce programs to analyze sample log files stored in the cluster.
●Responsible for estimating the cluster size and monitoring and troubleshooting the Spark Databricks cluster.
●Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.
●Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
●Created many Spark UDFs and Hive UDAFs for functions not pre-existing in Hive and Spark SQL (see the sketch at the end of this list).
●Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
●Implemented various performance optimization techniques, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
●Responsible for designing logical and physical data models for various data sources on Amazon Redshift.
●Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift.
●Performed data validation between data present in the data lake and the S3 bucket.
●Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
●Designed batch processing jobs using Apache Spark to increase speeds by ten-fold compared to that of MR jobs.
●Used Kafka for real-time data ingestion and created different topics for reading the data in Kafka.
●Created database objects like stored procedures, UDFs, triggers, indexes, and views using T-SQL in both OLTP and relational data warehouses in support of ETL.
●Developed complex ETL Packages using SQL Server 2008 Integration Services to load data from various sources like Oracle/SQL Server/DB2 to Staging Database and then to Data Warehouse.
●Created report models from cubes as well as relational data warehouse to create ad-hoc reports and chart reports.
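Illustrative sketch of registering a custom Spark SQL UDF where no built-in function existed, as mentioned above; the function, table, and column names are hypothetical placeholders.

# Sketch: register a Python UDF for use in Spark SQL over Hive tables.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("custom-udf").enableHiveSupport().getOrCreate()

def mask_account(acct):
    # Keep only the last four characters of an account number; handle NULLs safely.
    return None if acct is None else "*" * max(len(acct) - 4, 0) + acct[-4:]

spark.udf.register("mask_account", mask_account, StringType())

# The UDF is now usable from Spark SQL / HiveQL-style queries.
spark.sql("SELECT mask_account(account_number) AS masked_account FROM orders").show()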
Wipro - India March 2014 – Jan 2016
Data Engineer
Responsibilities:
●Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie and Impala.
●Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
●Involved in loading data from UNIX file system to HDFS. Installed and configured Hive and written Hive UDFs. Importing and exporting data into HDFS and Hive using Sqoop.
●Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
●Worked hands on with ETL process using Informatica.
●Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
●Extracted data from Teradata into HDFS using Sqoop. Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
●Wrote various data normalization jobs for new data ingested into Redshift (see the sketch at the end of this list).
●Migrated the on-premises database structure to the Amazon Redshift data warehouse.
●Analyzed the data by performing Hive queries and running Pig scripts to know user behavior. Exported the patterns analyzed back into Teradata using Sqoop.
●Partner with technical and non-technical resources across the business to leverage their support and integrate our efforts.
●Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
●Performed data analysis using regression, data cleaning, Excel VLOOKUP, histograms, and the TOAD client, and presented the analysis with suggested solutions for investors.
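Illustrative sketch of the kind of normalization job run before loading data into Redshift, as mentioned above; paths and column names are hypothetical placeholders, and the actual warehouse load would happen in a separate COPY or JDBC step.

# Sketch: normalize staged records ahead of a Redshift load.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("redshift-normalization").getOrCreate()

raw = spark.read.csv("hdfs:///staging/customers/", header=True)

normalized = (
    raw.withColumn("email", F.lower(F.trim(F.col("email"))))
       .withColumn("signup_date", F.to_date(F.col("signup_date"), "MM/dd/yyyy"))
       .dropDuplicates(["customer_id"])
)

# Write the normalized output back to HDFS; a separate step (e.g., Redshift COPY) loads it into the warehouse.
normalized.write.mode("overwrite").parquet("hdfs:///curated/customers/")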
Education:
Bachelor of Technology in Computer Science, KL University, India, 2014